Phasing single DNA molecules with barcode linked sequencing

Abstract: Over the past decade, elucidation of our genetic constituents has predominantly taken the form of short-read DNA sequencing. Revolutionary technology developments have enabled vast amounts of biological information to be obtained, but from a medical standpoint the technology has yet to live up to the promise of associating individual genotypes with phenotypic states of widespread clinical relevance. The mechanisms by which complex phenotypes arise have been difficult to ascertain, and the value of short-read sequencing platforms has been limited in this regard. It has become evident that resolving the full spectrum of genetic heterogeneity requires accurate long-range information that allows individual haplotypes to be distinguished. Long-range haplotype information can be obtained experimentally by long-read sequencing platforms or by linking short sequencing reads through a common barcode. This thesis explores these solutions, primarily through the development of novel technologies to phase short sequences of single molecules using DNA barcoding. A new method for high-throughput phasing of single DNA molecules, achieved by the production and utilization of uniquely barcoded beads in emulsion droplets, is described in Paper I. The results confirm that complex libraries of beads featuring mutually exclusive barcodes can be generated through clonal PCR amplification, and that these beads can be used to phase variations of the 16S rRNA gene, reducing ambiguity when classifying bacterial species in metagenomics. Paper II describes a second methodology ('Droplet Barcode Sequencing') which simplifies the concept of barcoding DNA fragments by omitting the need for beads and instead relying on clonal amplification of single barcoding oligonucleotides.
This study also increases the amount of information that can be linked, which is showcased by phasing all exons of the HLA-A gene and successfully resolving all the alleles present in a sample pool of eight individuals. Paper III expands on this work and explores the use of a single-molecule sequencing platform to provide full-length sequencing coverage of six genes of the HLA family. The results show that while genes shorter than 10 kb can be resolved with a high degree of accuracy, compensating for a relatively high error rate by means of increased coverage can be challenging for larger genomic loci. Finally, Paper IV introduces the use of barcode-linked reads on an unprecedented scale, with a new assay that enables low-cost haplotyping of whole genomes without the need for predetermined capture sequences. This technology is used to generate a haplotype-resolved human genome, call large-scale structural variants, and perform reference-free assembly of bacterial and human genomes. At a cost of only USD 19 per sample, this technology makes the benefits of long-range haplotyping available to the vast majority of laboratories that currently rely solely on short-read sequencing platforms.
