Exploring human variations by droplet barcoding

Abstract: Biological variations are being explored at ever-increasing rates through the rapid advancement of analytical techniques. Techniques like massively parallel sequencing empower scientists to accurately distinguish individuals’ genetic compositions, characterize cellular functions, and tell healthy tissue from diseased. The knowledge gained from these techniques brings us ever closer to grasping the complexities of life, contributing to human development. Still, fully elucidating biological variations across samples requires novel, sensitive, high-throughput techniques capable of placing every observation in its correct context. One technique showing increasing promise is droplet barcoding. Droplet barcoding uses emulsion droplets to segregate samples into their functional components, and attaches a barcode to the molecules in each droplet so that tagged molecules can be grouped after sequencing. This makes it a versatile tool for studying biological variations in both phenotype and genotype.

This thesis leverages droplet barcoding to explore variations relating to human biology. Droplet barcoding was used to study phenotype variations by examining protein compositions in single extracellular vesicles (Paper I) and single cells (Paper II). Paper I studies extracellular vesicles, which are naturally released from cells and carry heterogeneous protein signatures that can inform about their cellular origin. Tens of thousands of extracellular vesicles were profiled, including approximately 25,000 from lung cancer patients. From these protein profiles, extracellular vesicles could be grouped into putative subtypes. Paper II presents a novel method for studying single cells, which was used to characterize blood-derived immune cells. The method enabled the identification of most major immune cell lineages.

Haplotype-resolved genetic variations were analyzed using a linked-read sequencing method based on droplet barcoding.
Linked-read sequencing conserves long-range information from short-read sequencing by co-barcoding subsections of long DNA fragments. Paper III presents an open-source pipeline (BLR) for whole-genome haplotyping using linked reads. BLR generates accurate and continuous haplotypes, outperforming PacBio HiFi-based diploid assembly. We further show that integration with low-coverage long-read data can improve phasing accuracy in tandem repeats. With 10X Genomics linked reads, BLR generated more continuous haplotypes than other workflows. Paper IV applies linked-read sequencing to reveal the haplotype complexities of cancer genomes. In two patients with colorectal cancer, we identified several large-scale aberrations impacting cancer-related genes. Additionally, several short somatic variants were found to impact nearly all oncogenic networks identified by TCGA. One patient exhibited two nonsense mutations on separate haplotypes in the well-known colorectal cancer gene APC, demonstrating the importance of haplotype-resolved analysis for cancer genomics.
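
To illustrate how co-barcoding can conserve long-range information, the following is a minimal Python sketch, not the BLR implementation. It assumes reads are already aligned and reduced to hypothetical (barcode, chromosome, position) tuples. The function names and the 50 kb `max_gap` threshold are illustrative assumptions, not values from the thesis. Reads sharing a barcode are grouped, and nearby reads within one barcode are merged into a putative source molecule.

```python
from collections import defaultdict


def group_reads_by_barcode(reads):
    """Group (barcode, chrom, pos) read records by their barcode."""
    groups = defaultdict(list)
    for barcode, chrom, pos in reads:
        groups[barcode].append((chrom, pos))
    return dict(groups)


def infer_molecules(positions, max_gap=50_000):
    """Merge one barcode's reads into putative source molecules.

    Reads on the same chromosome that lie within `max_gap` bases of the
    previous read are assumed to come from the same long DNA fragment.
    The threshold is an illustrative assumption.
    """
    molecules = []
    for chrom, pos in sorted(positions):
        last = molecules[-1] if molecules else None
        if last and last["chrom"] == chrom and pos - last["end"] <= max_gap:
            # Extend the current molecule with this read.
            last["end"] = pos
            last["n_reads"] += 1
        else:
            # Start a new molecule.
            molecules.append(
                {"chrom": chrom, "start": pos, "end": pos, "n_reads": 1}
            )
    return molecules


# Toy input: hypothetical barcodes and coordinates, for illustration only.
reads = [
    ("ACGT", "chr1", 1_000),
    ("ACGT", "chr1", 12_000),
    ("TTAG", "chr2", 5_000),
    ("ACGT", "chr1", 30_000),
    ("ACGT", "chr1", 900_000),
    ("TTAG", "chr2", 9_500),
]

by_barcode = group_reads_by_barcode(reads)
molecules = {bc: infer_molecules(pos) for bc, pos in by_barcode.items()}
```

In this toy input, the three nearby reads under barcode `ACGT` are linked into one molecule, while the distant read at position 900,000 starts a second one. Variants covered by reads from the same molecule can then be assigned to the same haplotype, which is the kind of long-range link that short reads alone do not provide.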
