Computational exploration of human genome variation

University dissertation from Stockholm : Karolinska Institutet, Center for Genomics Research

Abstract: In studies of human genome variation, researchers attempt to identify the DNA sequence differences between our genomes that contribute substantially to variation observable on a physical level (phenotype). Genetic variation can also be used to study population dynamics in human history, and how evolutionary forces have shaped the human genome. These endeavors require comprehensive data resources and computational tools to facilitate directed data generation on the scale necessary for detection of biologically relevant signals. Data management systems are also necessary to store, compare and interpret the collective data masses created by researchers in the field. This thesis describes the development of computational resources and algorithms for improving the efficiency of studies into human genome variation, with a focus on single nucleotide polymorphisms (SNPs). The issues addressed are: i) databases emphasizing quality assurance, improved annotation and portable formats, ii) assay design for improved PCR and hybridization reactions through a consideration of DNA secondary structure, iii) SNP selection using a combination of in silico methods for prediction of the functional impact of SNPs and evidence of positive selection to identify sequence differences that may be disruptive to a living cell, and possibly cause disease, iv) genome structure in a comprehensive study into the dynamics of duplicated segments and how they affect SNP genotyping. Building on previous work around a database for human sequence variation within genes (Hgbase), a scalable database and accompanying portable data formats for all human sequence variation were developed (HGVbase). Following the availability of the complete human genome draft, the sequence variations were layered on top of this scaffold, and annotation and search capabilities were vastly improved.
Going forward, systems were constructed to capture genotypes, haplotypes, phenotypes and the (complex) relations between them. This new information increases our ability to extract and prioritize biologically interesting subsets of data. In conjunction with new genotyping technology, high-throughput genotyping assay design software facilitated studies encompassing large numbers of SNPs. This capacity was leveraged in creating a set of validated SNP markers in coding regions of genes for subsequent use in studies into disease genetics and population genetics. The genotyping technology was developed further to enable a genome-wide study of single base variants in duplicated segments. The study led to the discovery of a form of sequence variation that we termed "Multi-site Variation" (MSV). MSV explains a large fraction of the observed increase of predicted polymorphism in duplicated sequence and indicates considerable copy number variation in the human genome. MSVs are able to masquerade as SNPs when genotyped with standard methods in population samples. Unfortunately, this is not always revealed by tests for Hardy-Weinberg disequilibrium or Mendelian inheritance.
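As an illustration of the Hardy-Weinberg check mentioned above, the following is a minimal sketch of a Pearson chi-square test on genotype counts at a biallelic site. It is not taken from the thesis: the function name `hwe_chi_square` and the example counts are hypothetical, and the 3.84 cutoff is simply the conventional 5% critical value for one degree of freedom.

```python
def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Pearson chi-square statistic for deviation from Hardy-Weinberg proportions.

    n_aa, n_ab, n_bb are observed counts of the three genotypes at a
    biallelic site (hypothetical inputs, for illustration only).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype is required")
    p = (2 * n_aa + n_ab) / (2 * n)  # estimated frequency of allele A
    q = 1.0 - p                      # estimated frequency of allele B
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation to avoid division by zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Assumed cutoff: chi-square critical value at alpha = 0.05, 1 degree of freedom.
CRITICAL_VALUE = 3.84

balanced = hwe_chi_square(25, 50, 25)   # counts in exact HWE proportions
all_het = hwe_chi_square(0, 100, 0)     # every sample called heterozygous
print(balanced > CRITICAL_VALUE, all_het > CRITICAL_VALUE)
```

A site where every sample is called heterozygous, as can happen when two fixed paralogous copies differ at one base, is flagged by this test. A variant in a duplicated segment whose apparent genotype counts happen to fall near Hardy-Weinberg proportions is not, which is consistent with the observation that such checks do not always reveal MSVs.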
