Enabling massive genomic and transcriptomic analysis

University dissertation from Stockholm : KTH Royal Institute of Technology

Abstract: In recent years there have been tremendous advances in our ability to rapidly and cost-effectively sequence DNA. This has revolutionized the fields of genetics and biology, leading to a deeper understanding of the molecular events in life processes. The rapid advances have enormously expanded sequencing opportunities and applications, but also imposed heavy strains on steps prior to sequencing, as well as the subsequent handling and analysis of the massive amounts of sequence data that are generated, in order to exploit the full capacity of these novel platforms. The work presented in this thesis (based on six appended papers) has contributed to balancing the sequencing process by developing techniques to accelerate the rate-limiting steps prior to sequencing, facilitating sequence data analysis and applying the novel techniques to address biological questions. Papers I and II describe techniques to eliminate expensive and time-consuming preparatory steps through automating library preparation procedures prior to sequencing. The automated procedures were benchmarked against standard manual procedures and were found to substantially increase throughput while maintaining high reproducibility. In Paper III, a novel algorithm for fast classification of sequences in complex datasets is described. The algorithm was first optimized and validated using a synthetic metagenome dataset and then shown to enable faster analysis of an experimental metagenome dataset than conventional long-read aligners, with similar accuracy. Paper IV presents an investigation of the molecular effects on the p53 gene of exposing human skin to sunlight during the course of a summer holiday. There was evidence of previously accumulated persistent p53 mutations in 14% of all epidermal cells. Most of these mutations are likely to be passenger events, as the affected cell compartments showed no apparent growth advantage. An annual rate of 35,000 novel sun-induced persistent p53 mutations was estimated to occur in sun-exposed skin of a human individual. Paper V assesses the effect of using RNA obtained from whole cell extracts (total RNA) or cytoplasmic RNA on quantifying transcripts detected in subsequent analysis. Overall, more differentially detected genes were identified when using the cytoplasmic RNA. This appears to be mainly due to the reduced complexity of cytoplasmic RNA, but also, at least partly, to the nuclear retention of transcripts with long, structured 5’- and 3’-untranslated regions or long protein coding sequences. The last paper, VI, describes whole-genome sequencing of a large, consanguineous family with a history of Leber hereditary optic neuropathy (LHON) on the maternal side. The analysis identified new candidate genes, which could be important in the aetiology of LHON. However, these candidates require further validation before any firm conclusions can be drawn regarding their contribution to the manifestation of LHON.