Statistical assessment of genomic variability in tumours and bacterial communities

Abstract: Current high-throughput DNA sequencing technologies can generate large amounts of high-resolution genomic data. However, the high dimensionality of the data, combined with the substantial levels of technical error and biological variability typically present, makes the analysis challenging. Tailored statistical methods are therefore crucial for reaching valid biological conclusions. In this thesis, such methods were developed and applied to address research questions in biology and medicine. First, a method for identifying tumour-specific (somatic) mutations was developed, which included steps for noise reduction, sensitive detection of DNA alterations and removal of systematic errors. In Paper I, the method was applied to exome-sequenced paired normal–tumour samples from pheochromocytoma patients. A significantly higher mutation rate was found in malignant than in benign tumours, and three genes with recurrent somatic mutations, found exclusively in malignant tumours, were identified. In Papers II and III, somatic mutations were identified in patients with acute myeloid leukemia and evaluated as biomarkers in personalised deep sequencing analysis of remaining cancer cells after treatment. In Paper III, a statistical model correcting for position-specific errors in the data was developed and shown to provide superior sensitivity compared to standard techniques. In Paper IV, clinically relevant molecular subgroups of metastatic small intestinal neuroendocrine tumours were identified based on miRNA gene expression profiles. Survival analysis and subsequent validation suggested miR-375 as a prognostic biomarker. In Paper V, a hierarchical Bayesian model for detecting nucleotide-level differences between microbial communities was proposed. By including between-sample variability and using a shrinkage approach, the model performed well both with few samples and with high biological variability.
Finally, the model was used to detect antibiotic resistance mutations in bacteria. This thesis demonstrates that dedicated statistical analysis and knowledge of the underlying error structure present in high-dimensional biological data are important for enabling accurate interpretation and sound conclusions.
