Improved statistical methodology for high-throughput omics data analysis

Abstract: Over the last two decades, the advent of high-throughput omics technology has revolutionized biological and biomedical research. The rapid development of sequencing techniques has produced a large volume of omics data, and researchers have developed a wide range of computational tools to manage and analyze it. Although these tools have enabled significant discoveries, processing and interpreting omics data efficiently and accurately remains a major challenge. In this thesis, we aim to develop novel statistical methodologies and algorithms for omics data analysis. We apply the methods to both simulated and real data from different types of cancers. Based on evaluation and comparison with existing tools, we find that our methods achieve higher accuracy and better performance in analyzing different types of omics data.

In Study I, we build an analysis pipeline to integrate multiple levels of omics data and identify potential driver genes in neuroblastoma. The pipeline employs gene expression profiles, microarray-based comparative genomic hybridization data, and a functional gene interaction network to detect cancer-related driver genes. We identify a total of 66 patient-specific and four common driver genes. The genes are summarized into a driver-gene score (DGscore) for each patient. We find that patients with a low DGscore have better survival than those with a high DGscore (p-value = 0.006).

In Study II, we develop a novel method named XAEM to quantify isoform-level expression using RNA sequencing data. The method has two major components. First, we construct a design matrix X as the starting parameter in the quantification model. Second, we use an alternating Expectation Maximization algorithm to estimate the design matrix X and the isoform expression b iteratively. We compare XAEM with several quantification methods using both simulated and real data. The results show that XAEM achieves higher accuracy in multiple-isoform genes and obtains substantially better rediscovery rates in the differential-expression analysis.

In Study III, we extend the algorithm from Study II and develop an approach named MAX to quantify mutant-allele expression at the isoform level. For a given gene and a list of mutations, we first generate the mutant reference by incorporating all possible mutant isoforms from the wild-type isoform. The alternating Expectation Maximization algorithm is then applied to estimate the isoform abundance. We apply MAX to a real dataset of acute myeloid leukemia. Using the mutant-allele expression, we discover a subgroup of NPM1-mutated patients that shows a better drug response to a kinase inhibitor.

In Study IV, we build a pipeline to detect fusion genes at the DNA level using whole-exome sequencing data. The pipeline is applied to three comprehensive datasets of acute myeloid leukemia and prostate cancer patients. Comparing against the detection results from RNA sequencing data, we find that several major fusion events in these two cancer types are validated in some of the patients. However, the overall results indicate that it is challenging to identify chimeric genes using exome sequencing data due to its inherent limitations.

Altogether, we have developed several statistical and bioinformatics tools to analyze different types of omics data, which demonstrate higher accuracy and better performance than other competing approaches. The results in this thesis will provide novel insights into omics data analysis and facilitate significant discoveries in cancer research.
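
To make the alternating estimation idea from Study II more concrete, the following is a minimal sketch, not the XAEM implementation itself. It assumes that read counts Y (read classes by samples) follow a Poisson model with mean X @ B, where X is the design matrix and B holds the isoform expression of each sample, and it uses the standard multiplicative EM update for that model. The function names, the Poisson assumption, and the column normalization of X are illustrative choices that the abstract does not specify.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero


def poisson_loss(Y, M):
    """Generalised KL divergence between observed counts Y and fitted means M."""
    return float(np.sum(Y * np.log((Y + EPS) / (M + EPS)) - Y + M))


def alternating_em(Y, X0, n_iter=200):
    """Alternately update B (expression) and X (design matrix) for Y ~ Poisson(X @ B).

    Y  : (read classes x samples) array of counts.
    X0 : (read classes x isoforms) starting design matrix; zero entries stay zero.
    Returns the fitted X (columns sum to 1), B, and the loss after each iteration.
    """
    # Normalise each column of the starting design matrix to sum to 1.
    X = X0 / np.maximum(X0.sum(axis=0), EPS)
    n_iso = X.shape[1]
    # Start with each sample's total count split equally across isoforms.
    B = np.tile(Y.sum(axis=0) / n_iso, (n_iso, 1))
    losses = [poisson_loss(Y, X @ B)]
    for _ in range(n_iter):
        # Step 1: hold X fixed, EM update of the isoform expression B.
        ratio = Y / (X @ B + EPS)
        B = B * (X.T @ ratio) / np.maximum(X.sum(axis=0)[:, None], EPS)
        # Step 2: hold B fixed, EM update of the design matrix X.
        ratio = Y / (X @ B + EPS)
        X = X * (ratio @ B.T) / np.maximum(B.sum(axis=1)[None, :], EPS)
        # Rescale so columns of X sum to 1; X @ B is unchanged by this step.
        scale = np.maximum(X.sum(axis=0), EPS)
        X, B = X / scale, B * scale[:, None]
        losses.append(poisson_loss(Y, X @ B))
    return X, B, losses
```

In this sketch, zeros in the starting matrix are preserved by the multiplicative updates, so the starting design matrix also fixes which isoforms are allowed to contribute to each read class. Several samples are needed for the update of X to be informative, since X is shared across samples while B is estimated per sample.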
