Patterns in big data bioinformatics: Understanding complex diseases with interpretable machine learning

Abstract: Alterations in the flow of genetic information may lead to complex diseases. Such changes are measured with various omics techniques that usually produce so-called “big data”. Using interpretable machine learning (ML), we retrieved patterns from transcriptomics data sets. Specifically, we employed rule-based ML to identify associations between features and a decision in a combinatorial manner, i.e. co-prediction. We developed tools and methods that can be applied by a large community of bioinformaticians and demonstrated their usability in a variety of studies.

In paper I, we developed the R.ROSETTA package, which provides an environment for rule-based ML relying on rough sets. In essence, R.ROSETTA is an R wrapper of the ROSETTA toolkit; however, it extends the toolkit's functions with various analytical solutions. The package was tested on a microarray gene expression case-control study of autism. The estimated models were highly accurate and provided lists of possible interactions among genes. Moreover, benchmarking revealed that R.ROSETTA was among the best-performing rule- and decision tree-based methods.

In paper II, we applied R.ROSETTA together with the VisuNet package. We used both tools to perform a rule-based network analysis of autism spectrum disorder (ASD) subtypes. Here, we used microarray-based gene expression measurements of ASD patients and controls from three data sets. We demonstrated that rule-based modelling is an efficient approach to merging multiple cohorts. Furthermore, we estimated centrality distances among the resulting subnetworks, which revealed dissimilarities between ASD subtypes and controls. Finally, we discovered a highly probable interaction between the EMC4 and TMEM30A genes.

In paper III, we used our tools to perform an RNA-seq-based gene expression analysis of acute myeloid leukemia (AML). We aimed to discover gene expression patterns that differ between AML diagnosis and relapse. Specifically, we applied a rule-based network analysis to validate independent cohorts. Our study revealed that overexpressed CD6 and underexpressed INSR are highly co-predictive genes associated with AML relapse. Finally, we demonstrated arc diagrams as a novel way of visualizing co-predictors.

In paper IV, we analyzed glioma grading by performing a comprehensive ML analysis of RNA-seq data sets. We extensively preprocessed the data sets and removed a strong batch effect that occurred between glioma grades. Afterwards, we performed an ML evaluation on single-sample gene set enrichment scores, which revealed the most accurate collections and annotations for distinguishing glioma grades. Among others, we found cell cycle, Fanconi anemia and cholesterol-related pathways to be associated with glioma progression. Finally, we discovered several co-enrichment mechanisms among annotations.
