Statistically Guided Visualization and Exploratory Analysis of Omics Data

Abstract: This thesis deals with methods for extracting robust and relevant information from high-dimensional data sets, and with statistically guided visualization methods for representing the data in an informative and easily accessible way. High-dimensional data sets are becoming increasingly prevalent in many scientific disciplines. In this thesis, we focus particularly on so-called "omics" data. The "omics" suffix is often used to denote biological research fields where the aim is to study relations and interactions within entire systems of biological entities, such as genes or proteins. The thesis is based on five papers. In the first two papers, we develop a method for stabilizing rankings of variables or variable sets obtained from an experiment. The stabilization effect is achieved by incorporating information concerning the exchangeability of variable pairs into the ranking. We propose a general framework for the representation of variable lists, into which the variable pair exchangeabilities can be easily incorporated and which allows straightforward comparison of any two lists. In the third paper, we consider relevant dimension reduction of high-dimensional data sets and propose a new dissimilarity measure that can be used within the Multidimensional Scaling framework to obtain a low-dimensional representation of a data set. The proposed dissimilarity measure treats the variables and experimental units of the data jointly and symmetrically, and yields a low-dimensional representation where patterns encoded by small groups of variables or units are more readily visible than with conventional methods such as Principal Component Analysis. The fourth paper provides a straightforward and intuitively appealing criterion for variable subset evaluation in the context of visualization. Finally, in the fifth paper we apply multivariate, correlation-based algorithms to integrate different types of high-dimensional genomic data. We show that by shifting the focus from maximizing the covariance toward maximizing the correlation between the extracted patterns, we can extract more biologically relevant knowledge. The focus shift is made possible by considering the dual formulation of the applied methods, which in this case is more computationally efficient.
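The distinction between covariance and correlation mentioned at the end of the abstract can be illustrated with a small sketch. This is not the algorithm used in the fifth paper; it is only a toy example, with hypothetical data and helper names, of the general fact that covariance grows with the scale of a variable while correlation does not. A criterion that maximizes covariance can therefore be dominated by high-variance variables.

```python
import math

def covariance(x, y):
    """Sample covariance of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical toy measurements (not taken from the thesis).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Same variable as x, expressed on a scale ten times larger.
x_scaled = [10.0 * value for value in x]

cov_xy = covariance(x, y)
cov_scaled = covariance(x_scaled, y)    # grows by the scale factor
corr_xy = correlation(x, y)
corr_scaled = correlation(x_scaled, y)  # unchanged by rescaling

print("covariance:", cov_xy, "->", cov_scaled)
print("correlation:", corr_xy, "->", corr_scaled)
```

Rescaling `x` multiplies the covariance by the same factor but leaves the correlation as it was, which is one reason the two criteria can extract different patterns from the same data.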
