A multivariate approach to computational molecular biology

University dissertation from Umeå : Kemi

Abstract: This thesis describes the application of multivariate methods to the analysis of genomic DNA sequences, gene expression and protein synthesis, which together represent the steps of the central dogma of biology. The recent completion of large sequencing projects has given us a definable core of genetic data and large-scale methods for the dynamic quantification of gene expression and protein synthesis. However, in order to gain meaningful knowledge from such data, appropriate data analysis methods must be applied.

The multivariate projection methods principal component analysis (PCA) and partial least squares projection to latent structures (PLS) were used for clustering and multivariate calibration of data. By combining results from these and other statistical methods with interactive visualisation, valuable information was extracted and further interpreted.

We analysed genomic sequences by combining multivariate statistics with cytological observations and full genome annotations. All DNA oligomers of length two (16 dimers), three (64 trimers), four (256 tetramers), five (1024 pentamers) and six (4096 hexamers) were separately counted and normalised, and their distributions in the chromosomes of three Drosophila genomes were studied using PCA. Using this strategy, sequence signatures responsible for the differentiation of chromosomal elements were identified and related to previously defined biological features. We also developed a publicly available tool for interactively analysing single nucleotide polymorphism data and for visualising annotations and linkage disequilibrium.

PLS was used to investigate the relationships between weather factors and gene expression in field-grown aspen leaves. By interpreting the PLS models, it was possible to predict whether genes were mainly environmentally or developmentally regulated. Based on a PCA model calculated from seasonal gene expression profiles, different phases of the growing season were identified as distinct clusters.
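The oligomer counting step in the genomic sequence analysis can be illustrated with a minimal sketch. It assumes that "normalised" means relative frequency within each sequence fragment and that windows containing ambiguous bases are skipped; the abstract does not specify either choice. The function names and the toy fragments are hypothetical. Each resulting row of frequencies is the kind of observation that could then be passed to a PCA.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k):
    """Return relative frequencies of all 4**k DNA k-mers in one sequence.

    Assumption: normalisation is by the number of valid windows; the
    abstract does not state which normalisation was actually used.
    """
    sequence = sequence.upper()
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Only windows made of A, C, G, T contribute (e.g. windows with N are skipped).
    total = sum(counts[kmer] for kmer in all_kmers)
    if total == 0:
        return {kmer: 0.0 for kmer in all_kmers}
    return {kmer: counts[kmer] / total for kmer in all_kmers}


def kmer_table(fragments, k):
    """Build one row of k-mer frequencies per named fragment."""
    return {name: kmer_frequencies(seq, k) for name, seq in fragments.items()}


# Hypothetical toy fragments, not data from the thesis.
fragments = {"frag_a": "ACGTACGTAC", "frag_b": "AAAAAAANAA"}
table = kmer_table(fragments, 2)
print(len(table["frag_a"]))  # 16 dinucleotides per row
```

Calling `kmer_table` with `k` from 2 to 6 gives rows with 16, 64, 256, 1024 and 4096 variables, matching the oligomer counts listed above.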
In addition, a publicly available dataset with gene expression values for 7070 genes was analysed by PLS to classify tumour types. All samples in a training set and an external test set were correctly classified. To interpret these results, a method was applied to obtain a cut-off value for deciding which genes could be of interest for further studies.

Potential biomarkers for the efficacy of radiation treatment of brain tumours were identified by combining quantification of protein profiles by SELDI-MS-TOF with multivariate analysis using PCA and PLS. We were also able to differentiate brain tumours from normal brain tissue based on protein profiles, and observed that radiation treatment slows down the development of tumours at a molecular level.

By applying a multivariate approach to the analysis of biological data, information was extracted that would be impossible or very difficult to acquire with traditional methods. The next step in a systems biology approach will be to perform a combined analysis in order to elucidate how the different levels of information are linked together to form a regulatory network.
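As an illustration of how PLS can be used for classification, as in the tumour-type analysis above, the following is a minimal sketch of a one-component PLS1 model fitted to a 0/1 class label. It is not the model used in the thesis: the 0/1 coding, the 0.5 decision threshold, the restriction to a single component, the function names and the toy data are all assumptions made for the example.

```python
def fit_pls1(X, y):
    """Fit a one-component PLS1 model.

    X is a list of samples (rows of variables), y a list of 0/1 labels.
    Assumption: variables are mean-centred but not scaled.
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: the direction in X with maximal covariance with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Scores: projection of each centred sample onto the weight vector.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    # Inner relation: regress centred y on the scores.
    b = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return {"x_mean": x_mean, "y_mean": y_mean, "w": w, "b": b}


def predict_class(model, row, threshold=0.5):
    """Predict class 1 if the fitted y value exceeds the threshold."""
    score = sum(
        (x - m) * w for x, m, w in zip(row, model["x_mean"], model["w"])
    )
    y_hat = model["y_mean"] + model["b"] * score
    return 1 if y_hat > threshold else 0


# Hypothetical toy expression values: 4 training samples, 3 variables.
X_train = [[5.0, 1.0, 2.0], [4.5, 1.2, 2.1], [1.0, 4.8, 2.0], [1.2, 5.1, 1.9]]
y_train = [0, 0, 1, 1]
model = fit_pls1(X_train, y_train)

# External test samples, kept separate from the training set.
X_test = [[4.8, 0.9, 2.0], [0.9, 5.0, 2.1]]
predictions = [predict_class(model, row) for row in X_test]
print(predictions)
```

In this sketch the magnitudes of the entries in `model["w"]` indicate which variables contribute most to the separation; how a cut-off on such values should be chosen is not specified here and is left open.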
