Fusing Domain Knowledge with Data : Applications in Bioinformatics

Abstract: Massively parallel measurement techniques can be used for generating hypotheses about the molecular underpinnings of a biological systems. This thesis investigates how domain knowledge can be fused to data from different sources in order to generate more sophisticated hypotheses and improved analyses. We find our applications in the related fields of cell cycle regulation and cancer chemotherapy. In our cell cycle studies we design a detector of periodic expression and use it to generate hypotheses about transcriptional regulation during the course of the cell cycle in synchronized yeast cultures as well as investigate if domain knowledge about gene function can explain whether a gene is periodically expressed or not. We then generate hypotheses that suggest how periodic expression that depends on how the cells were perturbed into synchrony are regulated. The hypotheses suggest where and which transcription factors bind upstreams of genes that are regulated by the cell cycle. In our cancer chemotherapy investigations we first study how a method for identifiyng co-regulated genes associated with chemoresponse to drugs in cell lines is affected by domain knowledge about the genetic relationships between the cell lines. We then turn our attention to problems that arise in microarray based predictive medicine, were there typically are few samples available for learning the predictor and study two different means of alleviating the inherent trade-off betweeen allocation of design and test samples. First we investigate whether independent tests on the design data can be used for improving estimates of a predictors performance without inflicting a bias in the estimate. Then, motivated by recent developments in microarray based predictive medicine, we propose an algorithm that can use unlabeled data for selecting features and consequently improve predictor performance without wasting valuable labeled data.