In silico prediction of cis-regulatory elements

University dissertation from Stockholm : Karolinska Institutet, Center for Genomics Research

Abstract: As one of the most fundamental processes for all life forms, transcriptional regulation remains an intriguing and challenging subject for biomedical research. Experimental efforts towards understanding the regulation of genes are laborious and expensive, but can be substantially accelerated with the use of computational predictions. The growing number of fully sequenced metazoan genomes, in combination with the increasing use of high-throughput methods such as microarrays, has increased the need to combine computational methods with laboratory work. In silico methods for the prediction of transcription factor binding sites are mature, yet critical problems remain unsolved. In particular, the rate of falsely predicted sites is unacceptably high with current methods, because the binding sites targeted by transcription factors are small and degenerate. In addition to raising the false prediction rate, this property limits the ability of pattern discovery algorithms to find mediating binding sites in promoters of co-expressed genes. The latter problem constitutes a bottleneck when analyzing regulatory sequences in complex eukaryotes, as regulatory sequences are generally spread over extended genomic regions. This thesis describes the development of algorithms and resources for transcription factor binding site analysis, addressing two problems: (1) site prediction, where a model describing the binding properties of a transcription factor is applied to a sequence to find functional binding sites; and (2) pattern discovery, where over-represented patterns are sought in sets of promoters. Initially, an open-access database (JASPAR) was created, holding high-quality models for transcription factor binding sites. The database formed part of the foundation for the subsequent project (ConSite), where a set of methods was developed for utilizing cross-species comparison in binding site prediction (phylogenetic footprinting) to enhance predictive selectivity. 
In this study, we showed that ~85% of false predictions were removed when the analysis was restricted to promoter regions conserved between human and mouse. The current statistical framework for modeling binding properties of transcription factors is inadequate for some regulatory proteins, most notably the medically important nuclear hormone receptors. A Hidden Markov Model framework capable of both predicting and classifying nuclear hormone receptor response elements was developed. In a case study using the pufferfish genome as a predictive platform, we showed that nuclear receptor genes have a high potential for cross- or auto-regulation. Pattern discovery in promoters of multi-cellular eukaryotes is limited by the low strength of patterns buried in extended genomic sequence. Methods for improving both the sensitivity of pattern discovery and the evaluation of resulting patterns were developed. We showed that comparing newly found patterns to databases of experimentally verified profiles is a meaningful complement to other means of evaluating patterns. Furthermore, we showed that structural constraints shared by families of transcription factors can be integrated as prior expectations in pattern-finding algorithms for a significant increase in sensitivity.
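
As an illustration of the site prediction step combined with a conservation filter, the following is a minimal sketch only, not the ConSite implementation. It assumes a simple log-odds position weight matrix built from a made-up count matrix (not a real JASPAR profile), a uniform background, and conserved regions supplied as hypothetical (start, end) coordinates; all function names, the pseudocount, and the score threshold are illustrative assumptions.

```python
import math

# Hypothetical count matrix for a 4-bp motif, one dict per position.
# These counts are invented for illustration; they are not a JASPAR profile.
COUNTS = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 1, "C": 1, "G": 0, "T": 8},
]


def log_odds_matrix(counts, pseudocount=0.5, background=0.25):
    """Convert counts to log2 odds against a uniform background (assumed)."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append(
            {
                base: math.log2((column[base] + pseudocount) / total / background)
                for base in "ACGT"
            }
        )
    return matrix


def scan(sequence, pwm, threshold):
    """Return (start, site, score) for every window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous characters
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


def keep_conserved(hits, conserved_regions, width):
    """Keep hits lying entirely inside a conserved (start, end) region.

    Regions are half-open intervals, assumed to come from a separate
    cross-species alignment step that is not shown here.
    """
    kept = []
    for start, window, score in hits:
        end = start + width
        if any(r_start <= start and end <= r_end for r_start, r_end in conserved_regions):
            kept.append((start, window, score))
    return kept


pwm = log_odds_matrix(COUNTS)
sequence = "TTACGTGGGACGTCC"
hits = scan(sequence, pwm, threshold=5.0)
conserved = keep_conserved(hits, conserved_regions=[(0, 8)], width=len(pwm))
print("all hits:", [(start, site) for start, site, _ in hits])
print("conserved hits:", [(start, site) for start, site, _ in conserved])
```

In this toy example the matrix matches two sites, and only the one inside the hypothetical conserved region survives the filter, which is the intuition behind using conservation to reduce false predictions.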
