Genomics and bioinformatics approaches to functional gene annotation

University dissertation from Stockholm : Karolinska Institutet, Center for Genomics Research

Abstract: Biomedical research has been undergoing a quasi-revolution with the dawn of the genomics era. The flood of sequence data from the various genome projects, the task of cataloging the entire coding portion of a genome instead of identifying and characterizing individual genes, and the technical demands accompanying these developments have posed great challenges to the research community. Although the entire human genome sequence has been virtually recorded, fundamental questions remain about the precise number of protein-coding genes and about their functional characterization. Available resources for the study of human gene function include large genome annotation pipelines, expression profiling data, and protein interaction screens. To gain biological insights from this maze of data, one must both find mechanisms to organize the information and assess the quality of the results.
This thesis focuses on the functional annotation of sparsely characterized human genes and their encoded proteins. The work includes four stages:
I. Gene expression profiling
II. Assessment of the level of characterization of human genes
III. Projection of protein networks from lower eukaryotes onto human
IV. Integration of computational and experimental results for data mining
Initially, a cross-platform comparison of a set of gene expression profiling techniques was carried out to compare the performance of cutting-edge high-throughput methods and conventional approaches in terms of sensitivity, reliability, and throughput. In this study, we demonstrated that correlation between the different methods was poor and thus multi-technique validation was justified. Nonetheless, the strongest correlation with the new reference data in our report, i.e., a collection of traditional Northern blots, was observed for microarray-based technologies.
The assessment of the level of functional characterization of human genes was addressed in the second study, where we developed a scoring system to quantify the annotation status of each human gene. We created a metric to effectively predict the characterization status of human genes based on a set of predictors from the GeneLynx database (http://www.genelynx.org). This scoring function will not only assist the targeted analysis of groups of sparsely annotated genes and proteins, but will also prove useful in monitoring long-term gene annotation efforts and the overall annotation status of the human genome.
Comparative genomics efforts to transfer gene annotation from proteins in amenable model organisms onto human proteins are currently restricted by the limited availability of experimental data. Nonetheless, we demonstrated how protein networks could be effectively projected from lower eukaryotes onto human and how the confidence in these projections increased with redundantly detected protein interactions. This so-called Interolog Analysis offers promise for reliable inference of protein function. The bioinformatics system we created (Ulysses) provides a novel, intuitive interface for biologists studying human proteins. As data depth and coverage increase over time, this system will prove valuable in the extended prediction of high-confidence functional associations for a large portion of human genes.
The fusion of experimental data and computational predictions is a central goal of functional genomics. We constructed a bioinformatics workbench for the study of uncharacterized human gene families. By assembling bioinformatics resources and experimental results in a common space, the NovelFam3000 system facilitates functional characterization. Working with a collection of uncharacterized genes, we demonstrated how bioinformatics methods can lead to novel inferences about the cellular function of specific protein families.
This thesis unites the identification of uncharacterized human genes, the assessment of genomics data quality, and the application of high-throughput data for the inference of protein function.
