Applied bioinformatics for gene characterization

University dissertation from Stockholm : Karolinska Institutet, Department of Cell and Molecular Biology

Abstract: As the vast majority of human protein-encoding genes are now identified, the new challenge placed before life scientists is the determination of the functions of the proteins. Traditionally, intense, directed efforts are applied to decipher the function of a novel protein using laboratory techniques. Currently, increasing efforts are directed at the generation of high-throughput results for large numbers of genes using new technologies and the application of robotics to established methods. These efforts can generate large, complex, often noisy datasets, which are difficult to interpret. The extraction of information from genomics data that is relevant for specific scientific research efforts is required to accelerate functional characterization and annotation of genes by the scientific community. The research presented in this thesis highlights and addresses deficiencies in gene/protein function annotation. The bioinformatics tools and methodologies presented share the common theme of providing research scientists with the means to understand and interpret gene-specific data. The work, which addresses both diverse types of genomics data and a broad set of computational approaches, is united by the hypothesis that computational approaches to genomics data analysis can assist in the characterization of human protein-encoding genes. The initial sections of the thesis describe the identification of human protein-encoding genes for which there is little or no functional annotation. Chapter 1 presents the Gene Characterization Index, a bioinformatics method for quantifying the level of annotation of individual genes and monitoring progress. The Gene Characterization Index serves as a tool both for assessing the novelty of individual genes and for assessing short-term annotation progress on a genome scale.
In Chapter 2, computational approaches are used to identify specific protein families that share evolutionarily conserved domains for which the biochemical function is unknown. For the identified domains, a gene-centric data centre, NovelFam3000, is created to facilitate shared annotation of protein function. Knowledge of a domain's function, or the protein within which it arises, can facilitate the analysis of the entire set of proteins. The subsequent sections of the thesis focus on the creation of bioinformatics methods to assist human interpretation of gene function. Interpretation of large-scale biological data can be aided by visualization, since humans can perform complex interpretation of data through visual assessment. In order to enable rapid identification of biological relationships within high-throughput genomics data, the Parallel HeatMap viewer was developed for four-dimensional data display and pattern observation. Researchers seeking to understand gene function often turn initially to the scientific literature. Hindered by a historic lack of standard gene and protein-naming conventions, they endure long, sometimes fruitless literature searches. The final chapter of the thesis focuses on the computational identification of abstracts which may be relevant to gene function, the essential and difficult challenge that must be overcome for computationally assisted literature review. The SureGene system is based on supervised machine learning, resulting in a distinct model for the identification of relevant abstracts for each gene. The system was able to achieve high-quality gene disambiguation using scalable automated techniques. This thesis explores the hypothesis that computational methods can facilitate the identification and characterization of poorly annotated genes. The bioinformatics approaches to this problem assist researchers in advancing our understanding of the function of human protein-encoding genes.
