Hidden Markov models for remote protein homology detection

University dissertation from Stockholm : Karolinska Institutet, Center for Genomics Research

Abstract: Genome sequencing projects are advancing at a staggering pace and produce large amounts of sequence data every day. However, the experimental characterization of the encoded genes and proteins is lagging far behind. Interpretation of genomic sequences therefore relies largely on computational algorithms and on transferring annotation from characterized proteins to related uncharacterized proteins. Detection of evolutionary relationships between sequences, known as protein homology detection, has become one of the main fields of computational biology. Arguably the most successful technique for modeling protein homology is the Hidden Markov Model (HMM), which is based on a probabilistic framework. This thesis describes improvements to protein homology detection methods. The main part of the work is devoted to profile HMMs, which are used in database searches to identify homologous protein sequences that belong to the same protein family. The key step is model estimation, which aims to create a model that generalizes an often limited and biased training set to the entire protein family, including members that have not yet been observed. This thesis addresses several issues in model estimation: i) prior probability settings, pointing at a conflict between modeling true positives and achieving high discrimination; ii) discriminative training, by proposing an algorithm that adapts model parameters from non-homologous sequences; and iii) key HMM parameters, assessing the relative importance of different aspects of the estimation process, leading to an optimized procedure. Taken together, the work extends our knowledge of the theoretical aspects of profile HMMs and can immediately be used to improve protein homology detection with them.
If related sequences are highly divergent, standard methods often fail to detect homology. The superfamily of G protein-coupled receptors (GPCRs) can be divided into families that have an almost complete lack of sequence similarity yet share the same topology of seven membrane-spanning regions. It would not be possible to construct a profile HMM that models the entire superfamily. We instead analyzed the GPCR superfamily and found conserved features in the amino acid distributions and lengths of membrane and non-membrane regions. Based on those high-level features we estimated an HMM (GPCRHMM), with the specific goal of detecting remotely related GPCRs. GPCRHMM is, at comparable error rates, much more sensitive than other strategies for GPCR discovery. In a search of five genomes we predicted 120 previously unannotated sequences to be possible GPCRs. The majority of these predictions (102) were in C. elegans, but 4 were also found in human and 7 in mouse.
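
To illustrate the role that prior probability settings play in model estimation, the following is a minimal Python sketch. It is not the procedure developed in the thesis. It assumes the simplest textbook setup: the emission distribution of one match position is estimated from observed residue counts plus pseudocounts drawn from a background distribution. The function name `estimate_emissions`, the parameter `prior_weight`, and the example column are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def estimate_emissions(column, prior_weight=1.0, background=None):
    """Estimate emission probabilities for one alignment column.

    column:       residues observed at one match position (a string).
    prior_weight: total pseudocount mass added to the counts
                  (hypothetical tuning parameter for this sketch).
    background:   dict mapping residue -> prior probability;
                  a uniform distribution is assumed if omitted.
    """
    if background is None:
        background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
    # Count only residues that the background distribution knows about.
    counts = Counter(r for r in column if r in background)
    total = sum(counts.values()) + prior_weight
    # Observed counts plus prior pseudocounts, normalized to sum to 1.
    return {
        a: (counts[a] + prior_weight * background[a]) / total
        for a in AMINO_ACIDS
    }


# A small, biased training column dominated by leucine.
column = "LLLLIV"
weak = estimate_emissions(column, prior_weight=0.5)
strong = estimate_emissions(column, prior_weight=20.0)
```

With a small `prior_weight` the estimate stays close to the observed residues, while a large value pulls it toward the background and assigns more probability to residues that were never observed in the training set. Under these simplified assumptions, this is the kind of trade-off a prior setting controls.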
