The use of structural information to improve biological sequence similarity searches

Abstract: Bioinformatics is a fast-developing field that makes use of computational methods to analyse and structure biological data. An important branch of bioinformatics is the prediction of protein structure and function. Determining the structure of a protein is a crucial part of characterising the molecule, and the structure can also give clues about how the protein functions in the cell. Since the experimental determination of a protein structure can be both difficult and time-consuming, and in some cases is impossible using current techniques, it is desirable to be able to predict the structure. If two protein sequences are very similar, it is known that they share the same structure. However, there are many proteins that share the same fold but have no clear sequence similarity. To find these relationships, and to be able to predict the structure of these proteins, so-called "protein fold recognition methods" have been developed. In this thesis, the field of bioinformatics is briefly surveyed, and two fold recognition methods are presented. Both methods use hidden Markov models (HMMs) to find related proteins, and both exploit the fact that structure is more conserved than sequence, but in two different ways. The first paper introduces the reader to the field of molecular biology and describes some common tools used for protein sequence comparison. HMMs in general are described in detail, as well as some methods for constructing multiple structure superpositions. Since 3D structure is more conserved than sequence, a multiple sequence alignment based on a multiple structure superposition is expected to be more biologically correct than an alignment based on sequence information alone, especially for proteins with low sequence identities. Our structure-anchored HMMs (saHMMs), which are presented in the paper, are constructed from multiple sequence alignments that are based on structural superposition.
The paper also describes the selection of the representatives for each protein family that were used to construct the saHMMs. In this selection, no protein in a given family has a sequence identity higher than a certain threshold to any other protein in the same family. The threshold is defined as the boundary of the so-called twilight zone. The saHMMs are shown to be able to find the family relationships for almost 90% of the test cases, even when the saHMMs are based on only two proteins. The second paper describes the secondary structure HMMs (ssHMMs). These HMMs are based on an ordinary multiple sequence alignment, as well as on the secondary structures of the proteins. When a query sequence is compared to an ssHMM, a predicted secondary structure is used for the query, and the sequence-based score is increased or decreased depending on how well the secondary structures match. A rigorous benchmark is also presented and used to compare automatically generated HMMs with ordinary sequence search methods. The results show that the ordinary sequence search methods tested perform about as well as automatic HMMs built from multiple alignments. The ssHMMs, however, are better at detecting the correct fold of a protein than all the other methods tested.
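
The selection of family representatives described above can be illustrated with a minimal Python sketch. It is not the procedure used in the thesis: the greedy strategy, the function names `pairwise_identity` and `select_representatives`, the way identity is computed (over columns where both aligned sequences have a residue) and the example threshold of 0.25 are all assumptions made for illustration. The abstract only states that the threshold is the boundary of the twilight zone.

```python
def pairwise_identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where both aligned
    sequences have a residue ('-' marks a gap). Illustrative definition."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    identical = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            identical += 1
    return identical / compared if compared else 0.0


def select_representatives(aligned: dict[str, str], max_identity: float) -> list[str]:
    """Greedily keep a protein only if its identity to every protein
    already kept does not exceed max_identity (hypothetical strategy)."""
    chosen: list[str] = []
    for name, seq in aligned.items():
        if all(pairwise_identity(seq, aligned[other]) <= max_identity
               for other in chosen):
            chosen.append(name)
    return chosen


# Toy family with made-up sequences; 0.25 is a placeholder threshold.
family = {
    "p1": "ACDEFGHIKL",
    "p2": "ACDEFGHIKV",  # 90% identical to p1, so it is rejected
    "p3": "MNPQRSTVWY",  # no identical positions with p1, so it is kept
}
representatives = select_representatives(family, max_identity=0.25)
```

A greedy pass depends on the input order, so a different ordering of the same family may yield a different, equally valid set of representatives.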
