Bioinformatic methods in protein characterization

University dissertation from Stockholm : Karolinska Institutet, Department of Medical Biochemistry and Biophysics

Abstract: Bioinformatics is an emerging interdisciplinary research field in which mathematics, computer science and biology meet. In this thesis, bioinformatic methods for analysis of functional and structural properties among proteins are presented. I have developed and applied bioinformatic methods to the enzyme superfamily of short-chain dehydrogenases/reductases (SDRs), coenzyme-binding enzymes of the Rossmann fold type, and amyloid-forming proteins and peptides. The basis for bioinformatics is the availability of biological data, collected in different types of databases. A non-redundant protein sequence database, KIND, has been compiled using a modification of the naïve algorithm. The database consists of the non-redundant union of the two protein sequence databases SWISSPROT and PIR, and the two databases derived from open reading frames, TrEMBL and Genpept. By applying sequence comparison techniques in the form of multiple and pair-wise alignment methods, protein sequences from known complete genomes have been compared and SDR members have been identified. Inter-species comparisons reveal eight protein clusters common to human, animal and plant genomes. The SDR superfamily, previously divided into only two families, is now found to consist of five families. Using a combination of hidden Markov models and motifs, an extendable assignment scheme has been developed, including a subfamily division of the two largest families, based on coenzyme specificity. This scheme will be a valuable tool in functional and structural assignments of novel SDR members. Coenzyme specificity has also been addressed in a more general sense, through the development of a coenzyme prediction method. The method is based on the existence of specific sequence motifs that are characteristic of coenzyme binding. Given an amino acid sequence, coenzyme-binding regions can be identified with a success rate of over 90%, but prediction of coenzyme type still needs to be improved.
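The motif-based identification of coenzyme-binding regions described above can be illustrated with a minimal sketch. The abstract does not list the motifs used in the thesis, so the pattern below is an assumption: the glycine-rich GxGxxG pattern, a widely cited signature of Rossmann-fold coenzyme-binding sites, stands in for the thesis's own motif set. The function name and the example sequence are hypothetical.

```python
import re

# Assumption: GxGxxG is used only as a well-known example of a glycine-rich
# Rossmann-fold signature; it is not the motif set from the thesis.
# The lookahead lets the scan report overlapping hits as well.
GLY_RICH = re.compile(r"(?=(G.G..G))")


def find_motif_hits(sequence, pattern=GLY_RICH):
    """Return (start, matched_text) for every hit of the pattern, 0-based."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


# Hypothetical sequence fragment, made up for illustration.
example = "MKKVAVIGAGSIGLALAKR"
print(find_motif_hits(example))
```

A scan like this only flags candidate binding regions. Telling coenzyme types apart would need further sequence features, which the abstract notes is still the harder part of the problem.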
A method to predict amyloid fibril-forming proteins has been developed, utilizing unsuccessful secondary structure predictions of regions with weak α-helical propensities. Experimentally determined α-helices were compared with their predicted secondary structures in a large set of proteins. Among these, it was found that proteins with amyloid fibril-forming tendencies harbour α-helices that are falsely predicted to be β-strands, suggesting that such proteins have segments with an amino acid composition typical of β-strands rather than α-helices. This phenomenon, now referred to as discordance, is probably one of the reasons why some proteins fail to fold properly and instead form insoluble fibrils that, directly or indirectly, are the cause of severe diseases such as Alzheimer's disease and the prion diseases. In conclusion, this thesis shows that bioinformatic methods, applied to protein sequence data, are important tools to study and characterize structural and functional properties among proteins.
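The comparison behind discordance can be sketched as follows. This is a minimal illustration, not the thesis's implementation: it assumes that the observed and predicted secondary structures are available as equal-length strings in the common three-state notation (H for helix, E for strand, C for coil), and the minimum segment length is a hypothetical parameter chosen for the example.

```python
def discordant_segments(observed, predicted, min_length=7):
    """Find runs where an observed helix ('H') is predicted as a strand ('E').

    Returns half-open (start, end) index pairs. The default min_length is an
    illustrative assumption, not a value taken from the thesis.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted strings must have equal length")

    segments = []
    start = None
    for i, (obs, pred) in enumerate(zip(observed, predicted)):
        if obs == "H" and pred == "E":
            if start is None:
                start = i  # a discordant run begins here
        else:
            if start is not None and i - start >= min_length:
                segments.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(observed) - start >= min_length:
        segments.append((start, len(observed)))
    return segments


# Hypothetical strings, made up for illustration.
observed = "CCHHHHHHHHHHCC"
predicted = "CCHHEEEEEEEECC"
print(discordant_segments(observed, predicted))
```

Under these assumptions, a protein would be flagged as a candidate when at least one sufficiently long discordant segment is found.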
