Evolutionary Approaches to Sequence Alignment

Abstract: Molecular evolutionary biology allows us to look into the past by analyzing sequences of amino acids or nucleotides. These analyses can be very complex, often involving advanced statistical models of sequence evolution to construct phylogenetic trees, study patterns of natural selection, and perform a number of other evolutionary studies. In many cases, these studies require multiple sequence alignment (MSA) as a prerequisite: a technique that aims to group characters that share a common ancestor (that is, homologous characters) into columns. Statistical models need this homology information to describe the process of substitution and so perform evolutionary inference. Sequence alignment, however, is difficult, and MSAs often contain whole regions of wrongly aligned characters, which affect downstream analyses.

In this thesis I use two broad groups of approaches to avoid errors in the alignment. The first group dispenses with sequence alignment altogether by explicitly modelling the processes of substitution and of insertion and deletion (indels) between pairs of sequences using pair hidden Markov models. I describe an accurate tree inference method that uses neighbor-joining clustering to construct a tree from a matrix of model-based evolutionary distances. Next, I develop a pairwise method of modelling how natural selection acts on substitutions and indels. I further examine the relationship between the constraints acting on these two evolutionary forces and show that natural selection affects them in a similar way.

The second group of approaches deals with errors in existing alignments. I use a statistical model-based approach to evaluate the quality of multiple sequence alignments. First, I provide a graph-based tool for removing wrongly aligned pairs of residues by splitting them apart. This approach tends to produce better results than standard column-based filtering. Second, I provide a way to compare MSAs using a probabilistic framework. I propose new ways of scoring sequence alignments and show that popular methods produce similar results.

The overall purpose of this work is to facilitate more accurate evolutionary analyses by addressing the problem of sequence alignment in a statistically rigorous manner.
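
For readers unfamiliar with the tree-building step mentioned in the abstract, the following is a minimal sketch of textbook neighbor joining on a distance matrix. It is not the thesis implementation: the function name `neighbor_joining`, the `frozenset`-keyed distance dictionary, and the internal labels `node0`, `node1`, ... are assumptions made for illustration, and the toy distances in the usage example are invented rather than estimated from a pair hidden Markov model.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Build an unrooted tree from a symmetric distance matrix.

    names: list of taxon labels (hypothetical input format).
    dist:  dict mapping frozenset({a, b}) -> pairwise distance.
    Returns a list of (node, neighbor, branch_length) edges.
    """
    nodes = list(names)
    d = dict(dist)  # work on a copy so the caller's matrix is untouched
    edges = []
    next_id = 0

    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence: sum of distances from each node to all others.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # Join the pair that minimizes the Q criterion.
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        dij = d[frozenset((i, j))]
        u = f"node{next_id}"  # label for the new internal node
        next_id += 1
        # Branch lengths from the new internal node to i and to j.
        li = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))
        lj = dij - li
        edges += [(i, u, li), (j, u, lj)]
        # Distances from the new node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                )
        nodes = [k for k in nodes if k not in (i, j)] + [u]

    # Connect the last two nodes with the remaining distance.
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


if __name__ == "__main__":
    # Toy, invented distances between five taxa (not data from the thesis).
    toy = {
        ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
        ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
        ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
    }
    matrix = {frozenset(pair): float(v) for pair, v in toy.items()}
    for edge in neighbor_joining(["a", "b", "c", "d", "e"], matrix):
        print(edge)
```

In the setting the abstract describes, the entries of the matrix would be model-based evolutionary distances estimated for each pair of sequences; here they are plain numbers so that the clustering step can be read in isolation.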
