Software Tools and Algorithms for Shotgun Sequence Assembly

University dissertation from Uppsala : Acta Universitatis Upsaliensis

Abstract: During the last ten years, a genomics revolution has changed the way biological research is carried out. The draft sequence of the human genome is available, as well as the sequences of 84 other completed genomes. High-throughput genomics technologies, such as genome sequencing with its associated bioinformatics tools, have been instrumental in this process. The draft genome sequences were determined using the shotgun sequencing strategy, in which long DNA molecules are randomly sheared into small pieces from which sequences are determined. These sequences are assembled by computer programs in order to reconstruct the original genome sequence. Ubiquitous repeated sequences, together with errors in the sequencing process, complicate the assembly of shotgun fragments. This complication causes gaps in most genome projects. This thesis presents methods and algorithms to separate repeated sequences in shotgun projects. The Tandem Repeat Assembly Program (TRAP) builds multiple alignments of reads, which are then analyzed in order to discriminate sequencing errors from real differences between highly similar repeats. The method is based on the fact that sequencing errors are randomly distributed, as opposed to the systematic distribution of mutations in repeat copies. The TRAP assembler was shown to be able to correctly assemble 2000 bp repeat copies that are repeated in tandem eight times. The degree of difference between repeat copies was 1.0%, and the maximum sequencing error was 11%. A refined method based on single base differences between repeat copies has been developed to further improve repeat separation. Results show that in the same sequence, 87% of all the single base differences present in the repeats can be detected, with an error of only 1.6%. In addition, a novel pattern-matching algorithm was developed. This algorithm takes advantage of the inherent symmetry between indices that can be computed for similar words of the same length, and it was implemented in the error correction software MisEd. The results show that up to 99.3% of the sequencing errors can be corrected, while up to 87% of the single base differences remain. All methods described have thus been shown to be functional, and it is clear that these programs will facilitate genome sequencing and assembly.
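
The idea stated in the abstract, that sequencing errors are scattered randomly while real differences between repeat copies recur systematically, can be illustrated with a small sketch. This is not the TRAP implementation: the function name `classify_columns`, the `min_minor_count` threshold, and the simple counting rule are assumptions made for illustration, whereas the thesis describes a statistical analysis of multiple alignments. The sketch assumes the reads are already aligned, padded to equal length with `-`, and it ignores gap characters.

```python
from collections import Counter


def classify_columns(alignment, min_minor_count=3):
    """Label each column of a read alignment.

    Hypothetical rule (an assumption, not the TRAP method):
    - 'consensus'  : all non-gap bases in the column agree
    - 'difference' : the second most common base is seen at least
                     min_minor_count times (looks systematic)
    - 'error'      : a minority base is seen fewer times (looks random)
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(read) != length for read in alignment):
        raise ValueError("all aligned reads must have the same length")

    labels = []
    for col in range(length):
        # Count the bases in this column, skipping gap padding.
        counts = Counter(read[col] for read in alignment if read[col] != "-")
        ranked = counts.most_common()
        if len(ranked) < 2:
            labels.append("consensus")
        elif ranked[1][1] >= min_minor_count:
            labels.append("difference")
        else:
            labels.append("error")
    return labels


# Toy alignment: column 2 splits 3 vs 3 (systematic), column 5 has one odd base.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACCTAC", "ACCTAC", "ACCTAT"]
labels = classify_columns(reads)
```

In this toy input, the column where half the reads carry a different base is flagged as a candidate repeat difference, while the column with a single deviating read is treated as a likely sequencing error. A fixed count threshold is the simplest possible stand-in for a proper statistical test and would need tuning to coverage and error rate.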
