Hunter-gatherer-annotator science : characterizing regulatory elements in the genome of dog and zebrafish with public and not yet public data

Abstract: Studying gene regulation requires large amounts of sequencing data. We can either generate (hunt) these data ourselves or use (gather) publicly available data sets. To guarantee the reliability and reusability of the hunted and gathered data, we also need to annotate them with the correct metadata. In this thesis, I touch on all three of these aspects. I was part of two international consortia that applied these approaches to two different model organisms. The DANIO-CODE consortium was initiated to systematically annotate the zebrafish genome. Similarly, the Dog Genome Annotation (DoGA) project aims to improve the annotation of genomic elements in the dog genome. Both zebrafish and dogs are popular model organisms for studying biological processes and pathologies in humans. Despite their popularity, both organisms lack a large-scale annotation of regulatory elements. Before analyzing any data, we designed an annotation structure that captures all aspects of a sequencing experiment that are essential for processing and analyzing the data. We implemented this structure in a web platform that allows easy upload, query, and download of the sequencing data and associated metadata. We present the structure and implementation in Study I, which also contains a comparison to similar, well-established annotation schemata. In Study II, we used this annotation structure and the web platform to collect sequencing data from 1,803 samples from 38 different research groups, covering transcriptomic, epigenomic, and methylomic perspectives on different stages of zebrafish development. We identified more than 140,000 new cis-regulatory elements active during development and provide them, together with the sequencing data and genome browser tracks, as a resource for the community. In Study III, we present a biobank for dog tissues established for the DoGA consortium. 
In Study III and Study IV, we used 88 and 37 tissues from the biobank, respectively, to catalog promoter regions and their tissue activity using STRT and CAGE-seq. In Study III, we also present the web platform, based on the structure from Study I, through which we make the data and the corresponding metadata available. In Study IV, we also used the CAGE-seq data to identify active enhancer regions and their corresponding tissue activity. We identify regulatory networks between enhancers and promoters and show that they are conserved in humans.
