Approaches for Distributing Large Scale Bioinformatic Analyses

Abstract: Ever since high-throughput DNA sequencing became economically feasible, the amount of biological data has grown exponentially. This has been one of the biggest drivers of the introduction of high-performance computing (HPC) to the field of biology. Unlike education in physics and mathematics, biology education has not had a strong focus on programming or algorithm development. This has forced many biology researchers to learn a whole new skill set, and has introduced new challenges for those managing HPC clusters. The aim of this thesis is to investigate the problems that arise when novice users use an HPC cluster for bioinformatics data analysis, and to explore approaches for mitigating them. In paper 1 we quantify and visualise these problems and contrast them with the more computer-experienced user groups already using the HPC cluster. In paper 2 we introduce a new workflow system (SciPipe), implemented as a Go library, as a way to organise and manage analysis steps. Paper 3 addresses cloud computing and how containerised tools can be used to run workflows without having to worry about software installations. In paper 4 we demonstrate a fully automated cloud-based system for image-based cell profiling. Starting with a robotic arm in a lab, it covers all the steps from cell culture and microscopy to having the cell profiling results stored in a database and visualised in a web interface.