Advancing systems biology of yeast through machine learning and comparative genomics

Abstract: Synthetic biology has played a pivotal role in accomplishing the production of high value commodities, pharmaceuticals, and bulk chemicals. Fueled by the breakthrough of synthetic biology and metabolic engineering, Saccharomyces cerevisiae and various other yeasts (such as Yarrowia lipolytica , Pichia pastoris ) have been proven to be promising microbial cell factories and are frequently used in scientific studies. However, the cellular metabolism and physiological properties for most of the yeast species have not been characterized in detail. To address these knowledge gaps, this thesis aims to leverage the large amounts of data available for yeast species and use state-of-the-art machine learning techniques and comparative genomic analysis to gain a deeper insight into yeast traits and metabolism. In this thesis, machine learning was applied to various unresolved biological problems on yeasts, i.e., gene essentiality, enzyme turnover number (kcat), and protein production. In the first part of the work, machine learning approaches were employed to predict gene essentiality based on sequence features and evolutionary features. It was demonstrated that the essential gene prediction could be substantially improved by integrating evolution-based features. Secondly, a high-quality deep learning model DLKcat was developed to predict kcat values by combining a graph neural network for substrates and a convolutional neural network for proteins. By predicting kcat profiles for 343 yeast/fungi species, enzyme-constrained models were reconstructed and used to further elucidate the cellular metabolism on a large scale. Lastly, a random forest algorithm was adopted to investigate feature importance analysis on protein production, it was found that post-translational modifications (PTMs) have a relatively higher impact on protein production compared with amino acid composition. In comparative genomics, a comprehensive toolbox HGTphyloDetect was developed to facilitate the identification of horizontal gene transfer (HGT) events. Case studies on some yeast species demonstrated the ability of HGTphyloDetect to identify horizontally acquired genes with high accuracy. In addition, through systematic evolution analysis (e.g., HGT, gene family expansion) and genome-scale metabolic model simulation, the underlying mechanisms for substrate utilization were further probed across large-scale yeast species.