The labyrinth of protein classification: a pipeline for selection and classification of biological data

Abstract: Recent progress in fundamental biological sciences and medicine has considerably increased the quantity of data that can be studied and processed. The main limitation now is not retrieving data, but rather extracting useful biological insights from the large datasets accumulated. More recent advances have provided detailed high-density data regarding metabolism (metabolomics) and protein expression (proteomics). Clearly, no single analytical method can provide a comprehensive understanding. Rather, the ability to link available data together in a coherent manner is required to obtain a complete view. Improvements in the application of Machine Learning (ML) techniques provide the means to make continuous progress in processing complex datasets. A brief discussion is offered on the advantages of ML, the state of the art in Deep Learning (DL) for protein predictions and the importance of ML in biological data processing. Noise stemming from incorrect classification or arbitrary/ambiguous labelling of data may arise when ML techniques are applied to large datasets. Furthermore, the stochasticity of biological systems needs to be considered to evaluate the outputs correctly. Here we show the potential of a workflow to answer biological questions while taking into consideration a perturbation of the biological data. To control the applicability of models and maximize their predictivity, in silico filtering schemes can usefully be applied as an “Ockham’s razor” before using any ML technique. After reviewing different DL approaches for protein prediction purposes, this work shows that a computational approach in the filtering steps is a valuable tool for protein classification when biological features are not fully annotated or reviewed. The in silico approach has identified putative proline transporters in fungi and plants as well as carotenoid biosynthetic gene products in the plant family Brassicaceae.
The proposed method is suitable for extracting features for classification and thereby maximizing the usefulness of a DL approach.