Synergies between Chemometrics and Machine Learning

Abstract: Thanks to digitization and automation, data in all shapes and forms are generated in ever-growing quantities throughout society, industry and science. Data-driven methods, such as machine learning algorithms, are already widely used to benefit from these data in all kinds of applications, ranging from text suggestion in smartphones to process monitoring in industry. To ensure maximal benefit to society, we need performant, robust and trustworthy workflows to generate, analyze and model data. There are several scientific disciplines aiming to develop data-driven methodologies, two of which are machine learning and chemometrics. Machine learning is part of artificial intelligence and develops algorithms that learn from data. Chemometrics, on the other hand, is a subfield of chemistry aiming to generate and analyze complex chemical data in an optimal manner. There is already a certain overlap between the two fields, where machine learning algorithms are used for predictive modelling within chemometrics. However, since both fields aim to increase the value of data and have disparate backgrounds, there are plenty of possible synergies that could benefit both. Thanks to the wide applicability of machine learning, the field offers many tools and lessons learned that go beyond the predictive models used within chemometrics today. Chemometrics, on the other hand, has always been application-oriented, and this pragmatism has made it widely used for quality assurance within regulated industries. This thesis serves to nuance the relationship between the two fields and show that knowledge in either field can be used to benefit the other. In Paper I, we explore how tools widely used in applied machine learning can help chemometrics break new ground, through a case study of text analysis of patents.
We then draw inspiration from chemometrics and show, in Paper II, how principles of experimental design can help us optimize large-scale data processing pipelines and, in Paper III, how a method common in chemometrics can be adapted to allow artificial neural networks to detect outlier observations. In Paper IV, we show how experimental design principles can be used to ensure quality at the core of contemporary machine learning, namely the generation of large-scale datasets. Lastly, we outline directions for future research and how state-of-the-art research in machine learning can benefit chemometric method development.