Chemometrics: Unravelling information from complex data generated by modern instruments

Abstract: Chemometrics is a discipline dedicated to solving problems arising from complicated analytical systems by combining statistics, mathematics, and computer programming. This thesis is based on the work carried out in four scientific projects, each published as a paper in a scientific journal. The studies focus essentially on data analysis: interpreting complicated data by means of algorithms and chemometric methodologies. Several chemometric approaches, based on multivariate data analysis and signal processing algorithms, have been studied and employed in each project. Most of the data analysis problems studied in these projects relate to liquid chromatography hyphenated to mass spectrometry, including tandem mass spectrometry. One of the projects relates to spectrophotometric data.

Chromatographic peak shifts have been attributed to a lack of control of the nominal chromatographic parameters. The purpose of the work presented in Paper I was to study retention time data, obtained experimentally by provoking peak shifts with controlled effects, and to demonstrate that there are patterns associated with the factors that affect chromatographic processes. A Principal Component Regression (PCR) model was calculated for each of the 98 compounds, using the retention times of that compound as the response (y) and the retention times of the remaining compounds as regressors (X). The results demonstrate that the peak shifts of each compound across samples are correlated with the peak shifts of the other compounds in the chromatographic data. This work confirmed a previous study in which an algorithm based on peak shift patterns was developed to improve the alignment of peaks in large numbers of complex samples.

Partial Least Squares (PLS) is one of the most widely used chemometric techniques. In the work presented in Paper II, a previously reported modified PLS algorithm was studied. This algorithm was developed to avoid the overfitted models that classical PLS produces as the noise in X increases. However, its results on less noisy data were not as good as those of classical PLS. From this study, we developed another modified algorithm that does not overfit with increasing noise in X and that converges to the solutions of classical PLS on less noisy data.

DNA adductomics is a recent omics field that studies modifications of DNA. The goal of the project in Paper III was to develop a program with a graphical interface to interpret LC-MS/MS data acquired with a data-independent acquisition method, in order to identify adducts in DNA nucleosides. The results were compared with those obtained manually. The program detected over 150 potential adducts, whereas only about 25 were found manually in a previous work. The program can detect adducts automatically in a matter of seconds.

Cancer has been associated with processes related to exposure to pollutants and the consumption of certain food products. These processes have been related to electrophilic compounds that react with DNA, forming adducts. When DNA modifications occur, defense mechanisms in the cell are often triggered, which frequently leads to the rupture of the cell. Fragments of DNA (micronuclei) then circulate in the bloodstream. In this work (Paper IV), electrophilic additions to hemoglobin (adducts) and the expression of micronuclei in blood samples from 50 children were studied. One of the goals of the project was to find correlations between the adducts in hemoglobin and the expression of micronuclei. PLS was used to model the data. However, the results were not conclusive (R² = 0.60), i.e., there may be some trends, but other variables that were not modelled may influence the variance in the expression of micronuclei.
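
To make the per-compound modelling of Paper I more concrete, the following is a minimal sketch of the idea: for each compound, a PCR model predicts its retention times from those of all other compounds. It is not the thesis code. The data are synthetic (40 samples, 12 compounds, two hypothetical drifting factors), the choice of two principal components is an assumption, the function names are illustrative, and the reported R² is an in-sample fit rather than a cross-validated one.

```python
import numpy as np


def pcr_fit_predict(X, y, n_components):
    """Fit PCR on (X, y) and return fitted values for the same samples."""
    Xc = X - X.mean(axis=0)
    y_mean = y.mean()
    yc = y - y_mean
    # Principal components of the centered X via SVD; scores T = U * s.
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]
    # Ordinary least squares of the centered response on the retained scores.
    coef, *_ = np.linalg.lstsq(T, yc, rcond=None)
    return T @ coef + y_mean


def r_squared(y, y_hat):
    """Coefficient of determination of fitted values against observations."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def per_compound_r2(rt, n_components=2):
    """One PCR model per compound: y = its column, X = all other columns."""
    n_compounds = rt.shape[1]
    scores = np.empty(n_compounds)
    for j in range(n_compounds):
        y = rt[:, j]
        X = np.delete(rt, j, axis=1)
        scores[j] = r_squared(y, pcr_fit_predict(X, y, n_components))
    return scores


# Synthetic retention times (assumption): nominal values plus shifts driven
# by two shared factors, plus a small amount of measurement noise.
rng = np.random.default_rng(0)
n_samples, n_compounds = 40, 12
nominal = np.linspace(2.0, 30.0, n_compounds)
factors = rng.normal(size=(n_samples, 2))
loadings = rng.uniform(0.2, 1.0, size=(2, n_compounds))
noise = rng.normal(scale=0.01, size=(n_samples, n_compounds))
rt = nominal + factors @ loadings + noise

r2 = per_compound_r2(rt)
print(np.round(r2, 3))
```

When the shifts of all compounds are driven by a few shared factors, as in this synthetic set, every per-compound model fits well, which is the kind of pattern the abstract describes. With real data the number of components would normally be chosen by validation rather than fixed in advance.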
