Regression on high-dimensional predictor space : With application in chemometrics and microarray data

University dissertation from Stockholm : Karolinska Institutet, Department of Medical Epidemiology and Biostatistics

Abstract: This thesis focuses on regression methodology for prediction and classification in situations where there are many predictors but a limited number of observations. This situation is common in chemometrics and microarray data. In chemometrics, we obtain the absorbance level of each sample at hundreds or thousands of wavelengths (variables) in a near-infrared (NIR) calibration. In microarray data, we have the expression levels of thousands of genes or proteins (variables) for each sample. When these variables enter a regression analysis, the model has a single response vector and hundreds or thousands of predictors. The challenge is how to infer the pattern in the data when the number of samples is limited. Having a large number of variables but a limited number of samples raises many problems. We address some of them in this research, including parameter estimation, variable selection, and inference, and develop methodology to deal with them. We deal with the question of variable selection in NIR calibration and conclude that variable selection does not guarantee better prediction. A case-by-case investigation is necessary to determine whether all available variables are relevant for prediction. For microarray data, we find that procedures that select variables for logistic regression based on multivariate information give a better model fit than selection based on t-statistics. We deal with the problem of parameter estimation with such a large number of variables by considering models that can include all of the available variables. Selecting differentially expressed genes from logistic regression with random effects is not possible because of the limited amount of information. To deal with this inference problem, we investigate a linear mixed model in which the random effects are assumed to follow a mixture of three normal distributions.
The three mixture components correspond to genes that are down-regulated, not differentially expressed, and up-regulated. Inference on each gene then becomes a question of which mixture component the gene belongs to. In this context, estimation of fold-change and identification of differentially expressed genes can be done simultaneously. We conclude that the method performs reasonably well in identifying the genes. This is validated by a spike-in study and by simulation. When the model is applied to find coregulated genes, the method identifies the genes, although its performance depends on the amount of information in the data.
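As a rough illustration of how membership in a three-component normal mixture can be turned into a gene-level decision, the sketch below computes posterior component probabilities with Bayes' rule and assigns each gene to the most probable component. It is not the thesis's estimation procedure: the component weights, means, and standard deviations are hypothetical values chosen for illustration, and the gene effect is treated as if it were observed directly, which ignores the estimation noise that a linear mixed model would account for.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


# Hypothetical mixture parameters (assumptions, not values from the thesis):
# (label, weight, mean, standard deviation) on a log fold-change scale.
COMPONENTS = [
    ("down", 0.1, -1.0, 0.5),
    ("non", 0.8, 0.0, 0.2),
    ("up", 0.1, 1.0, 0.5),
]


def posterior_membership(effect, components=COMPONENTS):
    """Posterior probability of each mixture component for one gene effect."""
    weighted = {
        label: weight * normal_pdf(effect, mean, sd)
        for label, weight, mean, sd in components
    }
    total = sum(weighted.values())
    return {label: value / total for label, value in weighted.items()}


def classify(effect, components=COMPONENTS):
    """Assign the gene to the component with the highest posterior probability."""
    posterior = posterior_membership(effect, components)
    return max(posterior, key=posterior.get)


if __name__ == "__main__":
    for effect in (-1.5, 0.05, 1.2):
        print(effect, classify(effect), posterior_membership(effect))
```

Under these assumed parameters, an effect near zero is assigned to the "non" component and a large positive or negative effect to "up" or "down". The effect value itself serves as the fold-change estimate, which is one simple way to see how estimation and identification can be handled together.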
