Prediction Methods for High Dimensional Data with Censored Covariates

Abstract: While access to data steadily increases, not all data are straightforward to use for prediction. Censored data are common in several industrial scenarios and typically arise when measuring equipment has limitations, for instance concentration-measuring equipment in chemistry or signal receivers in signal processing. In this thesis, we approach the problem of prediction with censored covariate data from several angles. We explore the impact on both the covariates and the response when the censored covariates are imputed. We consider linear as well as non-linear approaches, and we explore how both frequentist and Bayesian models perform with censored covariate data. While the focus is on using the imputed covariate data for prediction, we also investigate model parameter inference and the uncertainty arising from the imputations. We use real, censored covariate telecommunications data for prediction with some of the most commonly used prediction models and evaluate the performance when single imputations are made. We propose a selective multiple imputation approach that is suitable for high-dimensional data and performs well under heavy censoring. We take a Bayesian linear regression approach that leverages information from auxiliary variables using multivariate regression, and we introduce multivariate draws from conditional distributions to update censored values in the covariates. Finally, we offer a bridge between the flexibility of neural networks and the probabilistic nature of Bayesian methods by taking a Variational Autoencoder approach and introducing Zero-Inflated Truncated Gaussian likelihoods for the covariates to better fit the censored distributions.
