Unequal Probability Sampling in Active Learning and Traffic Safety

Abstract: This thesis addresses a problem arising in large and expensive experiments where incomplete data are abundant but statistical analysis requires additional information that is costly to collect. For practical and economic reasons, the analysis must be restricted to a subset of the original database, which inevitably causes a loss of valuable information; it is therefore essential to choose this subset so that it captures as much of the available information as possible. We address the issue of appropriate subset selection using finite population sampling methodology. We show how sample selection may be optimised to maximise precision in estimating various parameters and quantities of interest, and extend the existing finite population sampling methodology to an adaptive, sequential sampling framework in which the information required for optimising the sampling scheme is updated iteratively as more data are collected. The implications of model misspecification are discussed, and the robustness of the finite population sampling methodology against model misspecification is highlighted. The proposed methods are illustrated and evaluated on two problems: subset selection for optimal prediction in active learning (Paper I), and optimal control sampling for analysis of safety-critical events in naturalistic driving studies (Paper II). It is demonstrated that optimised sample selection may reduce the number of records for which complete information needs to be collected by as much as 50% compared to conventional methods and uniform random sampling.
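To make the idea of unequal probability sampling concrete, the following minimal sketch compares uniform sampling with sampling proportional to a cheap auxiliary variable when estimating a population total. It is an illustration under stated assumptions, not the method developed in the thesis: the estimator used is the classical Hansen-Hurwitz estimator for with-replacement sampling, and the population, the auxiliary variable `x`, and the function names `hansen_hurwitz_total`, `draw_sample` and `compare_designs` are all hypothetical.

```python
import random


def hansen_hurwitz_total(y_sampled, p_sampled):
    """Estimate a population total from a with-replacement sample.

    y_sampled: observed values of the costly variable for the drawn units.
    p_sampled: per-draw selection probabilities of those same units.
    """
    n = len(y_sampled)
    return sum(y / p for y, p in zip(y_sampled, p_sampled)) / n


def draw_sample(probs, n, rng):
    """Draw n unit indices with replacement; unit i has probability probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=n)


def compare_designs(pop_size=1000, n=50, reps=500, seed=1):
    """Return the simulated mean squared error of the estimated total
    under (uniform sampling, sampling proportional to x)."""
    rng = random.Random(seed)
    # Hypothetical population: x is known for every unit, y is costly to
    # observe and roughly proportional to x (an assumption of this sketch).
    x = [rng.uniform(1.0, 10.0) for _ in range(pop_size)]
    y = [xi * rng.uniform(0.8, 1.2) for xi in x]
    true_total = sum(y)

    uniform = [1.0 / pop_size] * pop_size
    x_sum = sum(x)
    proportional = [xi / x_sum for xi in x]

    def mse(probs):
        squared_error = 0.0
        for _ in range(reps):
            idx = draw_sample(probs, n, rng)
            estimate = hansen_hurwitz_total(
                [y[i] for i in idx], [probs[i] for i in idx]
            )
            squared_error += (estimate - true_total) ** 2
        return squared_error / reps

    return mse(uniform), mse(proportional)


if __name__ == "__main__":
    mse_uniform, mse_proportional = compare_designs()
    print("MSE, uniform sampling:       ", mse_uniform)
    print("MSE, proportional to x:      ", mse_proportional)
```

In this toy setting the selection probabilities are nearly proportional to the quantity being totalled, so the weighted estimator varies much less than under uniform sampling for the same number of costly observations. How large the gain is in practice depends on how well the auxiliary information predicts the outcome, which is the situation the adaptive framework described above is meant to handle.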