Tree-based machine learning methods with non-life insurance applications

Abstract: Non-life insurance has been a data-driven field for a long time, with the statistical framework behind modern actuarial science laid out at the beginning of the 20th century. Problems regarding the estimation and prediction of risk are relevant to the insurance industry specifically, but also to society as a whole. The rise of machine learning methods has created a new set of tools that can be used to solve these problems. This thesis contains five individual papers, all of which are related to developing machine learning- or data-driven methods and algorithms that can be applied to, but are not limited to, non-life insurance applications.

Paper I takes an existing probabilistic model for claims reserving, the Collective Reserving Model (CRM), and replaces the linear modeling approach of the original paper with non-linear machine learning methods. The paper addresses issues in these applications and provides a framework for how to implement and evaluate machine learning models in a reserving setting. It also discusses how to implement early stopping methods given different levels of data granularity. The models are evaluated on a series of simulated data sets with promising results.

Paper II does not use a machine learning method per se but instead develops the CRM used in Paper I by adding the openness status of the claims to the dynamics, and presents the CRM with Openness (CRMO) as a means to model the non-linear effects implied in Paper I. The paper presents how the model can be estimated using regression methods, and provides recursive formulas for the moments of the predicted reserve. The algorithm is evaluated in terms of accuracy on the same data set as in Paper I and shows results that are comparable to the machine learning implementations of the CRM.

Paper III presents a new boosting algorithm called the Cyclic Gradient Boosting Machine (CGBM). The algorithm extends the classical gradient boosting machine to provide multi-dimensional function approximation. The paper shows how the CGBM can be used to estimate entire probability distributions rather than just the mean of the distribution. The paper also discusses potential problems with hyperparameter tuning in this higher-dimensional hyperparameter space and provides a dimension-wise early stopping method, which is shown to be useful for avoiding overfitting. Numerical illustrations show accurate results on simulated and real data sets.

Paper IV is not directly related to non-life insurance but rather to decision trees used for classification and regression. The paper presents the trinary tree algorithm, which is a new way to handle missing input data for tree-based models, meant to provide a more regularized model than other suggested methods. The algorithm is benchmarked against standard methods for missing-data handling and shows promising results even for high rates of missing data.

Paper V presents a generalized linear model with non-linear effects induced by varying coefficients, with the varying coefficients estimated using the CGBM from Paper III. This is a special case of a varying coefficient model (VCM). The model can handle highly non-linear effects while maintaining local interpretability. The paper also shows how tuning, feature selection, and evaluation of interaction effects can be simplified as compared to other VCMs. The model is evaluated on the same data set as in Paper III and shows promising results in terms of accuracy and interpretability.
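The idea summarized for Paper III, boosting several parameter functions in turn and stopping each one separately, can be illustrated with a minimal Python sketch. This is not the CGBM as defined in the thesis. It is a simplified illustration under stated assumptions: a Gaussian response with parameters mu and s = log(sigma), a single feature, regression stumps fitted to negative gradients without a line search, and a naive stopping rule that freezes a dimension the first time its update fails to improve the validation loss. All function names, the simulated data, and the settings are hypothetical.

```python
import math
import random


def fit_stump(x, target):
    """Least-squares stump on one feature: (threshold, left_value, right_value)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    n, total = len(x), sum(target)
    best, left_sum = None, 0.0
    for k in range(1, n):
        left_sum += target[order[k - 1]]
        lo, hi = x[order[k - 1]], x[order[k]]
        if lo == hi:
            continue  # cannot split between identical feature values
        right_sum = total - left_sum
        # Maximising this score is equivalent to minimising squared error.
        score = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if best is None or score > best[0]:
            best = (score, 0.5 * (lo + hi), left_sum / k, right_sum / (n - k))
    return best[1:]


def nll(y, mu, s):
    """Mean Gaussian negative log-likelihood up to a constant, s = log(sigma)."""
    return sum(si + 0.5 * (yi - mi) ** 2 * math.exp(-2 * si)
               for yi, mi, si in zip(y, mu, s)) / len(y)


def neg_gradient(dim, y, mu, s):
    """Negative gradient of the loss w.r.t. mu (dim 0) or s (dim 1)."""
    if dim == 0:
        return [(yi - mi) * math.exp(-2 * si) for yi, mi, si in zip(y, mu, s)]
    return [(yi - mi) ** 2 * math.exp(-2 * si) - 1.0
            for yi, mi, si in zip(y, mu, s)]


def cyclic_boost(x_tr, y_tr, x_va, y_va, max_rounds=200, lr=0.1):
    """Cycle over the two parameter dimensions, one stump per visit."""
    mean = sum(y_tr) / len(y_tr)
    var = sum((yi - mean) ** 2 for yi in y_tr) / len(y_tr)
    init = [mean, 0.5 * math.log(var)]
    f_tr = [[init[d]] * len(x_tr) for d in range(2)]
    f_va = [[init[d]] * len(x_va) for d in range(2)]
    stumps, rounds, active = [], [0, 0], [True, True]
    best_va = nll(y_va, f_va[0], f_va[1])
    for _ in range(max_rounds):
        for d in range(2):
            if not active[d]:
                continue
            grad = neg_gradient(d, y_tr, f_tr[0], f_tr[1])
            thr, left, right = fit_stump(x_tr, grad)
            left, right = lr * left, lr * right  # shrink the update
            cand = [f + (left if xi <= thr else right)
                    for f, xi in zip(f_va[d], x_va)]
            va = nll(y_va, cand if d == 0 else f_va[0],
                     cand if d == 1 else f_va[1])
            if va >= best_va:
                active[d] = False  # naive rule: freeze this dimension only
                continue
            best_va, f_va[d] = va, cand
            f_tr[d] = [f + (left if xi <= thr else right)
                       for f, xi in zip(f_tr[d], x_tr)]
            stumps.append((d, thr, left, right))
            rounds[d] += 1
        if not any(active):
            break
    return init, stumps, rounds, best_va


def predict(init, stumps, xi):
    """Return (mu, sigma) for a single feature value."""
    f = list(init)
    for d, thr, left, right in stumps:
        f[d] += left if xi <= thr else right
    return f[0], math.exp(f[1])


def simulate(n, rng):
    """Hypothetical data: mean and spread both depend on x."""
    xs = [rng.random() for _ in range(n)]
    ys = [rng.gauss(-1.0, 0.3) if xi < 0.5 else rng.gauss(1.0, 1.0)
          for xi in xs]
    return xs, ys


rng = random.Random(0)
x_tr, y_tr = simulate(400, rng)
x_va, y_va = simulate(200, rng)
init, stumps, rounds, best_va = cyclic_boost(x_tr, y_tr, x_va, y_va)
```

Here `rounds` holds the number of accepted boosting steps per dimension, so the mean and the log standard deviation can end up with different model complexities. A practical implementation would likely use deeper trees, a patience window rather than a single failed step, and multiple features.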
