On practical machine learning and data analysis

University dissertation from Stockholm : KTH

Abstract: This thesis discusses and addresses some of the difficulties associated with practical machine learning and data analysis. Introducing data-driven methods in e.g. industrial and business applications can lead to large gains in productivity and efficiency, but the cost and complexity are often overwhelming. Creating machine learning applications in practice often involves a large amount of manual labour, which often needs to be performed by an experienced analyst without significant experience with the application area. We will here discuss some of the hurdles faced in a typical analysis project and suggest measures and methods to simplify the process.

One of the most important issues when applying machine learning methods to complex data, e.g. from industrial applications, is that the processes generating the data are modelled in an appropriate way. Relevant aspects have to be formalised and represented in a way that allows us to perform our calculations in an efficient manner. We present a statistical modelling framework, Hierarchical Graph Mixtures, based on a combination of graphical models and mixture models. It allows us to create consistent, expressive statistical models that simplify the modelling of complex systems. Using a Bayesian approach, we allow for encoding of prior knowledge and make the models applicable in situations when relatively little data are available.

Detecting structures in data, such as clusters and dependency structure, is very important both for understanding an application area and for specifying the structure of e.g. a hierarchical graph mixture. We will discuss how this structure can be extracted for sequential data.
By using the inherent dependency structure of sequential data we construct an information-theoretical measure of correlation that does not suffer from the problems most common correlation measures have with this type of data.

In many diagnosis situations it is desirable to perform a classification in an iterative and interactive manner. The matter is often complicated by very limited amounts of knowledge and examples when a new system to be diagnosed is initially brought into use. We describe how to create an incremental classification system based on a statistical model that is trained from empirical data, and show how the limited available background information can still be used initially for a functioning diagnosis system.

To minimise the effort with which results are achieved within data analysis projects, we need to address not only the models used, but also the methodology and applications that can help simplify the process. We present a methodology for data preparation and a software library intended for rapid analysis, prototyping, and deployment.

Finally, we will study a few example applications, presenting tasks within classification, prediction and anomaly detection. The examples include demand prediction for supply chain management, approximating complex simulators for increased speed in parameter optimisation, and fraud detection and classification within a media-on-demand system.
