Towards Privacy-Preserving Micro-data Analysis: A machine learning-based perspective under prevailing privacy regulations

Abstract: Machine learning (ML) has been employed in a wide variety of domains where micro-data (i.e., personal data) are used in the training process. Recent research has shown that ML models are vulnerable to privacy attacks that exploit their observable predictions and optimization information to extract sensitive information about the underlying data subjects. Models trained on micro-data therefore pose a distinct threat to the privacy of the data subjects. To mitigate these risks, privacy-preserving machine learning (PPML) techniques have been proposed in the literature. Existing PPML techniques are mainly based on differential privacy or cryptographic methods. However, these techniques either degrade the predictive accuracy of the derived ML models or incur a high computational cost. They also operate under the assumption that raw data are available for training the ML models.

Due to stringent requirements on data protection and data publishing, it is plausible that micro-data are anonymized by the data controllers before being released for analysis. When anonymized data are available for ML model training, it is vital to understand their impact on ML utility and privacy. In the data privacy literature, anonymization and PPML are often studied as two disconnected fields. We argue, however, that a natural synergy exists between these two fields, one that yields a myriad of benefits for data controllers as well as data subjects in the light of new privacy regulations, business requirements, and privacy risk factors. When anonymized data are used to train ML models, there is an intrinsic need to rethink the existing privacy-preserving mechanisms used in both data anonymization and PPML.

One of the main contributions of this thesis is an understanding of the opportunities and challenges that data anonymization presents in an ML setting. During this exploration, we highlight how certain provisions of the General Data Protection Regulation (GDPR) can be in direct conflict with the interests of ML utility and privacy. Inspired by these findings, we propose a novel anonymization technique based on probabilistic k-anonymity whose characteristics are amenable to ML utility and privacy. Next, we introduce a privacy-preserving technique for ML model selection based on integral privacy; by selecting models with characteristics that increase an adversary's uncertainty, it inhibits the inferences adversaries can draw about the training data or their transformations over time. Moreover, we provide a rigorous characterization of a well-known privacy attack targeting ML models, namely membership inference, and identify limitations of existing methods that can easily be manipulated to overstate or understate this particular privacy risk. Finally, we present a new membership inference attack model, based on activation-pattern anomaly detection, that overcomes these limitations while identifying membership with greater accuracy.

Together, we believe these contributions will broaden the research community's understanding, not only of the technical aspects of preserving privacy in ML, but also of its interplay with existing privacy regulations such as the GDPR. It is hoped that these findings will shape our journey of knowledge discovery in the era of big data.