Data Privacy in High-Dimensional and Big Data Age

Barker, Kenneth; Zakerzadeh, Hessam
2015-12-07
2015
http://hdl.handle.net/11023/2665

The prevalent need for publicly available data sets, together with the privacy breaches that occur when such data is released, increases the need for resilient and precise techniques for privacy-preserving data publishing. To this end, numerous privacy models and algorithms have been developed for different data types. However, advances in privacy algorithms still suffer from two fundamental problems: data dimensionality and cardinality growth. Data dimensionality has remained a challenge for a wide variety of algorithms in data mining, clustering, classification and privacy. In the privacy domain, simply applying the existing privacy algorithms to high-dimensional data results in unacceptable information loss. Like the dimensionality problem, cardinality growth is an open problem in the privacy realm: existing privacy algorithms cannot run in acceptable time over terabyte-scale data sets.

This thesis shows that some common properties of real data can be leveraged to ameliorate the negative effects of the curse of dimensionality in practice. In real data sets, many dimensions exhibit high levels of inter-attribute correlation. Such correlations enable the use of a process known as vertical fragmentation, which creates vertical subsets of smaller dimensionality. This in turn allows an anonymization process based on combining results from multiple independent fragments. This dissertation presents a vertical fragmentation approach that is general enough to be applied to the k-anonymity and l-diversity models.

In addition, this dissertation presents a new approach to privacy-preserving data mining of massive data sets using MapReduce. Two of the most widely used privacy models, k-anonymity and l-diversity, are studied. We also investigate the privacy issue in publishing graph data, which is commonly seen in big data sets (e.g., social networks). Graph data is generally more difficult to anonymize because the structural information “hidden” in the graph can be leveraged by an attacker to infer sensitive information. In big graph data publishing, we focus only on protecting attributes, as they typically carry sensitive information.

eng
University of Calgary graduate students retain copyright ownership and moral rights for their thesis. You may use this material in any way that is permitted by the Copyright Act or through licensing that has been assigned to the document. For uses that are not allowable under copyright legislation or licensing, you are required to seek permission.
Computer Science; Data Privacy; High-dimensional data; Big Data; k-anonymity; l-diversity
doctoral thesis
10.11575/PRISM/25520