Data Privacy in High-Dimensional and Big Data Age

Zakerzadeh, Hessam

Date

2015-12-07

Authors

Zakerzadeh, Hessam

Abstract

The prevalent need for publicly available data sets, together with the privacy breaches that occur when such data is released, increases the need for resilient and precise techniques for privacy-preserving data publishing. To this end, numerous privacy models and algorithms have been developed for different data types. However, privacy algorithms still suffer from two fundamental problems: data dimensionality and cardinality growth. Dimensionality has remained a challenge for a wide variety of algorithms in data mining, clustering, classification, and privacy. In the privacy domain, simply applying existing privacy algorithms to high-dimensional data results in unacceptable information loss. Like dimensionality, cardinality growth is an open problem in the privacy realm: existing privacy algorithms cannot run in acceptable time over terabyte-scale data sets. This thesis shows that some common properties of real data can be leveraged to ameliorate the negative effects of the curse of dimensionality in practice. In real data sets, many dimensions exhibit high levels of inter-attribute correlation. Such correlations enable the use of a process known as vertical fragmentation to create vertical subsets of smaller dimensionality. This allows an anonymization process based on combining results from multiple independent fragments. This dissertation presents a vertical fragmentation approach that is general enough to be applied to the k-anonymity and l-diversity models. In addition, this dissertation presents a new approach to privacy-preserving data mining of massive data sets using MapReduce, studying two of the most widely used privacy models for anonymization, k-anonymity and l-diversity. We also investigate the privacy issue in publishing graph data, which is commonly seen in big data sets (e.g., social networks). Graph data is generally more difficult to anonymize because the structural information “hidden” in the graph can be leveraged by an attacker to infer sensitive information. In big graph data publishing, we focus only on protecting attributes, as they typically carry sensitive information.
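
As a rough illustration of the two privacy models named in the abstract, the following sketch checks whether a small table satisfies k-anonymity and the simplest ("distinct") form of l-diversity. It is not taken from the thesis: the function names, the column names, and the sample records are all hypothetical, and the thesis's fragmentation and MapReduce algorithms are not reproduced here.

```python
from collections import defaultdict


def group_by_quasi_identifiers(records, quasi_identifiers):
    """Group records into equivalence classes sharing the same quasi-identifier values."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[attr] for attr in quasi_identifiers)
        groups[key].append(record)
    return groups


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every equivalence class contains at least k records."""
    groups = group_by_quasi_identifiers(records, quasi_identifiers)
    return all(len(group) >= k for group in groups.values())


def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every equivalence class has at least l distinct sensitive values
    (distinct l-diversity, the simplest variant of the model)."""
    groups = group_by_quasi_identifiers(records, quasi_identifiers)
    return all(
        len({record[sensitive] for record in group}) >= l
        for group in groups.values()
    )


# Hypothetical, already-generalized sample data (assumption, for illustration only).
records = [
    {"age": "30-39", "zip": "T2N**", "disease": "flu"},
    {"age": "30-39", "zip": "T2N**", "disease": "asthma"},
    {"age": "40-49", "zip": "T3A**", "disease": "flu"},
    {"age": "40-49", "zip": "T3A**", "disease": "flu"},
]
quasi_identifiers = ["age", "zip"]

two_anonymous = is_k_anonymous(records, quasi_identifiers, 2)
two_diverse = is_l_diverse(records, quasi_identifiers, "disease", 2)
```

In this sample, each combination of age range and postal prefix appears twice, so the table is 2-anonymous, but the second group contains a single disease value, so it is not 2-diverse. This is the kind of gap between the two models that motivates studying both.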

Keywords

Computer Science

Citation

Zakerzadeh, H. (2015). Data Privacy in High-Dimensional and Big Data Age (Doctoral thesis, University of Calgary, Calgary, Canada). Retrieved from https://prism.ucalgary.ca. doi:10.11575/PRISM/25520

URI

http://hdl.handle.net/11023/2665

Collections

Open Theses and Dissertations
