Utility of Knowledge Discovered from Sanitized Data
While much attention has been paid to data sanitization methods with the aim of protecting users’ privacy, far less emphasis has been put to the usefulness of the sanitized data from the view point of knowledge discovery systems. We consider this question and ask whether sanitized data can be used to obtain knowledge that is not defined at the time of the sanitization. We propose a utility function for knowledge discovery algorithms, which quantifies the value of the knowledge from a perspective of users of the knowledge. We then use this utility function to evaluate the usefulness of the extracted knowledge when knowledge building is performed over the original data, and compare it to the case when knowledge building is performed over the sanitized data. Our experiments use an existing cooperative learning model of knowledge discovery and medical data, anonymized and perturbed using two widely known sanitization techniques, called E-differential privacy and k-anonymity. Our experimental results show that although the utility of sanitized data can be drastically reduced and in some cases completely lost, there are cases where the utility can be preserved. This confirms our strategy to look at triples consisting of a utility function, a sanitization mechanism, and a knowledge discovery algorithm that are useful in practice. We categorize a few instances of such triples based on usefulness obtained from experiments over a single database of medical records. We discuss our results and show directions for future work.
utility of knowledge, privacy-preserving data mining, differential privacy, k-anonymity, cooperative learning