K-Anonymity Privacy Preservation in Data Mining

Thamindu Aluthwala
Mar 9, 2020

Privacy preservation of data has always been important. The volume of data collected by companies is continuously growing, and analysis of that data benefits both society and business decision making. Today, a single company may hold sensitive personal data of hundreds of thousands of individuals. Storing and sharing such data raises serious privacy concerns, and companies are entrusted with the privacy of the personal data they collect and store. Privacy-Preserving Data Mining (PPDM) techniques have therefore become very important.

Anonymization is the process of turning data into a form that cannot be used to uniquely identify individuals. Although anonymization permanently de-links personal data from a specific person, removing direct identifiers such as names from a data set is not enough to achieve it. Anonymized data can be de-anonymized by linking the data set with another publicly available data set (a link attack). A data set may include attributes that are not unique identifiers on their own but whose values, taken together, can identify an individual record; these are known as quasi-identifiers. For example, based on 1990 US census data, around 87% of the US population could be uniquely identified by the combination of 5-digit zip code, date of birth and gender. Even when only a small percentage of individuals can be uniquely identified, the result can be a severe privacy breach. K-anonymization was proposed as a solution to this problem.
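To make the link attack concrete, here is a minimal sketch in Python. The records, names and attribute values are entirely hypothetical; the point is that joining a "de-identified" release with a public data set on the quasi-identifiers alone can re-attach names to sensitive values.

```python
from collections import defaultdict

# Hypothetical "anonymized" medical release: names removed, but the
# quasi-identifiers (zip code, date of birth, gender) were kept.
medical = [
    {"zip": "02139", "dob": "1965-07-22", "gender": "F", "diagnosis": "asthma"},
    {"zip": "02139", "dob": "1971-03-04", "gender": "M", "diagnosis": "flu"},
]

# Hypothetical public voter roll with the same quasi-identifiers plus names.
voters = [
    {"name": "Alice Smith", "zip": "02139", "dob": "1965-07-22", "gender": "F"},
    {"name": "Bob Jones",   "zip": "02139", "dob": "1971-03-04", "gender": "M"},
]

def link_attack(released, public, quasi_ids):
    """Join the released records with a public data set on quasi-identifiers."""
    index = defaultdict(list)
    for rec in public:
        index[tuple(rec[q] for q in quasi_ids)].append(rec["name"])
    matches = {}
    for rec in released:
        names = index[tuple(rec[q] for q in quasi_ids)]
        if len(names) == 1:  # a unique match re-identifies the individual
            matches[names[0]] = rec["diagnosis"]
    return matches

print(link_attack(medical, voters, ["zip", "dob", "gender"]))
```

Because every quasi-identifier combination here is unique, both individuals are re-identified together with their diagnoses.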

The k-anonymity privacy model was first proposed by Latanya Sweeney in 1998 in her paper, K-Anonymity: A Model for Protecting Privacy. On dataprivacylab.org, Sweeney defines k-anonymity as follows:

Consider a data holder, such as a hospital or a bank, that has a privately held collection of person-specific, field structured data. Suppose the data holder wants to share a version of the data with researchers. How can a data holder release a version of its private data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful? The solution provided in this paper includes a formal protection model named k-anonymity and a set of accompanying policies for deployment. A release provides k-anonymity protection if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appears in the release.

Example of 2-Anonymization
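The definition above can be expressed as a simple check: a release is k-anonymous if every combination of quasi-identifier values appears in at least k records. A minimal sketch, using made-up rows where ages are generalized to ranges and zip codes partially masked:

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """True iff every quasi-identifier value combination occurs >= k times."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical 2-anonymized release: each (age range, masked zip) pair
# appears in two records, so no record is distinguishable from fewer
# than k-1 = 1 other record.
rows = [
    {"age": "20-40", "zip": "021**", "disease": "flu"},
    {"age": "20-40", "zip": "021**", "disease": "asthma"},
    {"age": "40-60", "zip": "130**", "disease": "flu"},
    {"age": "40-60", "zip": "130**", "disease": "cancer"},
]
print(is_k_anonymous(rows, ["age", "zip"], 2))  # True
print(is_k_anonymous(rows, ["age", "zip"], 3))  # False
```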

K-anonymity can be achieved by either generalization or suppression.

Generalization - Taking specific information and making it less specific while keeping it accurate. For example, changing age 7 into the age group 0–20.

Suppression - Completely omitting the information. For example, omitting the gender attribute for a particular group.

In simple terms, k-anonymization either transforms quasi-identifiers into less accurate ones or omits them, so that at least k people share the same quasi-identifier values. By achieving k-anonymity, we reduce the possibility of uniquely identifying an individual through a link attack. Obviously, for k-anonymity to be achieved, there must be at least k individuals in the data set. K-anonymity techniques come with the following conditions.
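The two operations above can be sketched in a few lines of Python. The records and the bucket width are illustrative assumptions: ages are generalized into 20-year ranges (matching the 0–20 example), and the gender column is suppressed entirely.

```python
def generalize_age(age, width=20):
    """Generalization: replace an exact age with a coarse range, e.g. 7 -> '0-20'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width}"

def anonymize(records):
    """Apply generalization to age and suppression to gender."""
    out = []
    for r in records:
        out.append({
            "age": generalize_age(r["age"]),  # generalized quasi-identifier
            # "gender" is suppressed: omitted entirely from the release
            "disease": r["disease"],
        })
    return out

# Hypothetical raw data; after the transform, each age range covers two
# records, so the release is 2-anonymous with respect to age.
raw = [
    {"age": 7,  "gender": "F", "disease": "flu"},
    {"age": 12, "gender": "M", "disease": "asthma"},
    {"age": 25, "gender": "M", "disease": "flu"},
    {"age": 33, "gender": "F", "disease": "cancer"},
]
print(anonymize(raw))
```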

1. Sensitive columns must not reveal information that was redacted in quasi-identifiers.

For example, diseases specific to men or women should not reveal a redacted gender attribute.

2. Values in the sensitive columns must not all be equal within a group.

Otherwise, the data set is vulnerable to a homogeneity attack, in which it is enough to find the group of records an individual belongs to. For example, if all individuals in the 0–20 age group have diarrhea, and the attacker knows that James is between 0 and 20 and is in this data set, then the attacker knows that James has diarrhea. Techniques like l-diversity and t-closeness constrain the diversity of sensitive values among any k matching records.
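A basic (distinct) l-diversity check illustrates the homogeneity problem; the rows below are hypothetical and mirror the James example, where a group is k-anonymous yet shares a single sensitive value.

```python
from collections import defaultdict

def is_l_diverse(records, quasi_ids, sensitive, l):
    """Distinct l-diversity: every quasi-identifier group must contain
    at least l distinct values of the sensitive attribute."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[q] for q in quasi_ids)].add(r[sensitive])
    return all(len(vals) >= l for vals in groups.values())

# 2-anonymous on age, but NOT 2-diverse: everyone aged 0-20 has the
# same diagnosis, so membership in that group alone reveals it.
rows = [
    {"age": "0-20",  "disease": "diarrhea"},
    {"age": "0-20",  "disease": "diarrhea"},
    {"age": "20-40", "disease": "flu"},
    {"age": "20-40", "disease": "cancer"},
]
print(is_l_diverse(rows, ["age"], "disease", 2))  # False
```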

3. The dimensionality of the data must be sufficiently low.

It is much harder to achieve the same privacy guarantee on high-dimensional data as on low-dimensional data. For example, data types such as location data contain many data points per individual, and stringing together multiple points can be enough to identify a person.

Still, k-anonymity is a good privacy model when applied correctly and used with the necessary safeguards.


Thamindu Aluthwala

Software Engineer @ WSO2 | CSE Undergraduate @ University of Moratuwa