Page Title Here

K-anonymity

Definition and Initial Purpose K-anonymity was developed by computer scientist Latanya Sweeney following her demonstration of how she could re-identify supposedly anonymized medical records released by the State of Massachusetts. By linking public medical records with voter registration records using quasi-identifiers like age, zip code, and gender, she was able to single out individuals, including the governor’s medical records (p 54, Jarmul (2023)). The principle of k-anonymity attempts to mitigate such linkage attacks by ensuring that in any released dataset, each record is indistinguishable from at least k-1 other records concerning a set of “quasi-identifiers” (attributes that, when combined, can uniquely identify an individual). This means grouping people with similar sensitive attributes, and not releasing groups that have fewer than k people.

An example of K-anonnymity is given below.

Let’s generate some fake data:

data <- data.frame(
  age = c(34, 35, 36, 36, 36, 40, 40, 41, 42, 42),
  gender = c("M", "M", "M", "F", "F", "M", "F", "F", "M", "M"),
  zipcode = c("12345", "12345", "12345", "12345", "12345", "67890", "67890", "67890", "67890", "67890"),
  diagnosis = c("Flu", "Cold", "Flu", "Cold", "Allergy", "HIV", "HIV", "Flu", "Cold", "Allergy")
)

Raw data example
age	gender	zipcode	diagnosis
34	M	12345	Flu
35	M	12345	Cold
36	M	12345	Flu
36	F	12345	Cold
36	F	12345	Allergy
40	M	67890	HIV

Differential privacy (Optional sub-topic - short & collapsable)

Jarmul, Katharine. 2023. Practical Data Privacy. " O’Reilly Media, Inc.".