K-anonymity

Definition and Initial Purpose

K-anonymity was developed by computer scientist Latanya Sweeney following her demonstration that she could re-identify supposedly anonymized medical records released by the State of Massachusetts. By linking the released medical records with public voter registration records using quasi-identifiers like age, zip code, and gender, she was able to single out individuals and find their records, including the governor’s (p 54, Jarmul (2023)). The principle of k-anonymity attempts to mitigate such linkage attacks by ensuring that in any released dataset, each record is indistinguishable from at least k-1 other records with respect to a set of “quasi-identifiers” (attributes that, when combined, can uniquely identify an individual). In practice this means grouping people with the same or similar quasi-identifier values, typically by generalizing those values (for example, replacing exact ages with age ranges), and not releasing groups that have fewer than k people.
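The definition can be turned into a simple check: count how many records share each combination of quasi-identifier values and compare the smallest count with k. The following is a minimal sketch in base R, not a reference implementation; the helper names `min_group_size` and `is_k_anonymous` and the `toy` table are hypothetical and chosen only for illustration.

```r
# Hypothetical helper: size of the smallest group of records that share
# the same combination of quasi-identifier values.
min_group_size <- function(df, quasi_identifiers) {
  # Build one key per row from the quasi-identifier columns
  key <- do.call(paste, c(df[quasi_identifiers], sep = "|"))
  # Count rows per key and return the smallest count
  min(table(key))
}

# Hypothetical helper: TRUE if every group has at least k records.
is_k_anonymous <- function(df, quasi_identifiers, k) {
  min_group_size(df, quasi_identifiers) >= k
}

# Toy table with made-up, already generalized values
toy <- data.frame(
  age_range = c("30-39", "30-39", "40-49", "40-49", "40-49"),
  zipcode   = c("123**", "123**", "678**", "678**", "678**"),
  diagnosis = c("Flu", "Cold", "Flu", "HIV", "Cold")
)

is_k_anonymous(toy, c("age_range", "zipcode"), k = 2)  # TRUE
is_k_anonymous(toy, c("age_range", "zipcode"), k = 3)  # FALSE
```

The sensitive column (`diagnosis`) is deliberately left out of the check: k-anonymity only constrains the quasi-identifiers.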

An example of k-anonymity is given below.

Let’s generate some fake data:

data <- data.frame(
  age = c(34, 35, 36, 36, 36, 40, 40, 41, 42, 42),
  gender = c("M", "M", "M", "F", "F", "M", "F", "F", "M", "M"),
  zipcode = c("12345", "12345", "12345", "12345", "12345", "67890", "67890", "67890", "67890", "67890"),
  diagnosis = c("Flu", "Cold", "Flu", "Cold", "Allergy", "HIV", "HIV", "Flu", "Cold", "Allergy")
)
Raw data example
age  gender  zipcode  diagnosis
34   M       12345    Flu
35   M       12345    Cold
36   M       12345    Flu
36   F       12345    Cold
36   F       12345    Allergy
40   M       67890    HIV
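To see how k-anonymity applies to this table, one possible next step is to measure the group sizes in the raw data, generalize age into ten-year bands, and then suppress any group that is still smaller than k. The sketch below uses base R only; the helper `group_size`, the choice of ten-year bands, and k = 3 are assumptions made for illustration, and the data frame is repeated so the block runs on its own.

```r
# Same made-up data as above, repeated so this sketch is self-contained
data <- data.frame(
  age = c(34, 35, 36, 36, 36, 40, 40, 41, 42, 42),
  gender = c("M", "M", "M", "F", "F", "M", "F", "F", "M", "M"),
  zipcode = c("12345", "12345", "12345", "12345", "12345",
              "67890", "67890", "67890", "67890", "67890"),
  diagnosis = c("Flu", "Cold", "Flu", "Cold", "Allergy",
                "HIV", "HIV", "Flu", "Cold", "Allergy")
)

quasi_identifiers <- c("age", "gender", "zipcode")

# Hypothetical helper: for each row, the number of rows that share
# its combination of quasi-identifier values.
group_size <- function(df, cols) {
  key <- do.call(paste, c(df[cols], sep = "|"))
  as.vector(table(key)[key])
}

# Raw data: several rows are unique on (age, gender, zipcode)
min(group_size(data, quasi_identifiers))  # 1

# Generalization: replace exact age with a ten-year band, e.g. 34 -> "30-39"
generalized <- data
decade <- floor(data$age / 10) * 10
generalized$age <- paste0(decade, "-", decade + 9)

# Every group now has at least two rows, so the table is 2-anonymous
min(group_size(generalized, quasi_identifiers))  # 2

# Suppression: to reach k = 3, drop the groups that are still too small
k <- 3
released <- generalized[group_size(generalized, quasi_identifiers) >= k, ]
nrow(released)  # 6
```

Generalization alone makes this table 2-anonymous; reaching k = 3 costs four of the ten rows, which illustrates the trade-off between a larger k and the amount of data that can be released.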

Differential privacy (Optional sub-topic - short & collapsible)

Jarmul, Katharine. 2023. Practical Data Privacy. O’Reilly Media, Inc.