data <- data.frame(
  age = c(34, 35, 36, 36, 36, 40, 40, 41, 42, 42),
  gender = c("M", "M", "M", "F", "F", "M", "F", "F", "M", "M"),
  zipcode = c("12345", "12345", "12345", "12345", "12345", "67890", "67890", "67890", "67890", "67890"),
  diagnosis = c("Flu", "Cold", "Flu", "Cold", "Allergy", "HIV", "HIV", "Flu", "Cold", "Allergy")
)
K-anonymity
Definition and Initial Purpose

K-anonymity was developed by computer scientist Latanya Sweeney after she demonstrated that she could re-identify supposedly anonymized medical records released by the State of Massachusetts. By linking the released medical records with voter registration records on quasi-identifiers such as age, zip code, and gender, she was able to single out individuals, including the governor (Jarmul 2023, p. 54). K-anonymity attempts to mitigate such linkage attacks by ensuring that, in any released dataset, each record is indistinguishable from at least k-1 other records with respect to a set of “quasi-identifiers” (attributes that, when combined, can uniquely identify an individual). In practice, this means grouping people who share the same, often generalized, quasi-identifier values and not releasing groups that contain fewer than k people.
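The definition can be turned into a simple check: count how many records share each combination of quasi-identifier values and compare the smallest count with k. The following is a minimal base-R sketch under that reading. The helper names `smallest_group` and `is_k_anonymous`, the `toy` data frame, and its `age_band` column are hypothetical and used only for illustration.

```r
# Hypothetical helper: size of the smallest group of records that share
# the same combination of quasi-identifier values.
smallest_group <- function(df, quasi_identifiers) {
  # One factor level per observed combination of quasi-identifier values
  groups <- interaction(df[quasi_identifiers], drop = TRUE)
  min(table(groups))
}

# Hypothetical helper: TRUE if every record is indistinguishable from at
# least k - 1 others on the quasi-identifiers.
is_k_anonymous <- function(df, quasi_identifiers, k) {
  smallest_group(df, quasi_identifiers) >= k
}

# Toy data (made up for this sketch): ages already generalized into bands
toy <- data.frame(
  age_band  = c("30-39", "30-39", "30-39", "40-49", "40-49"),
  zipcode   = c("12345", "12345", "12345", "67890", "67890"),
  diagnosis = c("Flu", "Cold", "Flu", "HIV", "Cold")
)
qi <- c("age_band", "zipcode")

smallest_group(toy, qi)         # 2
is_k_anonymous(toy, qi, k = 2)  # TRUE
is_k_anonymous(toy, qi, k = 3)  # FALSE
```

The sensitive column (`diagnosis`) is deliberately left out of `qi`: the check only concerns the attributes an attacker could use for linkage.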
An example of k-anonymity is given below.
Let’s generate some fake data:
age | gender | zipcode | diagnosis |
---|---|---|---|
34 | M | 12345 | Flu |
35 | M | 12345 | Cold |
36 | M | 12345 | Flu |
36 | F | 12345 | Cold |
36 | F | 12345 | Allergy |
40 | M | 67890 | HIV |