data <- data.frame(
age = c(34, 35, 36, 36, 36, 40, 40, 41, 42, 42),
gender = c("M", "M", "M", "F", "F", "M", "F", "F", "M", "M"),
zipcode = c(
"12345",
"12345",
"12345",
"12345",
"12345",
"67890",
"67890",
"67890",
"67890",
"67890"
),
diagnosis = c(
"Flu",
"Cold",
"Flu",
"Cold",
"Allergy",
"HIV",
"HIV",
"Flu",
"Cold",
"Allergy"
)
)

Planning Anonymization
Explain where we are in the research workflow at this point
Anonymizing data, in most cases, means masking or obscuring values, so there is always some loss of information. However, when sharing data, we need to preserve their utility; otherwise, they are rendered useless.
The goal is to use the appropriate anonymization techniques to achieve an acceptable level of risk for each individual while, at the same time, keeping utility as high as possible (see also BALANCING DATA PROTECTION AND OPENNESS).
To achieve this, before starting the anonymization process, we will assess the level of privacy risk in our data and investigate potential attack scenarios.
The sdcMicro Package
To that end (and for most of the coming exercises), we will use the R package sdcMicro and the test dataset it contains. You already learned about these data in the last exercise (add link to exercise).
sdcMicro is an R package specialized in the anonymization of microdata, i.e., datasets containing records on individual units. sdcMicro offers both a command-line interface and a graphical user interface (GUI).
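As a first taste of the command-line workflow, here is a hedged sketch of building an sdcMicro problem object from the example data defined above. The choice of key variables (`age`, `gender`, `zipcode`) is an assumption made for illustration; in practice, you would choose the quasi-identifiers that fit your attack scenarios.

```r
# Sketch: build an sdcMicro object, declaring the quasi-identifiers
# (variables an attacker could plausibly know) as key variables.
library(sdcMicro)

sdc <- createSdcObj(
  dat     = data,
  keyVars = c("age", "gender", "zipcode")  # assumed quasi-identifiers
)

print(sdc, type = "risk")  # summary of re-identification risk
```

The resulting object carries the data together with its risk and utility measures, so all later anonymization steps can operate on it directly.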
Let’s generate some fake data: (change when exercises are finalized; I’ll probably use the test data provided by the package)
#| echo: false
knitr::kable(head(data), caption = "Raw data example")

| age | gender | zipcode | diagnosis |
|---|---|---|---|
| 34 | M | 12345 | Flu |
| 35 | M | 12345 | Cold |
| 36 | M | 12345 | Flu |
| 36 | F | 12345 | Cold |
| 36 | F | 12345 | Allergy |
| 40 | M | 67890 | HIV |
Identifying Privacy Risks
Include exercising on identifying possible risks within the data
Include examples for attack scenarios
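As one concrete illustration of such a scenario, sample uniqueness can be checked directly in base R: any record with a unique combination of quasi-identifiers could be singled out by an attacker who knows those attributes. The choice of `age`, `gender`, and `zipcode` as quasi-identifiers is an assumption for this example.

```r
# Count how often each combination of quasi-identifiers occurs in the data.
keys   <- interaction(data$age, data$gender, data$zipcode, drop = TRUE)
counts <- table(keys)

# Records whose combination occurs exactly once are sample-unique and
# therefore at the highest risk of re-identification.
sum(counts == 1)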
Measuring Privacy Risks
Explain process of assessing the risk with sdcMicro
Include exercise on applying this to the example data
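A hedged sketch of what this assessment could look like, assuming the example data frame from above and an installed sdcMicro package; the slot names below follow the structure of the sdcMicro object, but verify them against the package documentation.

```r
library(sdcMicro)

# Build the sdc object with assumed quasi-identifiers as key variables.
sdc <- createSdcObj(dat = data, keyVars = c("age", "gender", "zipcode"))

# Per-record risk measures: estimated re-identification risk plus the
# sample (fk) and estimated population (Fk) frequencies of each key.
head(sdc@risk$individual)

# Global risk: the expected proportion of re-identifications in the file.
sdc@risk$global$risk
```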
At this point, you decide whether the level of privacy risk is acceptable. If it is, you do not need anonymization and can skip ahead to preparing your data for publication.
Learning Objective
- After completing this part of the tutorial, you will be able to calculate individual privacy risks for categorical key variables by applying k-anonymity.
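To make the k-anonymity idea concrete: a dataset is k-anonymous when every record shares its combination of key variables with at least k − 1 other records. A minimal base-R sketch, again assuming `age`, `gender`, and `zipcode` as the key variables:

```r
# For each record, count how many records (including itself) share its
# key-variable combination; the dataset's k is the minimum of these counts.
key_counts <- ave(seq_len(nrow(data)),
                  data$age, data$gender, data$zipcode,
                  FUN = length)

min(key_counts)  # the k this dataset currently achieves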
Exercises
- calculate the privacy level for certain variables
Resources, Links, Examples
To Do List
- Decide on federated learning