Planning Anonymization

Explain a bit about where we are in the research workflow at this point

Anonymization, in most cases, means covering or obscuring data, so there is always some loss of information. When sharing data, however, we need to preserve their utility; otherwise they are rendered useless.

The goal is to use the appropriate anonymization techniques to achieve an acceptable level of risk for each individual while, at the same time, keeping utility as high as possible (see also BALANCING DATA PROTECTION AND OPENNESS).

To achieve this, before starting the anonymization process, we will assess the level of privacy risk in our data and investigate potential attack scenarios.

The sdcMicro Package

To that end (and for most of the coming exercises), we will use the sdcMicro package in R and the test data set it contains. You already learned about this data in the previous exercise (add link to exercise).

sdcMicro is an R package specialized in the anonymization of microdata, i.e. data at the level of individual respondents. It offers both a command-line interface and a graphical user interface (GUI).
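
If sdcMicro is not yet installed, a minimal sketch for getting started looks like this; launching the GUI with sdcApp() is shown only as an optional step:

# Install sdcMicro once, then load it for the current session
install.packages("sdcMicro")
library(sdcMicro)

# Optional: start the browser-based graphical user interface
# sdcApp()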

Let’s generate some fake data (change when exercises are finalized; I’ll probably use the test data provided by the package):

# Create a small fake data set: three quasi-identifiers (age, gender, zipcode)
# and one sensitive variable (diagnosis)
data <- data.frame(
  age = c(34, 35, 36, 36, 36, 40, 40, 41, 42, 42),
  gender = c("M", "M", "M", "F", "F", "M", "F", "F", "M", "M"),
  zipcode = rep(c("12345", "67890"), each = 5),
  diagnosis = c("Flu", "Cold", "Flu", "Cold", "Allergy",
                "HIV", "HIV", "Flu", "Cold", "Allergy")
)

# Show the first rows as a formatted table
knitr::kable(head(data), caption = "Raw data example")
Raw data example

age   gender   zipcode   diagnosis
34    M        12345     Flu
35    M        12345     Cold
36    M        12345     Flu
36    F        12345     Cold
36    F        12345     Allergy
40    M        67890     HIV

Identifying Privacy Risks

Include an exercise on identifying possible risks within the data

Include examples of attack scenarios
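
As one illustration of such a scenario on the fake data above (the attacker's background knowledge is invented for this sketch): an attacker who knows from another source that a 34-year-old man living in zip code 12345 is in the data can link that knowledge to the released records and learn his diagnosis.

# Background knowledge an attacker might hold about one participant
# (hypothetical values, chosen to match the fake data above)
attacker_knowledge <- data.frame(age = 34, gender = "M", zipcode = "12345")

# Linking on the quasi-identifiers singles out exactly one record,
# disclosing the sensitive diagnosis
merge(data, attacker_knowledge, by = c("age", "gender", "zipcode"))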

Measuring Privacy Risks

Explain the process of assessing risk with sdcMicro

Include an exercise on applying this to the example data
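
As a preview, a minimal sketch of this step with sdcMicro could look as follows; treating age, gender and zipcode as the categorical key variables is an assumption made for this example:

# Build an sdcMicro object, declaring the quasi-identifiers (key variables)
sdc <- createSdcObj(dat = data, keyVars = c("age", "gender", "zipcode"))

# Summaries of the estimated disclosure risk and of k-anonymity violations
print(sdc, type = "risk")
print(sdc, type = "kAnon")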

At this point, you decide whether the level of privacy risk is acceptable. If it is, you do not need anonymization and can skip ahead to preparing your data for publication.

Learning Objective

  • After completing this part of the tutorial, you will be able to calculate individual privacy risks for categorical key variables (quasi-identifiers) by utilizing k-anonymity (see the sketch below).
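
A record satisfies k-anonymity if at least k records in the data share its combination of key variables. A minimal base-R sketch of this check on the fake data, again assuming age, gender and zipcode as the key variables:

# fk: how many records share each record's combination of key variables
fk <- ave(rep(1L, nrow(data)), data$age, data$gender, data$zipcode, FUN = length)

# Records with fk < 2 violate 2-anonymity, i.e. they are unique in the sample
data[fk < 2, ]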

Exercises

  • Calculate the privacy risk level for selected variables

To Do List

  • Decide on federated learning