Assessing Privacy Risks

In the previous chapters, we covered the foundations: what counts as personal data, what the GDPR requires, what anonymization mechanisms exist, and what privacy risks look like. Now we move to the practical part - actually anonymizing data. But before we jump into specific techniques, we first need to assess how much risk our data carries. That is what this chapter is about.

In most cases, anonymizing data means suppressing or coarsening it, so there is always some loss of information. At the same time, shared data must retain enough utility; otherwise, it is rendered useless.

The goal is to use the appropriate anonymization techniques to achieve an acceptable level of risk for each individual, while at the same time, keeping utility as high as possible (see also BALANCING DATA PROTECTION AND OPENNESS).

To achieve that, before starting the anonymization process, we will assess the level of privacy risk in our data and investigate potential risks, taking what we learned in the last chapter and putting it into practice.

Identifying Privacy Risks

Take another look at the dataset from Chapter 1.3, the version without direct identifiers. Try to answer the following questions:

  • Which combinations of indirect identifiers could potentially single out an individual?
  • What attack scenarios are plausible? Think about who might try to re-identify someone in this dataset and what external information they might have access to.
  • What type of disclosure is most likely for this dataset - identity disclosure, attribute disclosure, or membership disclosure?
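To explore the first question hands-on, you can count how often each combination of indirect identifiers occurs: any combination that occurs only once singles out a person. This is a hypothetical sketch; it assumes the de-identified dataset from Chapter 1.3 is loaded as `data_withoutdirectidentifiers` and that `dplyr` is installed.

```r
library(dplyr)

# Count how often each combination of selected indirect identifiers
# occurs; rows with a count (fk) of 1 are unique and single out a person.
data_withoutdirectidentifiers |>
  count(gender, age, plz, name = "fk") |>
  filter(fk == 1)
```

Try different combinations of identifiers; adding variables such as job_title will typically make even more combinations unique.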


Here are a few concrete attack scenarios to consider, based on our dataset of 200 individuals in Germany (with variables: gender, age, education, plz, income, religion, job_title, years_in_job, and political opinion variables):

  • A colleague knows that a specific person lives in a particular postal code area, works as a “Data Scientist,” and is 34 years old. If the combination of plz, job_title, and age is unique in the dataset, that colleague can identify the person and look up their income and political opinions. This is identity disclosure. In this scenario, the attacker knows that their colleague participated in the study. In practice, when calculating risks, we usually assume no such knowledge (see the note on participation knowledge below).
  • A journalist cross-references the dataset with publicly available professional directories or LinkedIn profiles. By matching job_title, age, and education level, they could narrow down candidates and potentially identify individuals - especially in smaller postal code areas where few people share the same profession.
  • A neighbor knows that someone in their building participated in the study. They know the person’s approximate age, gender, and religion. Even if they cannot pinpoint the exact row, they might learn something new about that person’s income or political views. This is attribute disclosure - the attacker does not need to find the exact record, just a group of records that all share the same sensitive value.

These are just a few examples. In general, the existence and likelihood of risks depends heavily on the variables in the dataset, the sensitivity of the data, the recruitment strategy, and many other factors.

Note: Participation Knowledge

A central factor when judging whether re-identification is “reasonably likely” is whether an attacker knows that a specific person is in the dataset (“Teilnahmekenntnis” / participation knowledge). The re-identification risk rises sharply when attackers know who participated in a survey. Klaus Pforr puts it pointedly (https://sciences.social/@klauspforr/110775060196195399): “Never post your survey participation in a run-of-the-mill survey on social media because the whole anonymisation concept most of time assumes negligble prior probs for these. Whenever you see this in the wild, send a direct message that people should refrain from this behavior in the future”

In practice, large surveys typically assume no participation knowledge when evaluating their level of anonymity (as Johannes Breuer, then at GESIS, confirmed). This assumption matters a great deal for k-anonymity: with participation knowledge, an attacker only has to distinguish among the records in the sample rather than among everyone in the population.
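The effect of participation knowledge can be illustrated with a back-of-the-envelope calculation. The numbers below are made up purely for illustration:

```r
# Illustrative only: re-identification probability for one record under
# the two assumptions. fk and Fk are hypothetical frequency counts.
fk <- 2    # records in the SAMPLE sharing the key-variable combination
Fk <- 40   # estimated matches in the POPULATION

# Without participation knowledge, the attacker must consider everyone
# in the population who matches the combination:
p_no_knowledge <- 1 / Fk   # 0.025

# With participation knowledge, only the sample matters:
p_knowledge <- 1 / fk      # 0.5
```

The same record goes from a 2.5% to a 50% re-identification probability once the attacker knows the person is in the data.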

Measuring Privacy Risks

The sdcMicro Package

To formally measure privacy risks (and for most of the following exercises), we will use the package sdcMicro in R.

sdcMicro is an R package specialized in the anonymization of microdata (datasets containing records on individual persons). It offers both a command-line interface and a graphical user interface (GUI).
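Installation and loading follow the usual R conventions (a short sketch; the GUI is launched with sdcApp()):

```r
# Install once from CRAN, then load the package
install.packages("sdcMicro")
library(sdcMicro)

# Optional: launch the browser-based GUI
# sdcApp()
```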


Using sdcMicro To Assess Privacy Risks

Let’s start by creating an sdcMicro object of class sdcMicroObj. This object becomes the working environment for all subsequent anonymization and risk assessment steps. It stores

  • the original dataset,
  • the roles of variables within the dataset,
  • risk measures,
  • anonymized versions of the variables,
  • utility measures, and
  • the history of modifications.

sdcMicro updates risk and utility estimates after each anonymization step.

For this object, we start with defining the following:

  • dat: the dataset we are working with

  • keyVars: categorical indirect identifiers (i.e., variables with discrete categories) whose risk we want to assess (e.g., age, gender)

  • numVars: continuous (i.e., numeric) indirect identifiers (e.g., income)

sdcMicro needs, at a minimum, the data and the categorical indirect identifiers to assess risks. We can later add more information to the object.

Note: More Functions of sdcMicro
  • extractManipData(sdcObject)
    Extracts the anonymized dataset.

  • undolast(sdcObject)
    Reverts the last anonymization step.

  • get.sdcMicroObj(sdcObject, type)
    Accesses internal components of the object (e.g., type = "risk").
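A typical round trip with these helpers might look like this (a sketch; it assumes an sdcMicro object named `sdc_data`, like the one created in the exercise below):

```r
# Sketch: common helper calls on an existing sdcMicro object `sdc_data`
anon_df  <- extractManipData(sdc_data)                 # anonymized data
sdc_data <- undolast(sdc_data)                         # revert last step
risk     <- get.sdcMicroObj(sdc_data, type = "risk")   # internal risk slot
```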


Exercise: Assessing Privacy Risks With sdcMicro

  1. Create an object within sdcMicro. Inspect the object.
  2. What k-anonymity level is currently reached?

Create the object. Provide gender, age, education, and postal code as the categorical variables of interest (keyVars). The only numerical variables we are interested in are income and years in job (numVars).

library(sdcMicro)

sdc_data <- createSdcObj(
  dat = data_withoutdirectidentifiers, 
  keyVars = c("gender", "age", "education", "plz"),
  numVars = c("income", "years_in_job")
  )  

sdc_data # Summarizes the risks overall
The input dataset consists of 200 rows and 13 variables.
  --> Categorical key variables: gender, age, education, plz
  --> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:

Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
 Key Variable  Number of categories   Mean size         Size of smallest (>0)
 gender                 3    (3)      66.667 (66.667)       10   (10)
 age                   52   (52)       3.846  (3.846)        1    (1)
 education              5    (5)      40.000 (40.000)        9    (9)
 plz                  152  (152)       1.316  (1.316)        1    (1)
----------------------------------------------------------------------
Infos on 2/3-Anonymity:

Number of observations violating
  - 2-anonymity: 198 (99.000%)
  - 3-anonymity: 200 (100.000%)
  - 5-anonymity: 200 (100.000%)

----------------------------------------------------------------------
Numerical key variables: income, years_in_job

Disclosure risk is currently between [0.00%; 100.00%]

Current Information Loss:
  - IL1: 0.00
  - Difference of Eigenvalues: 0.000%
----------------------------------------------------------------------

sdcMicro shows us the unmodified, original parameters in parentheses. Right now, they are, of course, identical to the current parameters as we have not made any changes yet.

For the categorical key variables, we get information on the number and size of categories. The more categories there are and the fewer entries each category contains, the easier re-identification becomes, because many individuals end up with unique values.

We also get information on the k-anonymity of the combinations of these categorical variables: currently, all observations breach 3- and 5-anonymity. Only two persons in our dataset share the same combination of gender, age, education, and plz; the remaining 198 observations (99%) breach even 2-anonymity, meaning each of them has a unique combination of these demographic variables.

Upon closer inspection, these entries are IDs 94 and 122, who share many demographic details: both are 18 years old, live in Munich at postal code 80636, and have a high school degree. If I knew my male, 18-year-old friend living at 80636 had participated in this study, I could not identify him based on a combination of these data points alone. However, if I knew his job title, I could still identify him. Let’s keep this in mind for further analyses.

knitr::kable(data_withoutdirectidentifiers[data_withoutdirectidentifiers$id %in% c(94, 122), ])
| id | plz | gender | age | income | years_in_job | religion | job_title | education | pol_immigration | pol_environment | pol_redistribution | pol_eu_integration |
|----|-----|--------|-----|--------|--------------|----------|-----------|-----------|-----------------|-----------------|--------------------|--------------------|
| 94 | 80636 | male | 18 | 38100.28 | 1 | None | Scientist, product/process development | high school | 3 | 3 | 3 | 2 |
| 122 | 80636 | male | 18 | 37003.44 | 1 | None | Sound technician, broadcasting/film/video | high school | 2 | 5 | 2 | 2 |

This means that we currently do not reach even 2-anonymity: apart from these two records, every person has a unique combination of key variables.
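You can cross-check sdcMicro’s k-anonymity report by counting the key-variable combinations directly (a sketch; assumes `dplyr` and the dataset from above):

```r
library(dplyr)

# Recompute the sample frequency fk of each key-variable combination
# and count the observations violating 2-anonymity (fk < 2).
data_withoutdirectidentifiers |>
  add_count(gender, age, education, plz, name = "fk") |>
  summarise(violating_2anonymity = sum(fk < 2))
```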

For the numerical key variables, sdcMicro shows the disclosure risk. This parameter is a relative measure: it reflects how much the values were changed by anonymization. Since we have not modified anything yet, sdcMicro cannot narrow it down and reports the uninformative interval of 0% to 100%.

Other risk estimates

sdcMicro offers even more risk estimates:

# Risk summary
print(sdc_data, type = "risk")
Risk measures:

Number of observations with higher risk than the main part of the data: 0
Expected number of re-identifications: 199.00 (99.50 %)
# Individual risk
individualrisk <- get.sdcMicroObj(sdc_data, "risk")$individual

The individual risk shows a risk estimate for each individual in the dataset. fk is the frequency count in the sample: it shows how many individuals share the same combination of key variables currently defined in our sdcObject. We can see that this is 1 for all participants except IDs 94 and 122 discussed above, where it is 2. Fk is the estimated frequency in the population; it only differs from fk if sampling weights are defined.
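A quick way to inspect these counts (a sketch; the individual risk slot is a matrix whose columns include risk, fk, and Fk):

```r
# Tabulate the sample frequencies: how many records have fk = 1, fk = 2, ...
table(individualrisk[, "fk"])

# Rows whose key-variable combination is shared by at least one other person
which(individualrisk[, "fk"] > 1)
```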

Note: There Is No One Correct Variable Assignment

How you assign variables to keyVars and numVars is a judgment call - there is no universally correct answer. It depends on who you think is likely to try to re-identify someone, and what information they would already have.

The most realistic attack scenario for this dataset is probably someone who knows a study participant personally - a colleague, friend, or family member. Such a person would most likely know basic demographic attributes: the participant’s gender, approximate age, postal code, and level of education. These are the variables an attacker is most plausibly able to use to find a specific record, which is why we treat them as keyVars.

income is placed in numVars because it is less likely to be known in advance - it is something an attacker might learn from the dataset, not something they use to search it.

years_in_job is a borderline case. It behaves like a continuous variable, but a colleague might know roughly how long someone has been in their job. Placing it in keyVars would be the conservative choice. In this case, I consider it less likely to be known, which is why I put it in numVars.

The same logic applies to religion: if a plausible attacker is someone who knows the participant from a religious community, then religion belongs in keyVars, not treated as a sensitive outcome. The variables you choose as keys should reflect the attack scenario you consider most realistic for your specific sample and context.
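If you judged the religious-community scenario to be realistic, the conservative assignment discussed in this note would look like this (a hypothetical alternative, not the assignment used in this chapter):

```r
# Conservative alternative: treat religion and years_in_job as
# additional categorical keys an attacker might already know.
sdc_conservative <- createSdcObj(
  dat     = data_withoutdirectidentifiers,
  keyVars = c("gender", "age", "education", "plz", "religion", "years_in_job"),
  numVars = c("income")
)
```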

At this point, you decide whether the level of privacy risk is acceptable. If it is, you do not need anonymization and can skip ahead to preparing your data for publication.

Save the sdcObject for further anonymization steps.

saveRDS(sdc_data, "../sdc_micro.rds")

If the risk is not acceptable - as is the case here - we will start with the actual anonymization process. To that end, I will present you several techniques over the next few chapters.

Learning Objective

  • After completing this part of the tutorial, you will be able to calculate individual privacy risks for categorical key variables by utilizing k-anonymity.

Exercises

  • Calculate the privacy risk for a given set of key variables.