Explain a bit about where we are in the workflow of research now
Anonymization, in most cases, means covering or obscuring data, so there is always some loss of information. When sharing data, however, we need to preserve their utility; otherwise, they are rendered useless.
The goal is to use appropriate anonymization techniques to achieve an acceptable level of risk for each individual while keeping utility as high as possible (see also BALANCING DATA PROTECTION AND OPENNESS).
To achieve this, before starting the anonymization process, we assess the level of privacy risk in our data, putting what we learned in the last chapter into practice.
Identifying Privacy Risks
Include an exercise on identifying possible risks within the data
To that end (and for most of the following exercises), we will use the R package sdcMicro and the test dataset it contains. You already learned about this dataset in the last exercise (add link to exercise).
sdcMicro is an R package specialized in the anonymization of microdata (individual-level data). It offers both a command-line interface and a graphical user interface (GUI).
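The GUI is launched from within R; a minimal sketch, assuming the package is available from CRAN:

```r
# install.packages("sdcMicro")  # once, if the package is not yet installed
library(sdcMicro)

# Opens the graphical user interface in the default web browser
sdcApp()
```

In this tutorial we stick to the command-line interface, which makes the individual steps easier to document and reproduce.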
Include more info on functions in sdcMicro?
Using sdcMicro To Assess Privacy Risks
Let’s start by creating an sdcMicro object of class sdcMicroObj. This object becomes the working environment for all subsequent risk assessment and anonymization steps. It stores
the original dataset,
the roles of variables within the dataset,
risk measures,
anonymized versions of the variables,
utility measures, and
the history of modifications.
sdcMicro updates the risk and utility estimates after each anonymization step.
For this object, we start by defining the following:
dat: the dataset we are working with
keyVars: the categorical indirect identifiers (quasi-identifiers) whose disclosure risk we want to assess (e.g., age, gender)
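With these ingredients, the object is created with createSdcObj(). A minimal sketch, assuming the test dataset has been loaded into a data frame called dat and that the variable names match the summary output below; the numVars argument (not listed above) registers numerical key variables such as income:

```r
library(sdcMicro)

sdc_data <- createSdcObj(
  dat     = dat,                              # the dataset we are working with
  keyVars = c("gender", "age", "education",   # categorical quasi-identifiers
              "plz", "years_in_job"),
  numVars = "income"                          # numerical key variable
)

# Printing the object reports category frequencies, k-anonymity
# violations, and disclosure risk for the numerical key variables
print(sdc_data)
```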
The input dataset consists of 200 rows and 13 variables.
--> Categorical key variables: gender, age, education, plz, years_in_job
--> Numerical key variables: income
----------------------------------------------------------------------
Information on categorical key variables:
Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
Infos on 2/3-Anonymity:
Number of observations violating
- 2-anonymity: 200 (100.000%)
- 3-anonymity: 200 (100.000%)
- 5-anonymity: 200 (100.000%)
----------------------------------------------------------------------
Numerical key variables: income
Disclosure risk is currently between [0.00%; 100.00%]
Current Information Loss:
- IL1: 0.00
- Difference of Eigenvalues: 0.000%
----------------------------------------------------------------------
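For intuition, the k-anonymity counts reported above can be reproduced by hand: a record violates k-anonymity if its combination of key-variable values occurs fewer than k times in the data. A minimal sketch, assuming the raw data frame is called dat and the variable names match the output above:

```r
# Sample frequency fk: how often each record's combination of
# key-variable values occurs in the data
keys  <- c("gender", "age", "education", "plz", "years_in_job")
combo <- do.call(paste, c(dat[keys], sep = "|"))
fk    <- as.vector(table(combo)[combo])

# A record violates k-anonymity if its combination occurs fewer than k times
sum(fk < 2)  # observations violating 2-anonymity
sum(fk < 3)  # observations violating 3-anonymity
sum(fk < 5)  # observations violating 5-anonymity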
# Risks summary
print(sdc_data, type = "risk")
Risk measures:
Number of observations with higher risk than the main part of the data: 0
Expected number of re-identifications: 200.00 (100.00 %)
# Individual risk
individualrisk <- get.sdcMicroObj(sdc_data, "risk")$individual
# What is fk
# Continuous risk
print(sdc_data, type = "numrisk")  # Same as overall summary
Numerical key variables: income
Disclosure risk is currently between [0.00%; 100.00%]
Current Information Loss:
- IL1: 0.00
- Difference of Eigenvalues: 0.000%
----------------------------------------------------------------------
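The individual risk measures retrieved above form a matrix with one row per record; in sdcMicro its columns include risk (the estimated re-identification risk), fk (the sample frequency of the record's key-variable combination), and Fk (the estimated population frequency). A short sketch, assuming the sdc_data object from above:

```r
risk_ind <- get.sdcMicroObj(sdc_data, "risk")$individual

# First rows: risk, fk (sample frequency), Fk (estimated population frequency)
head(risk_ind)

# Records whose key-variable combination is unique in the sample (fk == 1)
sum(risk_ind[, "fk"] == 1)
```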
At this point, you decide whether the level of privacy risk is acceptable. If it is, you do not need anonymization and can skip ahead to preparing your data for publication.
Save the sdcMicro Object
saveRDS(sdc_data, "../sdc_micro.rds")
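The saved object can be reloaded later to continue the risk assessment or anonymization where you left off (the path mirrors the saveRDS call above):

```r
sdc_data <- readRDS("../sdc_micro.rds")
```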
Learning Objective
After completing this part of the tutorial, you will be able to assess privacy risks in a dataset with sdcMicro, including k-anonymity violations for categorical key variables and disclosure risk intervals for continuous variables.