Planning Anonymization

Explain a bit about where we are in the workflow of research now

Anonymizing data usually means masking or obscuring values, which almost always entails some loss of information. Shared data must nevertheless remain useful; otherwise, there is little point in sharing them.

The goal is to choose anonymization techniques that reduce the risk for each individual to an acceptable level while keeping utility as high as possible (see also BALANCING DATA PROTECTION AND OPENNESS).

To achieve that, before starting the anonymization process, we will assess the level of privacy risk in our data and investigate potential risks, taking what we learned in the last chapter and putting it into practice.

Identifying Privacy Risks

Include exercising on identifying possible risks within the data

Identity disclosure, attribute disclosure, membership disclosure

Include examples for attack scenarios
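As one possible illustration of an attack scenario, the sketch below shows a simple linkage attack in base R. All data are made up: the released table, the external table, and the variable diagnosis are hypothetical and only serve to show how indirect identifiers can connect a name to a record.

```r
# Hypothetical released dataset: names removed, indirect identifiers kept
released <- data.frame(
  gender    = c("f", "m", "f"),
  age       = c(34, 51, 34),
  plz       = c("10115", "80331", "50667"),
  diagnosis = c("asthma", "diabetes", "migraine")
)

# Hypothetical external information an attacker might hold,
# e.g. from a public register or social media
external <- data.frame(
  name   = c("A. Example", "B. Sample"),
  gender = c("f", "m"),
  age    = c(34, 51),
  plz    = c("10115", "80331")
)

# Linkage attack: join both tables on the shared indirect identifiers
linked <- merge(external, released, by = c("gender", "age", "plz"))
print(linked)
```

Each matched row is an identity disclosure (a record is tied to a named person) and, because the sensitive value comes along with it, also an attribute disclosure.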

Measuring Privacy Risks

The sdcMicro Package

To measure privacy risks (and for most of the following exercises), we will use the R package sdcMicro and the test dataset it contains. You already got to know this dataset in the last exercise (add link to exercise).

sdcMicro is an R package for anonymizing microdata, that is, datasets in which each row describes an individual unit such as a person or a household. It offers both a command-line interface and a graphical user interface (GUI).

Include more info on functions in sdcMicro?

Using sdcMicro To Assess Privacy Risks

Let’s start by creating an sdcMicro object, that is, an object of class sdcMicroObj. This object becomes the working environment for all subsequent anonymization and risk assessment steps. It stores

  • the original dataset,
  • the roles of variables within the dataset,
  • risk measures,
  • anonymized versions of the variables,
  • utility measures, and
  • the history of modifications.

sdcMicro updates the risk and utility estimates after each anonymization step.
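The components listed above can be read out of the object with get.sdcMicroObj(). The following is a minimal sketch, assuming the example dataset testdata that ships with sdcMicro; the object name sdc_demo and the choice of key variables are made up for illustration and are not the tutorial's dataset.

```r
library(sdcMicro)

# Example data shipped with sdcMicro (an assumption for this sketch)
data("testdata", package = "sdcMicro")

sdc_demo <- createSdcObj(
  dat     = testdata,
  keyVars = c("urbrur", "water", "sex"),  # illustrative choice of key variables
  numVars = "income"
)

# Original dataset as passed in
orig_data <- get.sdcMicroObj(sdc_demo, "origData")

# Working copy of the key variables; anonymization steps modify this copy
manip_keys <- get.sdcMicroObj(sdc_demo, "manipKeyVars")

# Risk measures: a list with global and individual components
risk <- get.sdcMicroObj(sdc_demo, "risk")
names(risk)

# Utility measures for the numerical key variables
utility <- get.sdcMicroObj(sdc_demo, "utility")
names(utility)
```

Keeping the original data and the manipulated copies side by side is what allows the package to report risk and information loss relative to the unmodified data.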

For this object, we start with defining the following:

  • dat: the dataset we are working with

  • keyVars: categorical indirect identifiers whose re-identification risk we want to assess (e.g., age, gender)

  • numVars: continuous (i.e., numeric) indirect identifiers (e.g., income)

sdcMicro needs at least the data and the categorical indirect identifiers to assess risks. We can add more information to the object later.
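To see why the categorical indirect identifiers are the minimum input, it can help to compute by hand what a k-anonymity check counts: how many records share each combination of key values. The sketch below does this in base R on a small made-up data frame (the values and the name toy are hypothetical); it illustrates the idea and is not sdcMicro's own implementation.

```r
# Made-up toy data, not the tutorial dataset
toy <- data.frame(
  gender    = c("f", "f", "m", "m", "m", "f"),
  age_group = c("20-29", "20-29", "30-39", "30-39", "30-39", "40-49"),
  income    = c(2100, 2500, 3200, 2900, 4100, 3800)
)
key_vars <- c("gender", "age_group")

# One label per record describing its combination of key values
key <- interaction(toy[key_vars], drop = TRUE)

# fk: number of records in the data that share the record's key combination
fk <- ave(seq_len(nrow(toy)), key, FUN = length)

# The dataset is k-anonymous for k = the smallest group size
k_achieved <- min(fk)

# Records violating 2- and 3-anonymity
n_violating_2 <- sum(fk < 2)
n_violating_3 <- sum(fk < 3)

cbind(toy[key_vars], fk)
```

A record with fk = 1 is unique on its key combination, which is why a variable with many categories (such as a postal code) quickly drives the achieved k down to 1.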

Exercise: Assessing Privacy Risks With sdcMicro

  1. Create an sdcMicro object and inspect it.
  2. What k-anonymity level is currently reached?

Solution:

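library(sdcMicro)

# Build the sdcMicro object from the dataset without direct identifiers
sdc_data <- createSdcObj(
  dat = data_withoutdirectidentifiers,
  keyVars = c("gender", "age", "education", "plz", "years_in_job"),  # categorical indirect identifiers
  numVars = "income"  # continuous indirect identifier
)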

# Overall summary
sdc_data
The input dataset consists of 200 rows and 13 variables.
  --> Categorical key variables: gender, age, education, plz, years_in_job
  --> Numerical key variables: income
----------------------------------------------------------------------
Information on categorical key variables:

Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
 Key Variable Number of categories        Mean size          Size of smallest (>0)
       <char>               <char> <char>    <char>   <char>                <char> <char>
       gender                    3    (3)    66.667 (66.667)                    11   (11)
          age                   66   (66)     3.030  (3.030)                     1    (1)
    education                    5    (5)    40.000 (40.000)                     5    (5)
          plz                  161  (161)     1.242  (1.242)                     1    (1)
 years_in_job                   29   (29)     6.897  (6.897)                     1    (1)
----------------------------------------------------------------------
Infos on 2/3-Anonymity:

Number of observations violating
  - 2-anonymity: 200 (100.000%)
  - 3-anonymity: 200 (100.000%)
  - 5-anonymity: 200 (100.000%)

----------------------------------------------------------------------
Numerical key variables: income

Disclosure risk is currently between [0.00%; 100.00%]

Current Information Loss:
  - IL1: 0.00
  - Difference of Eigenvalues: 0.000%
----------------------------------------------------------------------
# Risks summary
print(sdc_data, type = "risk")
Risk measures:

Number of observations with higher risk than the main part of the data: 0
Expected number of re-identifications: 200.00 (100.00 %)
# Individual risk
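# fk is the number of records in the sample that share a record's key combination;
# Fk is the corresponding estimated frequency in the population
individualrisk <- get.sdcMicroObj(sdc_data, "risk")$individual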
# Continuous risk
print(sdc_data, type = "numrisk") # Same as overall summary
Numerical key variables: income

Disclosure risk is currently between [0.00%; 100.00%]

Current Information Loss:
  - IL1: 0.00
  - Difference of Eigenvalues: 0.000%
----------------------------------------------------------------------

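At this point, you decide whether the level of privacy risk is acceptable. If it is, no anonymization is needed and you can skip to preparing your data for publication. In the output above, every observation violates even 2-anonymity, so this dataset clearly needs anonymization.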
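One way to make this decision more concrete is to compare the individual risk table against thresholds you set yourself. The sketch below again uses the example dataset testdata shipped with sdcMicro rather than the tutorial data; the object name sdc_test, the chosen key variables, and the thresholds (k = 3, individual risk above 0.1) are assumptions for illustration, not recommendations.

```r
library(sdcMicro)

# Example data shipped with sdcMicro (an assumption for this sketch)
data("testdata", package = "sdcMicro")

sdc_test <- createSdcObj(
  dat     = testdata,
  keyVars = c("urbrur", "roof", "walls", "water", "electcon", "relat", "sex"),
  numVars = c("expend", "income", "savings")
)

# Individual risk table with one row per record
ind <- get.sdcMicroObj(sdc_test, "risk")$individual
fk  <- ind[, "fk"]

# Achieved k: size of the smallest group of records with identical keys
k_achieved <- min(fk)

# Number of records violating 3-anonymity (hypothetical threshold)
n_violating_3 <- sum(fk < 3)

# Share of records whose individual risk exceeds 0.1 (hypothetical threshold)
share_high_risk <- mean(ind[, "risk"] > 0.1)

c(k = k_achieved, violating_3 = n_violating_3, share_high_risk = share_high_risk)
```

Which thresholds are acceptable depends on the data, the audience, and the access conditions, so the numbers here should be replaced by ones that fit your project.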

Save the sdcMicro Object

saveRDS(sdc_data, "../sdc_micro.rds")
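In a later session, a saved object can be read back with readRDS(). The sketch below shows the round trip with a small stand-in list and a temporary file, both hypothetical, so that it runs without the tutorial data; with the real object you would pass the path used in saveRDS() above.

```r
# Stand-in for the sdcMicro object (hypothetical; any R object works the same way)
obj <- list(note = "stand-in for sdc_data", k = 3)

# Temporary file path used only for this sketch
path <- file.path(tempdir(), "sdc_micro.rds")

# Write the object to disk and read it back
saveRDS(obj, path)
restored <- readRDS(path)

# The restored object should be identical to the saved one
identical(obj, restored)
```

Because readRDS() returns the object instead of restoring it under its old name, you can assign it to whatever name suits the next script.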

Learning Objective

  • After completing this part of the tutorial, you will be able to calculate individual privacy risks for categorical indirect identifiers using k-anonymity.

Exercises

  • calculate the privacy risk for certain variables