Assessing Privacy Risks

In the previous chapters, we covered the foundations: what counts as personal data, what the GDPR requires, what anonymization mechanisms exist, and what privacy risks look like. Now we move to the practical part - actually anonymizing data. But before we jump into specific techniques, we first need to assess how much risk our data carries. That is what this chapter is about.

In most cases, anonymizing data means suppressing or coarsening it, so there is always some loss of information. At the same time, shared data must retain enough utility; otherwise, it is rendered useless.

The goal is to use the appropriate anonymization techniques to achieve an acceptable level of risk for each individual, while at the same time, keeping utility as high as possible (see also BALANCING DATA PROTECTION AND OPENNESS).

To achieve that, before starting the anonymization process, we will assess the level of privacy risk in our data and investigate potential risks, taking what we learned in the last chapter and putting it into practice.

Identifying Privacy Risks

Take another look at the dataset from Chapter 1.3, the version without direct identifiers. Try to answer the following questions:

  • Which combinations of indirect identifiers could potentially single out an individual?
  • What attack scenarios are plausible? Think about who might try to re-identify someone in this dataset and what external information they might have access to.
  • What type of disclosure is most likely for this dataset - identity disclosure, attribute disclosure, or membership disclosure?
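To explore the first question hands-on, you can count how often each combination of indirect identifiers occurs: any combination that occurs only once singles out a person. This is a hypothetical sketch; it assumes the de-identified dataset from Chapter 1.3 is loaded as `data_withoutdirectidentifiers` and that `dplyr` is installed.

```r
library(dplyr)

# Count how often each combination of selected indirect identifiers
# occurs; rows with a count (fk) of 1 are unique and single out a person.
data_withoutdirectidentifiers |>
  count(gender, age, plz, name = "fk") |>
  filter(fk == 1)
```

Try different combinations of identifiers; adding variables such as job_title will typically make even more combinations unique.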


Here are a few concrete attack scenarios to consider, based on our dataset of 200 individuals in Germany (with variables: gender, age, education, plz, income, religion, job_title, years_in_job, and political opinion variables):

  • A colleague knows that a specific person lives in a particular postal code area, works as a “Data Scientist,” and is 34 years old. If the combination of plz, job_title, and age is unique in the dataset, that colleague can identify the person and look up their income and political opinions. This is identity disclosure. In this scenario, the attacker knows that their colleague participated in the study. In practice, when calculating risks, we usually assume no such knowledge (see the note on participation knowledge below).
  • A journalist cross-references the dataset with publicly available professional directories or LinkedIn profiles. By matching job_title, age, and education level, they could narrow down candidates and potentially identify individuals - especially in smaller postal code areas where few people share the same profession.
  • A neighbor knows that someone in their building participated in the study. They know the person’s approximate age, gender, and religion. Even if they cannot pinpoint the exact row, they might learn something new about that person’s income or political views. This is attribute disclosure - the attacker does not need to find the exact record, just a group of records that all share the same sensitive value.

These are just a few examples. In general, the existence and likelihood of risks depends heavily on the variables in the dataset, the sensitivity of the data, the recruitment strategy, and many other factors.

Note: Participation Knowledge

A central factor when judging whether re-identification is “reasonably likely” is whether an attacker knows that a specific person is in the dataset (“Teilnahmekenntnis” / participation knowledge). The re-identification risk rises sharply when attackers know who participated in a survey. Klaus Pforr puts it pointedly (https://sciences.social/@klauspforr/110775060196195399): “Never post your survey participation in a run-of-the-mill survey on social media because the whole anonymisation concept most of time assumes negligble prior probs for these. Whenever you see this in the wild, send a direct message that people should refrain from this behavior in the future”

In practice, large surveys typically assume no participation knowledge when evaluating their level of anonymity (as Johannes Breuer, then at GESIS, confirmed). This assumption matters a great deal for k-anonymity: with participation knowledge, an attacker only has to distinguish among the records in the sample rather than among everyone in the population.
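The effect of participation knowledge can be illustrated with a back-of-the-envelope calculation. The numbers below are made up purely for illustration:

```r
# Illustrative only: re-identification probability for one record under
# the two assumptions. fk and Fk are hypothetical frequency counts.
fk <- 2    # records in the SAMPLE sharing the key-variable combination
Fk <- 40   # estimated matches in the POPULATION

# Without participation knowledge, the attacker must consider everyone
# in the population who matches the combination:
p_no_knowledge <- 1 / Fk   # 0.025

# With participation knowledge, only the sample matters:
p_knowledge <- 1 / fk      # 0.5
```

The same record goes from a 2.5% to a 50% re-identification probability once the attacker knows the person is in the data.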

Measuring Privacy Risks

The sdcMicro Package

To formally measure privacy risks (and for most of the following exercises), we will use the package sdcMicro in R.

sdcMicro is an R package specialized in the anonymization of microdata (datasets containing records on individual persons). It offers both a command-line interface and a graphical user interface (GUI).
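Installation and loading follow the usual R conventions (a short sketch; the GUI is launched with sdcApp()):

```r
# Install once from CRAN, then load the package
install.packages("sdcMicro")
library(sdcMicro)

# Optional: launch the browser-based GUI
# sdcApp()
```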


Using sdcMicro To Assess Privacy Risks

Let’s start by creating an sdcMicro object of class sdcMicroObj. This object becomes the working environment for all subsequent anonymization and risk assessment steps. It stores

  • the original dataset,
  • the roles of variables within the dataset,
  • risk measures,
  • anonymized versions of the variables,
  • utility measures, and
  • the history of modifications.

sdcMicro updates risk and utility estimates after each anonymization step.

For this object, we start with defining the following:

  • dat: the dataset we are working with

  • keyVars: categorical indirect identifiers (i.e., variables with discrete categories) whose risk we want to assess (e.g., age, gender)

  • numVars: continuous (i.e., numeric) indirect identifiers (e.g., income)

sdcMicro needs, at a minimum, the data and the categorical indirect identifiers to assess risks. We can later add more information to the object.

Note: More Functions of sdcMicro
  • extractManipData(sdcObject)
    Extracts the anonymized dataset.

  • undolast(sdcObject)
    Reverts the last anonymization step.

  • get.sdcMicroObj(sdcObject, type)
    Accesses internal components of the object (e.g., type = "risk").
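A typical round trip with these helpers might look like this (a sketch; it assumes an sdcMicro object named `sdc_data`, like the one created in the exercise below):

```r
# Sketch: common helper calls on an existing sdcMicro object `sdc_data`
anon_df  <- extractManipData(sdc_data)                 # anonymized data
sdc_data <- undolast(sdc_data)                         # revert last step
risk     <- get.sdcMicroObj(sdc_data, type = "risk")   # internal risk slot
```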


Exercise: Assessing Privacy Risks With sdcMicro

  1. Create an object within sdcMicro. Inspect the object.
  2. What k-anonymity level is currently reached?

Create the object. Provide gender, age, education, and postal code as the categorical variables of interest (keyVars). The only numerical variables we are interested in are income and years in job (numVars).

library(sdcMicro)

sdc_data <- createSdcObj(
  dat = data_withoutdirectidentifiers, 
  keyVars = c("gender", "age", "education", "plz"),
  numVars = c("income", "years_in_job")
  )  

sdc_data # Summarizes the risks overall
The input dataset consists of 200 rows and 13 variables.
  --> Categorical key variables: gender, age, education, plz
  --> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:

Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
 Key Variable  Number of categories   Mean size         Size of smallest (>0)
 gender                 3    (3)      66.667 (66.667)       10   (10)
 age                   52   (52)       3.846  (3.846)        1    (1)
 education              5    (5)      40.000 (40.000)        9    (9)
 plz                  152  (152)       1.316  (1.316)        1    (1)
----------------------------------------------------------------------
Infos on 2/3-Anonymity:

Number of observations violating
  - 2-anonymity: 198 (99.000%)
  - 3-anonymity: 200 (100.000%)
  - 5-anonymity: 200 (100.000%)

----------------------------------------------------------------------
Numerical key variables: income, years_in_job

Disclosure risk is currently between [0.00%; 100.00%]

Current Information Loss:
  - IL1: 0.00
  - Difference of Eigenvalues: 0.000%
----------------------------------------------------------------------

sdcMicro shows us the unmodified, original parameters in parentheses. Right now, they are, of course, identical to the current parameters as we have not made any changes yet.

For the categorical key variables, we get information on the number and size of categories. The more categories there are and the fewer entries each category contains, the easier re-identification becomes, because many individuals end up with unique values.

We also get information on the k-anonymity of the combinations of these categorical variables: currently, all observations breach 3- and 5-anonymity. Only two persons in our dataset share the same combination of gender, age, education, and plz; the remaining 198 observations (99%) breach even 2-anonymity, meaning each of them has a unique combination of these demographic variables.

Upon closer inspection, these entries are IDs 94 and 122, who share many demographic details: both are 18 years old, live in Munich at postal code 80636, and have a high school degree. If I knew my male, 18-year-old friend living at 80636 had participated in this study, I could not identify him based on a combination of these data points alone. However, if I knew his job title, I could still identify him. Let’s keep this in mind for further analyses.

knitr::kable(data_withoutdirectidentifiers[data_withoutdirectidentifiers$id %in% c(94, 122), ])
| id | plz | gender | age | income | years_in_job | religion | job_title | education | pol_immigration | pol_environment | pol_redistribution | pol_eu_integration |
|----|-----|--------|-----|--------|--------------|----------|-----------|-----------|-----------------|-----------------|--------------------|--------------------|
| 94 | 80636 | male | 18 | 38100.28 | 1 | None | Scientist, product/process development | high school | 3 | 3 | 3 | 2 |
| 122 | 80636 | male | 18 | 37003.44 | 1 | None | Sound technician, broadcasting/film/video | high school | 2 | 5 | 2 | 2 |

This means that we currently do not reach even 2-anonymity: apart from these two records, every person has a unique combination of key variables.
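You can cross-check sdcMicro’s k-anonymity report by counting the key-variable combinations directly (a sketch; assumes `dplyr` and the dataset from above):

```r
library(dplyr)

# Recompute the sample frequency fk of each key-variable combination
# and count the observations violating 2-anonymity (fk < 2).
data_withoutdirectidentifiers |>
  add_count(gender, age, education, plz, name = "fk") |>
  summarise(violating_2anonymity = sum(fk < 2))
```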

For the numerical key variables, sdcMicro shows the disclosure risk. This parameter is a relative measure: it reflects how much the values were changed by anonymization. Since we have not modified anything yet, sdcMicro cannot narrow it down and reports the uninformative interval of 0% to 100%.

Other risk estimates

sdcMicro offers even more risk estimates:

# Risk summary
print(sdc_data, type = "risk")
Risk measures:

Number of observations with higher risk than the main part of the data: 0
Expected number of re-identifications: 199.00 (99.50 %)
# Individual risk
individualrisk <- get.sdcMicroObj(sdc_data, "risk")$individual

The individual risk shows a risk estimate for each individual in the dataset. fk is the frequency count in the sample: it shows how many individuals share the same combination of key variables currently defined in our sdcObject. We can see that this is 1 for all participants except IDs 94 and 122 discussed above, where it is 2. Fk is the estimated frequency in the population; it only differs from fk if sampling weights are defined.
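A quick way to inspect these counts (a sketch; the individual risk slot is a matrix whose columns include risk, fk, and Fk):

```r
# Tabulate the sample frequencies: how many records have fk = 1, fk = 2, ...
table(individualrisk[, "fk"])

# Rows whose key-variable combination is shared by at least one other person
which(individualrisk[, "fk"] > 1)
```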

Note: There Is No One Correct Variable Assignment

How you assign variables to keyVars and numVars is a judgment call - there is no universally correct answer. It depends on who you think is likely to try to re-identify someone, and what information they would already have.

The most realistic attack scenario for this dataset is probably someone who knows a study participant personally - a colleague, friend, or family member. Such a person would most likely know basic demographic attributes: the participant’s gender, approximate age, postal code, and level of education. These are the variables an attacker is most plausibly able to use to find a specific record, which is why we treat them as keyVars.

income is placed in numVars because it is less likely to be known in advance - it is something an attacker might learn from the dataset, not something they use to search it.

years_in_job is a borderline case. It behaves like a continuous variable, but a colleague might know roughly how long someone has been in their job. Placing it in keyVars would be the conservative choice. In this case, I consider it less likely to be known, which is why I put it in numVars.

The same logic applies to religion: if a plausible attacker is someone who knows the participant from a religious community, then religion belongs in keyVars, not treated as a sensitive outcome. The variables you choose as keys should reflect the attack scenario you consider most realistic for your specific sample and context.
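If you judged the religious-community scenario to be realistic, the conservative assignment discussed in this note would look like this (a hypothetical alternative, not the assignment used in this chapter):

```r
# Conservative alternative: treat religion and years_in_job as
# additional categorical keys an attacker might already know.
sdc_conservative <- createSdcObj(
  dat     = data_withoutdirectidentifiers,
  keyVars = c("gender", "age", "education", "plz", "religion", "years_in_job"),
  numVars = c("income")
)
```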

At this point, you decide whether the level of privacy risk is acceptable. If it is, you do not need anonymization and can skip ahead to preparing your data for publication.

Save the sdcObject for further anonymization steps.

saveRDS(sdc_data, "../sdc_micro.rds")

If the risk is not acceptable - as is the case here - we will start with the actual anonymization process. To that end, I will present you several techniques over the next few chapters.

Learning Objective

  • After completing this part of the tutorial, you will be able to calculate individual privacy risks for categorical key variables by utilizing k-anonymity.

Exercises

  • Calculate the privacy risk for a given set of key variables.