Assessing Privacy Risks

Learning Objective

After completing this part of the tutorial, you will be able to calculate individual privacy risks for categorical variables using k-anonymity.

In the previous chapters, we covered the foundations: What counts as personal data, what the GDPR requires, what anonymization mechanisms exist, and what privacy risks look like. Now we move to the practical part—actually anonymizing data. But before we jump into specific techniques, we first need to assess how much risk our data carries. That is what this chapter is about.

Anonymization of data, in most cases, means masking or obscuring the data, which always results in a loss of information. However, when sharing data, we need to retain some of the data’s utility; otherwise, they won’t have any potential for reuse.

The goal is to use the appropriate anonymization techniques to achieve an acceptable level of risk for each individual, while at the same time, keeping utility as high as possible (see also the chapter on balancing privacy and utility).

To achieve that, before starting the anonymization process, we will assess the level of privacy risk in our data and investigate potential risks, applying what we learned in the last chapter.

Identifying Privacy Risks

Exercise: Privacy Risks in the Example Data

Take another look at our dataset, as modified in the exercise on the personal data page where we removed the direct identifiers. Try to answer the following questions:

Which combinations of indirect identifiers could potentially single out an individual?
What attack scenarios are plausible? Think about who might try to re-identify someone in this dataset and what external information they might have access to.
What type of disclosure is most likely for this dataset: identity disclosure, attribute disclosure, or membership disclosure?

Solution

Here are a few concrete attack scenarios to consider, based on our dataset of 200 individuals in Germany (with variables: gender, age, education, plz, income, religion, years_in_job, and political opinion variables):

A colleague knows that a specific person lives in a particular postal code area, holds a doctoral degree, and is 34 years old. If the combination of plz, education, and age is unique in the dataset, that colleague can identify the person and look up their income, political opinions, and religion. This is identity disclosure. In this scenario, the attacker knows that their colleague participated in the study. In practice, when calculating risks, we usually assume no such knowledge (see the box on participation knowledge).
A journalist cross-references the dataset with publicly available census or salary reports. By matching education level, income range, and age, they could narrow down candidates and potentially identify individuals—especially in smaller postal code areas where few people share the same combination of characteristics.
A neighbor knows that someone in their building participated in the study. They know the person’s approximate age, gender, and religion. Even if they cannot pinpoint the exact row, they might learn something new about that person’s income or political views. This is attribute disclosure—the attacker does not need to find the exact record, just a group of records that all share the same sensitive value.

For this dataset, identity and attribute disclosure are most likely. Since there is not one single group that was recruited, membership disclosure is not possible.

These are just a few examples; in general, the existence and likelihood of risks are very much dependent on the variables in the dataset, the sensitivity of data, the recruitment strategy, and many other factors (see this chapter on choosing the right technique for more information).

Measuring Privacy Risks

The `sdcMicro` Package

To formally measure privacy risks (and for most of the following exercises), we will use the package sdcMicro in R (Templ et al. 2015).

sdcMicro is an R package specialized in the anonymization of microdata, that is, data sets with one record per individual. sdcMicro offers a command-line interface and a graphical user interface (GUI).

Using `sdcMicro` to Assess Privacy Risks

Let’s start by creating an sdcMicro object of the class sdcMicroObj, using the function createSdcObj. The object becomes the working environment for all subsequent anonymization and risk assessment steps. It stores

the original dataset,
the roles of variables within the dataset,
risk measures,
anonymized versions of the variables,
utility measures, and
the history of modifications.

sdcMicro updates risk and utility estimates after each anonymization step.

For this object, we start with defining the following:

dat: the dataset we are working with
keyVars: categorical indirect identifiers (i.e., variables with categories) whose risk we want to assess (e.g., age, gender)
numVars: continuous (i.e., numeric) indirect identifiers (e.g., income)

sdcMicro needs at least the data and the categorical indirect identifiers to assess risks.

More Functions of sdcMicro

extractManipData(sdcObject)
Extracts the anonymized dataset.
undolast(sdcObject)
Reverts the last anonymization step.
get.sdcMicroObj()
Accesses internal components of the object.
sdcApp()
Opens the GUI, which provides a nice overview of the functions. In case you decide to use it (and I recommend trying it out), create an anonymization report at the end via “export data” to make your anonymization transparent.
AI_createSdcObj() and AI_applyAnonymization()
These new functions use LLMs to assist with the creation of an sdcObject and anonymization. See the sdcMicro vignette on AI-assisted anonymization for how to use them. We won’t use them in this tutorial, so that you learn the processes manually first.

Exercise: Assessing Privacy Risks With `sdcMicro`

In this exercise, you will assess privacy risks with sdcMicro. We will work with the dataset data_withoutdirectidentifiers—our dataset as loaded from the landing page and modified in the exercise on the personal data page, which removed the direct identifiers.

Create an object within sdcMicro that contains the relevant keyVars and numVars using the function createSdcObj. Inspect the object.
What k-anonymity level is currently reached?

Solution

Create the object. Provide gender, age, education, and postal code as categorical variables of interest (keyVars). The only numerical variables we are interested in are income and years in job (numVars).

library(sdcMicro)

# Create the sdcObject and assess privacy risks
sdc_data <- createSdcObj(
  dat = data_withoutdirectidentifiers,
  keyVars = c("gender", "age", "education", "plz"),
  numVars = c("income", "years_in_job")
  )

sdc_data # Summarizes the risks overall

The input dataset consists of 200 rows and 12 variables.
  --> Categorical key variables: gender, age, education, plz
  --> Numerical key variables: income, years_in_job
----------------------------------------------------------------------

Information on categorical key variables:

Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!

 Key Variable Number of categories        Mean size         
       <char>               <char> <char>    <char>   <char>
       gender                    3    (3)    66.667 (66.667)
          age                   52   (52)     3.846  (3.846)
    education                    5    (5)    40.000 (40.000)
          plz                  152  (152)     1.316  (1.316)
 Size of smallest (>0)       
                <char> <char>
                    10   (10)
                     1    (1)
                     9    (9)
                     1    (1)

----------------------------------------------------------------------

Infos on 2/3-Anonymity:

Number of observations violating
  - 2-anonymity: 198 (99.000%)
  - 3-anonymity: 200 (100.000%)
  - 5-anonymity: 200 (100.000%)

----------------------------------------------------------------------

Numerical key variables: income, years_in_job

Disclosure risk is currently between [0.00%; 100.00%]

Current Information Loss:
  - IL1: 0.00
  - Difference of Eigenvalues: 0.000%
----------------------------------------------------------------------

sdcMicro shows us the unmodified, original parameters in parentheses within the summary of the sdcObject. Right now, the original parameters are, of course, identical to the current parameters as we have not made any changes yet.

For the categorical key variables, we get information on the numbers and sizes of categories. The more categories there are and the fewer entries each category has, the easier re-identification becomes, as we have many unique individual values.

We also get information on the k-anonymity of the combinations of these categorical variables: Currently, all observations breach 3- and 5-anonymity. Only two persons in our dataset share the same gender, age, education, and plz. The other 198 observations (99%) are unique, meaning they have a unique combination of these demographic variables, and therefore also breach 2-anonymity.

Upon closer inspection, these entries are IDs 94 and 122, who share many demographic details: both are 18 years old, live in Munich at postal code 80636, and have a high school degree. If I knew an 18-year-old male living at 80636 who had participated in this study, I could not identify him based on a combination of these data points alone. However, if I knew his exact income, I could still identify him. Let’s keep this in mind for further analyses.

# Show the two records that share a key-variable combination
knitr::kable(data_withoutdirectidentifiers[data_withoutdirectidentifiers$id %in% c(94, 122), ])

	id	plz	gender	age	income	years_in_job	religion	education	pol_immigration	pol_environment	pol_redistribution	pol_eu_integration
94	94	80636	male	18	38100.28	1	None	high school	3	3	3	2
122	122	80636	male	18	37003.44	1	None	high school	2	5	2	2

This means that our dataset currently does not reach k-anonymity for any k of 2 or higher, since all other persons have a unique combination.
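To make the counting behind k-anonymity concrete, here is a minimal base-R sketch. The data frame `toy` and all of its values are invented for illustration and are not part of the tutorial dataset; sdcMicro performs this counting internally, so the sketch only mirrors the idea.

```r
# Toy data (hypothetical values, not the tutorial dataset)
toy <- data.frame(
  gender    = c("male", "male", "female", "female", "male"),
  age       = c(18, 18, 34, 35, 52),
  education = c("high school", "high school", "doctorate", "doctorate", "bachelor"),
  stringsAsFactors = FALSE
)
key_vars <- c("gender", "age", "education")

# fk: how many records share each record's combination of key variables
combo <- do.call(paste, c(toy[key_vars], sep = "|"))
toy$fk <- as.integer(ave(seq_along(combo), combo, FUN = length))

# A record violates k-anonymity if fewer than k records share its combination
violations <- sapply(c(2, 3, 5), function(k) sum(toy$fk < k))
names(violations) <- c("k=2", "k=3", "k=5")
violations

# The dataset as a whole is k-anonymous only up to the smallest fk
min(toy$fk)
```

In this toy example, only the first two records share a combination, so the smallest group size is 1 and the data satisfy no k above 1, which is the same situation as in our dataset.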

For the numerical key variables, sdcMicro shows the disclosure risk. This parameter is a relative measure: it reflects how much the values were changed by anonymization. As we have not changed the original data yet, sdcMicro cannot say anything about the disclosure risk right now and reports the non-estimate that it lies somewhere between 0% and 100%.
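One way to picture such a relative measure is an interval check: a record counts as at risk if its original value still lies close to the released value. The sketch below is a simplified illustration with invented numbers, a hypothetical helper `interval_risk`, and an arbitrary tolerance `tol`; it is not sdcMicro's exact implementation.

```r
set.seed(1)
income <- c(21000, 38100, 37000, 54000, 72000)   # invented original values

# Share of records whose original value lies within +/- tol * sd of the released value
interval_risk <- function(original, released, tol = 0.05) {
  mean(abs(original - released) <= tol * sd(original))
}

# Unmodified data: every released value equals its original value
interval_risk(income, income)

# Hypothetical masking step: add random noise, then check again
masked <- income + rnorm(length(income), mean = 0, sd = sd(income))
interval_risk(income, masked)
```

Under this simplified view, unmodified data always sit at the upper end of the range, and the measure only becomes informative once the values have been changed.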

Other risk estimates

sdcMicro offers even more risk estimates:

# Risk summary
print(sdc_data, type = "risk")

Risk measures:

Number of observations with higher risk than the main part of the data: 0
Expected number of re-identifications: 199.00 (99.50 %)

# Individual risk
individualrisk <- get.sdcMicroObj(sdc_data, "risk")$individual

The individual risk shows a risk estimate for each individual in the dataset. fk is the frequency count in the sample and shows how many individuals share the same combination of key variables currently in our sdcObject. We can see that this is 1 for all participants except the IDs 94 and 122 I discussed above. Fk is the estimated frequency in the population; this only differs from fk in case of pre-defined sampling weights.
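The figure of 199 expected re-identifications can be reproduced by hand if we assume, as a simplification, that without sampling weights each record's risk is 1/fk. The vector `fk` below is reconstructed from the summary above (198 unique records plus one pair) rather than extracted from the sdcObject.

```r
# Sample frequencies as described above: 198 unique records and one pair
fk <- c(rep(1, 198), rep(2, 2))

# Simplifying assumption: no sampling weights, so each record's risk is 1 / fk
risk <- 1 / fk

# Expected number of re-identifications: 198 * 1 + 2 * 0.5
expected_reid <- sum(risk)

# The same quantity as a share of all records, in percent
share <- 100 * expected_reid / length(fk)

c(expected = expected_reid, percent = share)
```

With the sdcObject at hand, `head(individualrisk)` lets you inspect the per-record values directly.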

There Is No One Correct Variable Assignment

How you assign variables to keyVars and numVars is a judgment call—there is no universally correct answer. It depends on who you think is likely to try to re-identify someone, and what information they would already have.

The most realistic attack scenario for this dataset is probably someone who knows a study participant personally—a colleague, friend, or family member. Such a person would most likely know basic demographic attributes: the participant’s gender, approximate age, postal code, and level of education. These are the variables an attacker is most plausibly able to use to find a specific record, which is why we treat them as keyVars.

income is placed in numVars as it is a continuous variable.

years_in_job is a borderline case. It behaves like a continuous variable, but a colleague might know roughly how long someone has been in their job. Placing it in keyVars would be the conservative choice. In this case, I consider it less likely to be known exactly, which is why I put it into numVars.

The same logic applies to religion: if a plausible attacker is someone who knows the participant from a religious community, then religion belongs in keyVars rather than being treated as a sensitive outcome. In this case, I think it is not likely to be known by an attacker. The variables you choose as keys should reflect the attack scenario you consider most realistic for your specific sample and context.
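To see how much this judgment call can matter, you can compare the number of unique records under different key sets. The following base-R sketch uses invented toy data and a hypothetical helper `count_unique`; it only illustrates the comparison and does not replace the sdcMicro risk measures.

```r
# Toy data (hypothetical values, not the tutorial dataset)
toy <- data.frame(
  gender       = c("male", "male", "female", "female", "female", "male"),
  education    = c("high school", "high school", "doctorate",
                   "doctorate", "doctorate", "bachelor"),
  years_in_job = c(1, 1, 5, 5, 12, 3),
  stringsAsFactors = FALSE
)

# Count records whose combination of the given key variables is unique
count_unique <- function(data, key_vars) {
  combo <- do.call(paste, c(data[key_vars], sep = "|"))
  sum(table(combo)[combo] == 1)
}

# Attacker knows only gender and education
count_unique(toy, c("gender", "education"))

# Conservative choice: attacker also knows years_in_job
count_unique(toy, c("gender", "education", "years_in_job"))
```

Adding a variable to the key set can only keep the number of unique records the same or increase it, which is why the more conservative assignment yields the higher risk estimate.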

Save the sdcObject for further anonymization steps.

# Save the sdcObject for the next chapter
saveRDS(sdc_data, here::here("sdc_micro.rds"))

At this point, you decide whether the level of privacy risks is acceptable. If so, you do not need to anonymize; you can skip to preparing your data for publication.

If the risk is not acceptable—as is the case here—we will start with the actual anonymization process. To that end, I will present you with several techniques over the next few chapters.

Resources, Links, Examples

More information on assessing risks for public research data: Morehouse et al. (2025)

References

Morehouse, Kirsten N., Benedek Kurdi, and Brian A. Nosek. 2025. “Responsible Data Sharing: Identifying and Remedying Possible Re-Identification of Human Participants.” American Psychologist 80 (6): 928–41. https://doi.org/10.1037/amp0001346.

Templ, Matthias, Alexander Kowarik, and Bernhard Meindl. 2015. “Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro.” Journal of Statistical Software 67 (4). https://doi.org/10.18637/jss.v067.i04.
