After completing this part of the tutorial, you will be able to make informed decisions when balancing the risks and utility of the anonymized data.
As we have seen, anonymization techniques are all about hiding or obscuring the data or links within the data. This, of course, comes at a cost: The data may lose some of its utility.
We therefore need to find a well-calibrated balance between utility and anonymization, the famous sweet spot where we share data “as open as possible, as closed as necessary.” To achieve this, we can utilize various measures - many of which we have already seen in the chapter on assessing privacy risks and in various sdcMicro outputs.
Measuring Privacy
The most straightforward way to measure privacy after anonymization is to compare k-anonymity before and after. In the chapter on assessing privacy risks, we found that nearly every participant in the original dataset had a unique combination of key variables. After applying non-perturbative techniques in the chapter on non-perturbative techniques, we achieved 3-anonymity - every record shares its key variable combination with at least two others. Printing the sdcObject gives you the current k-anonymity level directly:
sdc_nonpert
The input dataset consists of 200 rows and 12 variables.
--> Categorical key variables: gender, age, education, plz
--> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:
Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as separate categories!
Key Variable   Number of categories   Mean size         Size of smallest (>0)
gender         4 (3)                  64.667 (66.667)   4 (10)
age            5 (52)                 40.000 (3.846)    23 (1)
education      5 (5)                  32.000 (40.000)   5 (9)
plz            3 (152)                99.000 (1.316)    90 (1)
Infos on 2/3-Anonymity:
Number of observations violating
- 2-anonymity: 0 (0.000%) | in original data: 198 (99.000%)
- 3-anonymity: 0 (0.000%) | in original data: 200 (100.000%)
- 5-anonymity: 29 (14.500%) | in original data: 200 (100.000%)
----------------------------------------------------------------------
Numerical key variables: income, years_in_job
Disclosure risk (~100.00% in original data):
modified data: [0.00%; 92.00%]
Current Information Loss in modified data (0.00% in original data):
IL1: 495.71
Difference of Eigenvalues: 6.200%
----------------------------------------------------------------------
The k-anonymity and disclosure risks are objective values, but always need to be considered in context: What variables did we define within the sdcObject? What external data exist? What is the sampling strategy and what is known about that?
In practice, it can be challenging to correctly assess these risks given the complexity of research projects and the external information that may exist. In such cases, red teaming the data anonymization can help (Jansen et al. 2026): another person (e.g., another member of your lab) is instructed to play the role of an attacker, for example by searching for external information or by singling out the least protected individuals.
In general, it is worth checking the assumptions of the anonymization process every few years after the data has been published, especially if new re-identification techniques have emerged (see the callout box below).
Note: Deanonymization Using AI
AI tools are increasingly being used to re-identify individuals from datasets that were considered anonymous. Recent research shows that large language models (LLMs) can match anonymous text profiles to named individuals across platforms - for example, linking an anonymous forum post to a person’s identified social media account (Lermen et al., n.d.). Separately, research has shown that individuals can be re-identified from behavioral traces alone, including web browsing history, geolocation data, and even face scans, by combining datasets that each seem innocuous on their own (Rocher et al. 2025).
For quantitative survey or experimental data of the kind covered in this tutorial, I do not expect AI tools to fundamentally change the re-identification risk - the attack surface is just not the same as for text or behavioral data. That said, AI can make an attacker more efficient: tasks like searching for external information, generating plausible attack combinations, or automating record linkage are all easier with modern AI tools.
Measuring Utility
How do we know whether our anonymized data is still useful? There are several approaches to measuring utility, ranging from simple comparisons to formal indices.
sdcMicro provides two built-in information loss measures for numeric variables. IL1 is the average absolute difference between original and anonymized values, scaled by each variable's spread. A value of 0 means no change; higher values mean more distortion. The eigenvalue ratio compares the eigenvalues of the covariance matrices of the original and anonymized data; values close to 0 indicate little change to the multivariate structure. Both indicators are primarily useful for comparing different anonymization methods with each other.
Both values are part of the standard sdcMicro output after applying any anonymization technique.
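If you want a feel for what these indices react to, you can recompute comparable quantities from the raw numeric matrices yourself. The sketch below follows the IL1s definition given in the sdcMicro documentation (absolute deviations scaled by sqrt(2) times each variable's standard deviation) and a simple eigenvalue comparison; the helper names are ours, and this is an illustration rather than the package's internal code, so the absolute numbers may differ from the printed IL1 depending on normalization:
# Illustrative IL1-style measure: mean absolute deviation between the
# original (x) and anonymized (x_anon) values, scaled per variable by
# sqrt(2) * sd of the original variable
il1_sketch <- function(x, x_anon) {
  x <- as.matrix(x)
  x_anon <- as.matrix(x_anon)
  s <- apply(x, 2, sd, na.rm = TRUE)
  d <- sweep(abs(x - x_anon), 2, sqrt(2) * s, "/")
  mean(d, na.rm = TRUE)
}

# Illustrative eigenvalue comparison: relative change in the eigenvalues
# of the covariance matrix (values near 0 = structure mostly unchanged)
eigen_sketch <- function(x, x_anon) {
  e_orig <- eigen(cov(as.matrix(x), use = "pairwise.complete.obs"))$values
  e_anon <- eigen(cov(as.matrix(x_anon), use = "pairwise.complete.obs"))$values
  sum(abs(e_anon - e_orig)) / sum(e_orig)
}
Applied to the numeric key variables of the original and anonymized data, these sketches let you compare different methods on a common scale.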
Exercise: Evaluating Data Utility
Using sdcMicro’s Parameters
Let’s go back to the two sdcObjects we created in the chapter on perturbative techniques: sdc_micro (where we applied microaggregation) and sdc_noise (where we applied noise).
Compare the utility with IL1 and the difference of eigenvalues. Which method do you prefer when looking at these utility indices?
Make a decision for one of the methods and export the data for further use.
Important: Solution
Simply inspect the sdcObjects to see the values of IL1 and the eigenvalue difference:
sdc_micro
The input dataset consists of 200 rows and 12 variables.
--> Categorical key variables: gender, age, education, plz
--> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:
Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as separate categories!
Key Variable   Number of categories   Mean size         Size of smallest (>0)
gender         4 (3)                  64.667 (66.667)   4 (10)
age            5 (52)                 40.000 (3.846)    23 (1)
education      5 (5)                  32.000 (40.000)   5 (9)
plz            3 (152)                99.000 (1.316)    90 (1)
Infos on 2/3-Anonymity:
Number of observations violating
- 2-anonymity: 0 (0.000%) | in original data: 198 (99.000%)
- 3-anonymity: 0 (0.000%) | in original data: 200 (100.000%)
- 5-anonymity: 29 (14.500%) | in original data: 200 (100.000%)
----------------------------------------------------------------------
Numerical key variables: income, years_in_job
Disclosure risk (~100.00% in original data):
modified data: [0.00%; 77.50%]
Current Information Loss in modified data (0.00% in original data):
IL1: 1029.80
Difference of Eigenvalues: 6.100%
----------------------------------------------------------------------
sdc_noise
The input dataset consists of 200 rows and 12 variables.
--> Categorical key variables: gender, age, education, plz
--> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:
Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as separate categories!
Key Variable   Number of categories   Mean size         Size of smallest (>0)
gender         4 (3)                  64.667 (66.667)   4 (10)
age            5 (52)                 40.000 (3.846)    23 (1)
education      5 (5)                  32.000 (40.000)   5 (9)
plz            3 (152)                99.000 (1.316)    90 (1)
Infos on 2/3-Anonymity:
Number of observations violating
- 2-anonymity: 0 (0.000%) | in original data: 198 (99.000%)
- 3-anonymity: 0 (0.000%) | in original data: 200 (100.000%)
- 5-anonymity: 29 (14.500%) | in original data: 200 (100.000%)
----------------------------------------------------------------------
Numerical key variables: income, years_in_job
Disclosure risk (~100.00% in original data):
modified data: [0.00%; 76.50%]
Current Information Loss in modified data (0.00% in original data):
IL1: 893.79
Difference of Eigenvalues: 6.460%
----------------------------------------------------------------------
The two indices disagree: IL1 is smaller for the noise approach (893.79 vs. 1029.80), indicating that the perturbed values stay closer to the original data, while the difference of eigenvalues is slightly smaller for microaggregation (6.100% vs. 6.460%), indicating that microaggregation better preserves the multivariate structure. I opt for microaggregation here, since preserving the relationships between variables matters most for our analyses.
Now I can extract the dataset from the sdc_micro object. Because sdcMicro stores recoded key variables as integer codes internally, I need to restore the labels manually using mutate:
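A minimal sketch of that extraction, using sdcMicro's extractManipData() and assuming the level orders from the recoding steps in the earlier chapters (verify with levels() on your own object before relabeling):
library(dplyr)

# Pull the anonymized data out of the sdcObject
data_anonymized <- extractManipData(sdc_micro)

# Restore readable labels on the recoded key variables
# (assumed level order - check against your own recoding!)
data_anonymized <- data_anonymized %>%
  mutate(
    age = factor(age, labels = c("18-29", "30-39", "40-49", "50-59", "60+")),
    plz = factor(plz, labels = c("8xxxx", "other"))
  )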
As a last anonymization step, I will now generalize rare religion categories using fct_lump_min() from forcats. Since religion is not a variable in the sdcObject, I apply this directly to the dataset as an additional layer, protecting individuals with rare religions.
library(forcats)

data_anonymized <- data_anonymized %>%
  mutate(religion = fct_lump_min(as.factor(religion), min = 10, other_level = "Other"))

table(data_anonymized$religion)
Catholicism Islam None Protestantism Other
56 13 75 50 6
fct_lump_min(religion, min = 10) merges any category with fewer than 10 observations into "Other". Groups like "Judaism" or "Buddhism", which have very few members in this dataset, are natural candidates for collapsing, since a person could potentially be identified through the combination of a rare religion and other attributes.
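If you want to see in advance which categories fall below the threshold, sorting the frequency table of the unlumped variable is a quick check:
# Categories sorted by frequency - anything under 10 will be lumped
sort(table(data_withoutdirectidentifiers$religion))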
Using Statistics
For non-numeric variables that were recoded or suppressed, there is no single formal measure - but you can compare frequency tables before and after. For example, comparing the distribution of age bands or religion categories tells you whether the recoding preserved the overall composition of the sample.
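For example, comparing category shares of the religion variable directly:
# Share of each religion category before and after recoding
prop.table(table(data_withoutdirectidentifiers$religion))
prop.table(table(data_anonymized$religion))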
Beyond these built-in measures, the most practically relevant utility check is simply to re-run the analysis that motivated data collection on both the original and anonymized datasets and compare the results. For our dataset, that means checking whether the association between religion and political opinion is similar to before anonymization. If the key result holds, the anonymized data is fit for purpose.
Compare descriptive statistics on the indirect identifiers age, gender, education, income, plz, and years in job before and after anonymization.
Compare the relation between religion and the political opinion variables pol_immigration, pol_environment, pol_redistribution, and pol_eu_integration before and after anonymization.
Reflect on these comparisons: Is utility well-enough preserved?
Important: Solution
Descriptive statistics on indirect identifiers
Compare distributions before and after using summary() and frequency tables:
# Before anonymization
summary(data_withoutdirectidentifiers[, c("age", "income", "years_in_job")])
age income years_in_job
Min. :18.00 Min. : 1508 Min. : 0.00
1st Qu.:30.00 1st Qu.: 36964 1st Qu.: 2.00
Median :41.00 Median : 52219 Median : 4.00
Mean :40.99 Mean : 60789 Mean : 5.34
3rd Qu.:50.00 3rd Qu.: 64582 3rd Qu.: 8.00
Max. :70.00 Max. :902234 Max. :32.00
table(data_withoutdirectidentifiers$gender, useNA = "always") # Make sure to show NAs as well
# After anonymization
summary(data_anonymized[, c("income", "years_in_job")])
income years_in_job
Min. : 3576 Min. : 0.00
1st Qu.:37246 1st Qu.: 2.00
Median :52156 Median : 4.00
Mean :51130 Mean : 5.16
3rd Qu.:64451 3rd Qu.: 8.00
Max. :89186 Max. :15.00
table(data_anonymized$age, useNA = "always")
18-29 30-39 40-49 50-59 60+ <NA>
47 43 58 29 23 0
table(data_anonymized$gender, useNA = "always")
female male non-binary <NA>
104 86 4 6
table(data_anonymized$education, useNA = "always")
doctoral title high school trade school university <NA>
5 106 22 27 40
table(data_anonymized$plz, useNA = "always")
8xxxx other <NA>
108 90 2
Age and postal code are now categorical bands, so the numeric summary is no longer meaningful - use frequency tables instead. Income and years in job are still numeric after microaggregation; their medians are very close to the originals, but income's mean is noticeably lower - a consequence of the extreme outliers that were top-coded.
For gender, the data of 6 non-binary individuals was suppressed to achieve the necessary k-anonymity.
For education, all individuals without a degree were recoded to NA; in each of the other categories, a few individuals' values were suppressed. This might be a problem for data utility, especially if our goal was to show that we reached individuals from all educational levels. If that were the case, I'd recommend going back to the local suppression step and prioritizing that education remains unsuppressed.
We lost, as intended, a lot of information on plz and age. For our use case, this is okay.
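Turning to the second comparison - religion and the political opinion variables - we can compare group-level means in both datasets. A minimal sketch, assuming the opinion items are numeric scales (the helper name opinion_means is ours):
library(dplyr)

# Mean political opinion per religion group
opinion_means <- function(df) {
  df %>%
    group_by(religion) %>%
    summarise(across(c(pol_immigration, pol_environment,
                       pol_redistribution, pol_eu_integration),
                     ~ mean(.x, na.rm = TRUE)))
}

opinion_means(data_withoutdirectidentifiers)  # original categories
opinion_means(data_anonymized)                # rare groups merged into "Other"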
The group-level means should be virtually identical between the original and anonymized datasets. The key research question - whether religion predicts political opinion - is still answerable. Note that some small religion categories have been merged into "Other", which changes the granularity of the analysis but preserves the main patterns. Other than that, we did not change anything about these variables.
Reflection
For the purpose of this dataset - studying the relationship between religion and political opinion - utility is well-preserved. The main variable of interest (religion) is still present, just with fewer fine-grained categories. The political opinion variables are unchanged. Income and years in job show very similar distributions after microaggregation.
Striking the Right Balance
Balancing privacy and utility is not a one-shot decision - it is an iterative process. After measuring both, you may find that your current anonymization is too aggressive (data is well-protected but the key analysis no longer works) or too lenient (utility is preserved but too many records are still unique). In either case, you go back to the anonymization steps and adjust: changing recoding thresholds, relaxing or tightening the k-anonymity target, or choosing a different perturbative technique.
For our dataset, a reasonable workflow looks like this: start by checking the k-anonymity level and disclosure risk on sdc_nonpert (from the chapter on non-perturbative techniques), then check the information loss after applying perturbative techniques on sdc_micro or sdc_noise (from the chapter on perturbative techniques), and finally compare the religion-political opinion analysis on original and anonymized data. If the analysis still holds and k-anonymity is at least 3, the anonymization is likely sufficient for sharing.
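In code, this checklist boils down to re-inspecting the objects we already created:
print(sdc_nonpert)  # k-anonymity and categorical disclosure risk
print(sdc_micro)    # information loss after microaggregation
print(sdc_noise)    # information loss after noise addition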
There are no universal thresholds for “good enough” privacy or utility. A commonly used rule of thumb is that k-anonymity should be at least 3-5 for most research datasets, meaning every combination of indirect identifiers appears at least 3-5 times (see discussion above). For utility, IL1 values and eigenvalue differences close to 0 indicate low distortion. But ultimately, the right balance depends on your specific context: how sensitive the data is, who will have access, and which analyses need to be supported. When in doubt, consult your institution’s data protection officer (here at LMU), data steward or local research data management team (here at LMU), or open science team (here at LMU).
Let’s keep in mind that while there is a tension between data protection and openness, in practice these principles enable each other: Sharing data builds trust in research findings, while protecting participants builds trust in researchers. When done well, anonymization is the bridge between these goals - it lets you share data openly while honoring the privacy expectations of the people who provided it (Jansen et al. 2025).
Resources, Links, Examples
Examples of decisions for balancing utility and privacy from the UK are documented here.
Jansen, Luisa, Nele Borgert, and Malte Elson. 2025. “On the Tension Between Open Data and Data Protection in Research.” Pre-published April 7. https://doi.org/10.31234/osf.io/5jt3s_v2.
Jansen, Luisa, Tim Ulmann, Robine Jordi, and Malte Elson. 2026. “Putting Privacy to the Test: Introducing Red Teaming for Research Data Anonymization.” arXiv:2601.19575. https://doi.org/10.48550/arXiv.2601.19575.
Lermen, Simon, Daniel Paleka, Joshua Swanson, Michael Aerni, Nicholas Carlini, and Florian Tramèr. n.d. “Large-Scale Online Deanonymization with LLMs.” https://doi.org/10.48550/ARXIV.2602.16800.
Rocher, Luc, Julien M. Hendrickx, and Yves-Alexandre De Montjoye. 2025. “A Scaling Law to Model the Effectiveness of Identification Techniques.” Nature Communications 16 (1): 347. https://doi.org/10.1038/s41467-024-55296-6.