predictive performance measures for machine learning (potentially as call-out box)
information loss measures:
distance/distribution comparisons
penalties for transformations through generalisation and suppression
statistical differences
assessing utility for the current use case
perform statistical analysis before and after anonymization
see if results are comparable
Include exercise for calculating one information loss measure
Striking the Right Balance
iterative process: rework anonymization after measuring both
explain the norms for acceptable values of privacy and utility indicators
People to contact at LMU (data stewards? open science team?)
Exercise
Evaluate the utility of the dataset using sdcMicro functions
After anonymisation, you need to check that the data is still useful — that the patterns you care about (e.g., the relationship between religion and political opinion) have not been destroyed. sdcMicro provides utility measures, but it is also worth comparing distributions and running the actual analysis on both the original and anonymised data.
Key utility concepts:
IL1 — average absolute difference between original and anonymised numeric values (lower = better)
Eigen — compares the covariance structure; a difference of eigenvalues close to 0 means the relationships between variables are preserved
utility() / print(..., type = "utility") — overall summary
Visual comparison — plotting distributions side-by-side is often the most intuitive check
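To build intuition for IL1, the following sketch computes a simplified IL1-style measure in base R on toy data. Note that sdcMicro's own IL1 additionally scales the differences per variable, so its exact values will differ; the function and object names (`il1`, `orig`, `anon`) are illustrative, not part of sdcMicro.

```r
# Simplified IL1-style information loss: the per-variable standardised
# absolute difference between original and anonymised numeric values,
# averaged over all cells. Lower = less distortion.
il1 <- function(original, anonymised) {
  stopifnot(ncol(original) == ncol(anonymised),
            nrow(original) == nrow(anonymised))
  sds   <- apply(original, 2, sd)        # per-variable spread for standardisation
  diffs <- abs(original - anonymised)    # cell-wise distortion
  mean(sweep(diffs, 2, sds, "/"))        # standardised average difference
}

set.seed(1)
orig <- cbind(income = rnorm(100, 50000, 10000),
              years  = rnorm(100, 10, 4))
anon <- orig + rnorm(200, 0, 500)        # mild perturbation as stand-in anonymisation
il1(orig, anon)                          # small value = little numeric distortion
```

Running the same function on two identical matrices returns 0, which is the "no information loss" baseline this measure is anchored to.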
Exercise: Evaluating Data Utility
Use sdc_micro (the microaggregation result from the previous exercise) and compare it to the original data.
Print the utility summary of sdc_micro. What do IL1 and the eigen measure tell you?
Compare the income distribution before and after anonymisation. Plot histograms of the original and anonymised income side-by-side.
Re-run the analysis that motivated data collection: compute the mean political opinion scores (pol_immigration, pol_redistribution) by religion group on both the original and anonymised data. Are the group differences preserved?
Reflect: based on your findings, is the anonymised dataset fit for purpose? Would you adjust any of the anonymisation steps?
Solution
Step 1 – Utility summary
sdc_micro
The input dataset consists of 200 rows and 13 variables.
--> Categorical key variables: gender, age, education, plz
--> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:
Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as separate categories!
Key Variable   Number of categories   Mean size          Size of smallest (>0)
gender         3 (3)                  66.667 (66.667)    11 (11)
age            66 (66)                3.030 (3.030)      1 (1)
education      5 (5)                  40.000 (40.000)    5 (5)
plz            161 (161)              1.242 (1.242)      1 (1)
Infos on 2/3-Anonymity:
Number of observations violating
- 2-anonymity: 200 (100.000%)
- 3-anonymity: 200 (100.000%)
- 5-anonymity: 200 (100.000%)
----------------------------------------------------------------------
Numerical key variables: income, years_in_job
Disclosure risk is currently between [0.00%; 100.00%]
Current Information Loss:
- IL1: 7.73
- Difference of Eigenvalues: 0.010%
----------------------------------------------------------------------
IL1 close to 0 indicates little numeric distortion. A difference of eigenvalues close to 0 indicates that the covariance structure between the numeric variables is preserved. If you applied strong microaggregation (a large aggregation level k, i.e. large groups), you will see a higher IL1 and a larger difference of eigenvalues.
Step 2 – Compare income distributions
library(ggplot2)
library(tibble)

# Extract anonymised income from the sdc object
income_original   <- data_withoutdirectidentifiers$income
income_anonymised <- get.sdcMicroObj(sdc_micro, "manipNumVars")$income

income_compare <- tibble(
  income  = c(income_original, income_anonymised),
  version = rep(c("Original", "Anonymised"), each = length(income_original))
)

ggplot(income_compare, aes(x = income, fill = version)) +
  geom_histogram(alpha = 0.6, bins = 30, position = "identity") +
  scale_fill_manual(values = c("Original" = "#2E4057", "Anonymised" = "#E07A5F")) +
  labs(
    title = "Income distribution: original vs. anonymised",
    x = "Annual income (€)", y = "Count", fill = NULL
  ) +
  theme_minimal()
The shapes should look similar. A heavily chunked distribution (tall bars at regular intervals) would indicate that microaggregation groups were too large.
Sanity check: if the anonymised data looks virtually identical to the non-anonymised data, verify that the anonymisation step was actually applied to the variable you are plotting.
Step 3 – Check whether the key analysis still works
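One way to sketch this comparison, assuming the `religion`, `pol_immigration`, and `pol_redistribution` columns from the earlier exercises and using sdcMicro's `extractManipData()` to rebuild the anonymised data frame (the dplyr pipeline here is illustrative):

```r
library(dplyr)
library(sdcMicro)

# Rebuild a data frame containing all manipulated variables
anon_data <- extractManipData(sdc_micro)

# Group means of the political opinion scores on the original data
summary_original <- data_withoutdirectidentifiers |>
  group_by(religion) |>
  summarise(mean_immigration    = mean(pol_immigration, na.rm = TRUE),
            mean_redistribution = mean(pol_redistribution, na.rm = TRUE))

# The same summary on the anonymised data
summary_anonymised <- anon_data |>
  group_by(religion) |>
  summarise(mean_immigration    = mean(pol_immigration, na.rm = TRUE),
            mean_redistribution = mean(pol_redistribution, na.rm = TRUE))

summary_original
summary_anonymised
```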
If the group means are close between both summaries, the anonymisation has preserved the analytical signal. Political opinion variables were not altered (they are neither key variables nor numeric variables in the sdcMicro object), so differences here would only arise if the grouping structure changed dramatically.
Step 4 – Reflection
There is no single correct answer. Consider:
If income distributions look very similar and group means are preserved → the anonymisation is fit for purpose for this analysis
If you used k = 5 and the distortion is too high, decreasing the aggregation level to k = 3 would reduce distortion while still providing some protection; weigh this against the risk reduction you found in the earlier exercises
Variables like plz_region and age_group were generalised; if a future analysis requires finer geographic or age breakdowns, you would need to negotiate a different anonymisation strategy with the data controller
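If you do decide to adjust, sdcMicro supports an iterative workflow: `undolast()` rolls back the most recent anonymisation step so you can re-apply it with different parameters. A sketch, assuming `sdc_micro` from the previous exercises (method name and aggregation level are illustrative choices):

```r
library(sdcMicro)

# Roll back the last applied method (here: the microaggregation step)
sdc_micro <- undolast(sdc_micro)

# Re-apply microaggregation with a smaller aggregation level,
# i.e. smaller groups and therefore less distortion
sdc_micro <- microaggregation(sdc_micro, method = "mdav", aggr = 3)

# Inspect risk and information loss again before settling on a version
print(sdc_micro, type = "numrisk")
```

This undo-and-retry loop is exactly the iterative process mentioned above: measure risk and utility, rework the anonymisation, and measure again.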
Learning Objective
After completing this part of the tutorial, you will be able to make informed decisions when balancing the risks and utility of the anonymized data.