Explain issue of data protection vs. data sharing a bit more
Open science and data protection can seem like they pull in opposite directions, but in practice they enable each other. Sharing data builds trust in research findings, while protecting participants builds trust in researchers. When done well, anonymization is the bridge between these goals - it lets you share data openly while honoring the privacy expectations of the people who provided it (Jansen et al. 2025).
Measuring Privacy
Calculate privacy levels (k-anonymity) and compare them to the values measured before anonymization
Include an exercise on calculating the privacy level with k-anonymity, as before in 2_1_Planning Privacy
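k-anonymity can be computed by hand by counting how often each combination of quasi-identifier values occurs. A minimal base-R sketch on a toy data frame (the variable names and values are illustrative, not the tutorial dataset):

```r
# Toy data frame with two quasi-identifiers (names are illustrative)
survey <- data.frame(
  gender    = c("f", "f", "m", "m", "f", "m"),
  age_group = c("20-29", "20-29", "20-29", "30-39", "20-29", "30-39")
)

# Size of each equivalence class: rows sharing all quasi-identifier values
class_sizes <- aggregate(k ~ gender + age_group,
                         data = transform(survey, k = 1), FUN = sum)

min(class_sizes$k)                      # k-anonymity level of the dataset: 1
sum(class_sizes$k[class_sizes$k < 3])   # observations violating 3-anonymity: 3
```

These are the same quantities that sdcMicro reports in its printed summary of the categorical key variables.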
Measuring Utility
How do we know whether our anonymized data is still useful? There are several approaches to measuring utility, ranging from simple comparisons to formal indices.
Provide an overview of utility measurements and try out one:
- predictive performance measures for machine learning (potentially as a call-out box)
- information loss measures:
  - distance/distribution comparisons
  - a penalty for transformations through generalisation and suppression
  - statistical differences
- assessing utility for the current use case:
  - perform the statistical analysis before and after anonymization
  - see if the results are comparable
Include an exercise for calculating one information loss measure
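As a preview, an IL1-style information loss measure can be sketched in a few lines of base R: the mean absolute deviation between original and anonymized values, scaled by each variable's standard deviation. This is a simplification of sdcMicro's IL1, shown here with simulated data:

```r
# Simplified IL1-style information loss (a sketch, not sdcMicro's exact formula)
il1 <- function(orig, anon) {
  orig <- as.matrix(orig)
  anon <- as.matrix(anon)
  s <- apply(orig, 2, sd)                            # per-variable scale
  mean(sweep(abs(orig - anon), 2, sqrt(2) * s, "/"))
}

set.seed(42)
income      <- rnorm(100, mean = 50000, sd = 10000)  # simulated original values
income_anon <- income + rnorm(100, sd = 500)         # mild perturbation

il1(income, income)        # 0: unchanged data, no information loss
il1(income, income_anon)   # small positive value
```

The stronger the perturbation, the larger the value; identical data yields exactly 0.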
Striking the Right Balance
An iterative process: rework the anonymization after measuring both privacy and utility
There are no universal thresholds for “good enough” privacy or utility. A commonly used rule of thumb is that k-anonymity should be at least 3-5 for most research datasets, meaning every combination of quasi-identifiers appears at least 3-5 times. For utility, IL1 values close to 0 and eigenvalue differences close to 0% indicate low distortion. But ultimately, the right balance depends on your specific context: how sensitive is the data, who will have access, and what analyses need to be supported. When in doubt, consult your institution’s data protection officer or data steward.
Link to people to contact at LMU (data stewards? open science team?)
Exercise
Evaluate the utility of the dataset using sdcMicro functions.
After anonymization, you need to check that the data is still useful - that the patterns you care about (e.g., the relationship between religion and political opinion) have not been destroyed. sdcMicro provides utility measures, but it is also worth comparing distributions and running the actual analysis on both the original and anonymised data.
Key utility concepts:
IL1 — average absolute difference between original and anonymised numeric values (lower = better)
Eigen — compares the covariance structure of the numeric variables; an eigenvalue difference close to 0% means the relationships between variables are preserved
dUtility() / print(sdc_micro) — overall summary, including IL1 and the difference of eigenvalues
Visual comparison — plotting distributions side-by-side is often the most intuitive check
Exercise: Evaluating Data Utility
Use the sdc_micro object from the previous exercises and compare it to the original data.
Print the utility summary of sdc_micro. What do IL1 and the eigen measure tell you?
Compare the income distribution before and after anonymisation. Plot histograms of the original and anonymised income side-by-side.
Re-run the analysis that motivated data collection: compute the mean political opinion scores (pol_immigration, pol_redistribution) by religion group on both the original and anonymised data. Are the group differences preserved?
Reflect: based on your findings, is the anonymised dataset fit for purpose? Would you adjust any of the anonymisation steps?
Solution
Step 1 – Utility summary
sdc_micro
The input dataset consists of 200 rows and 13 variables.
--> Categorical key variables: gender, age, education, plz
--> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:
Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
 Key Variable  Number of categories        Mean size  Size of smallest (>0)
       gender                 3 (3)  66.667 (66.667)                11 (11)
          age               66 (66)    3.030 (3.030)                  1 (1)
    education                 5 (5)  40.000 (40.000)                  5 (5)
          plz             161 (161)    1.242 (1.242)                  1 (1)
Infos on 2/3-Anonymity:
Number of observations violating
- 2-anonymity: 200 (100.000%)
- 3-anonymity: 200 (100.000%)
- 5-anonymity: 200 (100.000%)
----------------------------------------------------------------------
Numerical key variables: income, years_in_job
Disclosure risk is currently between [0.00%; 100.00%]
Current Information Loss:
- IL1: 7.73
- Difference of Eigenvalues: 0.010%
----------------------------------------------------------------------
An IL1 close to 0 indicates little numeric distortion, and a difference of eigenvalues close to 0% indicates that the covariance structure between the numeric variables is preserved. If you applied strong microaggregation (a large aggregation level, i.e. large groups), you will see a higher IL1 and an eigenvalue difference further from 0.
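To build intuition for the eigenvalue measure, you can compare the eigenvalues of the covariance matrices of the original and perturbed numeric variables directly. A base-R sketch with simulated data (illustrative only, not sdcMicro's internal code):

```r
set.seed(1)
# Simulated "original" numeric key variables
orig <- cbind(income       = rnorm(200, mean = 50000, sd = 10000),
              years_in_job = rnorm(200, mean = 10,    sd = 5))
# Mimic mild perturbation, scaled to each variable
anon <- orig + cbind(rnorm(200, sd = 500), rnorm(200, sd = 0.5))

ev_orig <- eigen(cov(orig))$values
ev_anon <- eigen(cov(anon))$values

# Relative difference of eigenvalues: values near 0 mean the covariance
# structure (the multivariate relationships) is preserved
sum(abs(ev_orig - ev_anon)) / sum(ev_orig)
```

With such mild perturbation, the relative difference stays close to 0; heavier perturbation or coarse microaggregation drives it up.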
Step 2 – Compare income distributions
library(ggplot2)
library(tibble)

# Extract the anonymised income from the sdc object
income_original   <- data_withoutdirectidentifiers$income
income_anonymised <- get.sdcMicroObj(sdc_micro, "manipNumVars")$income

income_compare <- tibble(
  income  = c(income_original, income_anonymised),
  version = rep(c("Original", "Anonymised"), each = length(income_original))
)

ggplot(income_compare, aes(x = income, fill = version)) +
  geom_histogram(alpha = 0.6, bins = 30, position = "identity") +
  scale_fill_manual(values = c("Original" = "#2E4057", "Anonymised" = "#E07A5F")) +
  labs(
    title = "Income distribution: original vs. anonymised",
    x = "Annual income (€)", y = "Count", fill = NULL
  ) +
  theme_minimal()
The shapes should look similar. A heavily chunked distribution (tall bars at regular intervals) would indicate that microaggregation groups were too large.
Check data: anonymized data virtually identical with non-anonymized
Step 3 – Check whether the key analysis still works
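A sketch of the comparison, assuming the objects from the earlier exercises (data_withoutdirectidentifiers, sdc_micro) are still in the workspace and using sdcMicro's extractManipData() to pull the anonymised data out of the sdc object:

```r
# Assumes objects from the previous exercises are available
data_anonymised <- extractManipData(sdc_micro)

# Mean opinion scores by religion group, original vs. anonymised
aggregate(cbind(pol_immigration, pol_redistribution) ~ religion,
          data = data_withoutdirectidentifiers, FUN = mean)
aggregate(cbind(pol_immigration, pol_redistribution) ~ religion,
          data = data_anonymised, FUN = mean)
```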
If the group means are close in both versions of the data, the anonymisation has preserved the analytical signal. The political opinion variables were not altered (they are neither categorical nor numerical key variables in the sdcMicro object), so differences here would only arise if the grouping structure changed dramatically.
Step 4 – Reflection
There is no single correct answer. Consider:
If income distributions look very similar and group means are preserved → the anonymisation is fit for purpose for this analysis
If you used k = 5 and results still look good, decreasing to k = 3 would reduce distortion while still providing some protection, but weigh this against the risk reduction you found in the earlier exercises
Variables like plz_region and age_group were generalized; if a future analysis requires finer geographic or age breakdowns, you would need to rework the anonymization strategy
Learning Objective
After completing this part of the tutorial, you will be able to make informed decisions when balancing the risks and utility of the anonymized data.
Carvalho, Tânia, Nuno Moniz, Pedro Faria, and Luís Antunes. 2023. “Survey on Privacy-Preserving Techniques for Microdata Publication.” ACM Computing Surveys 55 (14s): 1–42. https://doi.org/10.1145/3588765.
Jansen, Luisa, Nele Borgert, and Malte Elson. 2025. “On the Tension Between Open Data and Data Protection in Research.” Pre-published April 7. https://doi.org/10.31234/osf.io/5jt3s_v2.
Jansen, Luisa, Tim Ulmann, Robine Jordi, and Malte Elson. 2026. Putting Privacy to the Test: Introducing Red Teaming for Research Data Anonymization. arXiv:2601.19575. arXiv. https://doi.org/10.48550/arXiv.2601.19575.