Balancing Utility and Privacy

Explain issue of data protection vs. data sharing a bit more

Open science and data protection can seem like they pull in opposite directions, but in practice they enable each other. Sharing data builds trust in research findings, while protecting participants builds trust in researchers. When done well, anonymization is the bridge between these goals - it lets you share data openly while honoring the privacy expectations of the people who provided it (Jansen et al. 2025).

Measuring Privacy

  • Calculate privacy levels (k-anonymity) and compare to the value measured before

  • Red teaming data anonymization (Jansen et al. 2026)

Include exercise on calculating the privacy level with k-anonymity like before in 2_1_Planning Privacy
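The exercise can be prototyped without sdcMicro: k-anonymity is simply the size of the smallest group of records sharing the same combination of quasi-identifier values. A minimal base-R sketch with a toy data frame (the variable names are illustrative, not from the course dataset):

```r
# k-anonymity by hand: k is the size of the smallest group of records
# that share the same combination of quasi-identifier values.
toy <- data.frame(
  gender    = c("f", "f", "m", "m", "m"),
  age_group = c("30-39", "30-39", "30-39", "40-49", "40-49")
)

# Count how often each observed quasi-identifier combination occurs
combo_counts <- table(interaction(toy$gender, toy$age_group, drop = TRUE))
combo_counts

k <- min(combo_counts)
k  # 1: at least one record is unique on these quasi-identifiers
```

In sdcMicro, the same information appears in the 2/3/5-anonymity section of the printed summary: a record violates k-anonymity whenever its quasi-identifier combination occurs fewer than k times.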

Measuring Utility

How do we know whether our anonymized data is still useful? There are several approaches to measuring utility, ranging from simple comparisons to formal indices.

Provide overview of utility measurements and maybe try out one

  • utility indices (Carvalho et al. 2023)
    • predictive performance measures for machine learning (potentially as call-out box)
    • information loss measures:
      • distance/distribution comparisons
      • a penalty of transformations through generalisation and suppression
      • statistical differences
  • assessing utility for the current use case
    • perform statistical analysis before and after anonymization
    • see if results are comparable

Include exercise for calculating one information loss measure
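As a first try-out of an information loss measure, the sketch below implements an IL1-style measure in plain R: the mean absolute difference between original and anonymized values, scaled by each variable's standard deviation so that variables on different scales are comparable. The scaling follows the IL1s formulation common in the disclosure control literature; the vectors are toy values, not real data.

```r
# IL1-style information loss: average absolute difference between
# original and anonymized values, scaled by sqrt(2) * sd of each
# original variable; lower values mean less distortion.
il1s <- function(orig, anon) {
  orig <- as.matrix(orig)
  anon <- as.matrix(anon)
  s <- apply(orig, 2, sd)                            # per-variable sd
  scaled <- abs(orig - anon) / (sqrt(2) * rep(s, each = nrow(orig)))
  sum(scaled) / (nrow(orig) * ncol(orig))            # average over all cells
}

income_orig <- c(30000, 45000, 52000, 61000)
income_anon <- c(32000, 43000, 52000, 61000)  # e.g. after microaggregation
il1s(income_orig, income_anon)  # ~0.054: small, i.e. little distortion
```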

Striking the Right Balance

  • iterative process: rework anonymization after measuring both

There are no universal thresholds for “good enough” privacy or utility. A commonly used rule of thumb is that k-anonymity should be at least 3-5 for most research datasets, meaning every combination of quasi-identifiers appears at least 3-5 times. For utility, IL1 values close to 0 and eigenvalue differences close to 0% indicate low distortion. Ultimately, though, the right balance depends on your specific context: how sensitive the data is, who will have access, and which analyses need to be supported. When in doubt, consult your institution’s data protection officer or data steward.
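The iterative idea can be made concrete with a toy sketch: apply microaggregation at several group sizes and watch the utility cost grow as the groups (and thus the privacy protection) grow. This is plain base R for illustration only; `microagg()` is a hypothetical helper, not an sdcMicro function, and in practice you would rerun sdcMicro's anonymization and risk measures instead.

```r
# Larger microaggregation groups mean each published value is shared by
# more records (more privacy) but also more distortion (less utility).
set.seed(42)
income <- round(rnorm(60, 50000, 12000))

microagg <- function(x, size) {
  ord <- order(x)
  groups <- ceiling(seq_along(ord) / size)   # consecutive groups of `size`
  x[ord] <- ave(x[ord], groups)              # replace values by group mean
  x
}

for (size in c(2, 3, 5, 10)) {
  anon <- microagg(income, size)
  loss <- mean(abs(income - anon))           # crude utility measure
  cat(sprintf("group size %2d -> mean abs change %8.1f\n", size, loss))
}
```

Reading the output top to bottom shows the trade-off directly: you would pick the smallest group size whose privacy level (measured separately) still meets your target.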

Link to people to contact at LMU (data stewards? open science team?)

Exercise

Evaluate the utility of the dataset using sdcMicro functions.

After anonymization, you need to check that the data is still useful - that the patterns you care about (e.g., the relationship between religion and political opinion) have not been destroyed. sdcMicro provides utility measures, but it is also worth comparing distributions and running the actual analysis on both the original and anonymised data.

Key utility concepts:

  • IL1 — average absolute difference between original and anonymised numeric values (lower = better)
  • Eigen — compares the eigenvalues of the covariance matrices before and after; a difference close to 0% means the relationships between the numeric variables are preserved
  • dUtility() / print(sdc_micro) — numeric utility measures and the overall summary
  • Visual comparison — plotting distributions side-by-side is often the most intuitive check
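The eigenvalue idea can be illustrated in a few lines (a toy sketch of the concept, not sdcMicro's exact implementation): if anonymization preserves the covariance structure, the eigenvalues of the covariance matrices before and after stay close, and their relative difference stays near 0.

```r
# Toy illustration: mild perturbation barely changes the eigenvalues
# of the covariance matrix, so the relative difference stays near 0.
set.seed(1)
orig <- cbind(income = rnorm(100, 50000, 10000),
              years  = rnorm(100, 10, 3))
anon <- orig + cbind(rnorm(100, 0, 500),   # small noise on income
                     rnorm(100, 0, 0.5))   # small noise on years

e_orig <- eigen(cov(orig))$values
e_anon <- eigen(cov(anon))$values

rel_diff <- sum(abs(e_orig - e_anon)) / sum(e_orig)
rel_diff  # close to 0 -> covariance structure preserved
```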

Exercise: Evaluating Data Utility

Use the sdc_micro results from the previous exercises and compare them to the original data.

  1. Print the utility summary of sdc_micro. What do IL1 and the eigen measure tell you?

  2. Compare the income distribution before and after anonymisation. Plot histograms of the original and anonymised income side-by-side.

  3. Re-run the analysis that motivated data collection: compute the mean political opinion scores (pol_immigration, pol_redistribution) by religion group on both the original and anonymised data. Are the group differences preserved?

  4. Reflect: based on your findings, is the anonymised dataset fit for purpose? Would you adjust any of the anonymisation steps?

Step 1 – Utility summary

sdc_micro
The input dataset consists of 200 rows and 13 variables.
  --> Categorical key variables: gender, age, education, plz
  --> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:

Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
 Key Variable  Number of categories        Mean size  Size of smallest (>0)
       gender                 3 (3)  66.667 (66.667)                11 (11)
          age               66 (66)    3.030 (3.030)                  1 (1)
    education                 5 (5)  40.000 (40.000)                  5 (5)
          plz             161 (161)    1.242 (1.242)                  1 (1)
----------------------------------------------------------------------
Infos on 2/3-Anonymity:

Number of observations violating
  - 2-anonymity: 200 (100.000%)
  - 3-anonymity: 200 (100.000%)
  - 5-anonymity: 200 (100.000%)

----------------------------------------------------------------------
Numerical key variables: income, years_in_job

Disclosure risk is currently between [0.00%; 100.00%]

Current Information Loss:
  - IL1: 7.73
  - Difference of Eigenvalues: 0.010%
----------------------------------------------------------------------

IL1 close to 0 indicates little numeric distortion, and an eigenvalue difference close to 0% indicates that the covariance structure between the numeric variables is preserved. If you applied strong microaggregation (a large aggregation level, i.e. larger groups of records replaced by their mean), you will see a higher IL1 and a larger eigenvalue difference.

Step 2 – Compare income distributions

library(ggplot2)
library(tibble)   # tibble() is not attached by ggplot2

# Extract anonymised income from the sdc object
income_original   <- data_withoutdirectidentifiers$income
income_anonymised <- get.sdcMicroObj(sdc_micro, "manipNumVars")$income

income_compare <- tibble(
  income = c(income_original, income_anonymised),
  version = rep(c("Original", "Anonymised"), each = length(income_original))
)

ggplot(income_compare, aes(x = income, fill = version)) +
  geom_histogram(alpha = 0.6, bins = 30, position = "identity") +
  scale_fill_manual(values = c("Original" = "#2E4057", "Anonymised" = "#E07A5F")) +
  labs(
    title = "Income distribution: original vs. anonymised",
    x = "Annual income (€)", y = "Count", fill = NULL
  ) +
  theme_minimal()

The shapes should look similar. A heavily chunked distribution (tall bars at regular intervals) would indicate that microaggregation groups were too large.

Check data: anonymized data virtually identical with non-anonymized

Step 3 – Check whether the key analysis still works

library(dplyr)   # %>%, group_by(), summarise(), bind_cols(), select()

# Analysis on original data
original_summary <- data_withoutdirectidentifiers %>%
  group_by(religion) %>%
  summarise(
    mean_immigration    = mean(pol_immigration,    na.rm = TRUE),
    mean_redistribution = mean(pol_redistribution, na.rm = TRUE),
    n = n()
  )

# Rebuild anonymised dataset with the manipulated variables
anon_data <- get.sdcMicroObj(sdc_micro, "manipKeyVars") %>%
  bind_cols(get.sdcMicroObj(sdc_micro, "manipNumVars")) %>%
  bind_cols(
    data_withoutdirectidentifiers %>% select(religion, starts_with("pol_"))
  )

anonymised_summary <- anon_data %>%
  group_by(religion) %>%
  summarise(
    mean_immigration    = mean(pol_immigration,    na.rm = TRUE),
    mean_redistribution = mean(pol_redistribution, na.rm = TRUE),
    n = n()
  )

original_summary
# A tibble: 7 × 4
  religion          mean_immigration mean_redistribution     n
  <chr>                        <dbl>               <dbl> <int>
1 Buddhism                      4                   3.33     3
2 Catholicism                   2.73                2.79    56
3 Eastern Orthodoxy             4                   3        1
4 Islam                         2.69                3.23    13
5 Judaism                       1.5                 3        2
6 None                          2.97                3.17    75
7 Protestantism                 3.04                3.12    50
anonymised_summary
# A tibble: 7 × 4
  religion          mean_immigration mean_redistribution     n
  <chr>                        <dbl>               <dbl> <int>
1 Buddhism                      4                   3.33     3
2 Catholicism                   2.73                2.79    56
3 Eastern Orthodoxy             4                   3        1
4 Islam                         2.69                3.23    13
5 Judaism                       1.5                 3        2
6 None                          2.97                3.17    75
7 Protestantism                 3.04                3.12    50

If the group means are close between both summaries, the anonymisation has preserved the analytical signal. Political opinion variables were not altered (they are neither key variables nor numeric variables in the sdcMicro object), so differences here would only arise if the grouping structure changed dramatically.

Step 4 – Reflection

There is no single correct answer. Consider:

  • If income distributions look very similar and group means are preserved → the anonymisation is fit for purpose for this analysis
  • If you used k = 5 and results still look good, relaxing to k = 3 would reduce distortion while still providing some protection, but weigh this against the risk reduction you found in the earlier exercises
  • Variables like plz_region and age_group were generalized; if a future analysis requires finer geographic or age breakdowns, you would need to rework the anonymization strategy

Learning Objective

  • After completing this part of the tutorial, you will be able to make informed decisions when balancing the risks and utility of the anonymized data.

Exercises

  • apply measurement of k-anonymity

  • apply measurement of utility


References

Carvalho, Tânia, Nuno Moniz, Pedro Faria, and Luís Antunes. 2023. “Survey on Privacy-Preserving Techniques for Microdata Publication.” ACM Computing Surveys 55 (14s): 1–42. https://doi.org/10.1145/3588765.
Jansen, Luisa, Nele Borgert, and Malte Elson. 2025. “On the Tension Between Open Data and Data Protection in Research.” Pre-published April 7. https://doi.org/10.31234/osf.io/5jt3s_v2.
Jansen, Luisa, Tim Ulmann, Robine Jordi, and Malte Elson. 2026. “Putting Privacy to the Test: Introducing Red Teaming for Research Data Anonymization.” arXiv:2601.19575. https://doi.org/10.48550/arXiv.2601.19575.