Balancing Utility and Privacy

Summarize ideas by Jansen et al. (2025)

Measuring Privacy

  • Calculate privacy levels (k-anonymity) and compare them to the values measured before anonymization

  • Red teaming data anonymization (Jansen et al. 2026)

Include an exercise on calculating the privacy level with k-anonymity, as before in 2_1_Planning Privacy
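The k-anonymity calculation that this exercise revisits can be sketched in base R. This is a minimal illustration on toy data (not the course dataset): each record's k is the size of its quasi-identifier equivalence class, and the dataset's k-anonymity is the minimum over all records.

```r
# Toy data with two quasi-identifiers (hypothetical values)
df <- data.frame(gender = c("f", "f", "m", "m", "m"),
                 age    = c(34, 34, 51, 51, 51))

# k for each record = number of records sharing its gender/age combination
key  <- paste(df$gender, df$age)
df$k <- ave(seq_len(nrow(df)), key, FUN = length)

# Dataset-level k-anonymity = smallest equivalence class
min(df$k)  # 2: every record shares its combination with at least one other
```

With real data, the quasi-identifiers would be the categorical key variables declared in the sdcMicro object.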

Measuring Utility

Provide an overview of utility measures and possibly try one out

  • utility indices (Carvalho et al. 2023)
    • predictive performance measures for machine learning (potentially as call-out box)
    • information loss measures:
      • distance/distribution comparisons
      • penalties for transformations such as generalisation and suppression
      • statistical differences
  • assessing utility for the current use case
    • perform statistical analysis before and after anonymization
    • see if results are comparable

Include exercise for calculating one information loss measure

Striking the Right Balance

  • iterative process: rework anonymization after measuring both

  • explain the norms for acceptable values of privacy and utility indicators

  • People to contact at LMU (data stewards? open science team?)

Exercise

Evaluate the utility of the dataset using sdcMicro functions

After anonymisation, you need to check that the data is still useful — that the patterns you care about (e.g., the relationship between religion and political opinion) have not been destroyed. sdcMicro provides utility measures, but it is also worth comparing distributions and running the actual analysis on both the original and anonymised data.

Key utility concepts:

  • IL1 — average absolute difference between original and anonymised numeric values (lower = better)
  • Eigen — compares the covariance structure; an eigenvalue difference close to 0% means the relationships between variables are preserved
  • utility() / print(..., type = "utility") — overall summary
  • Visual comparison — plotting distributions side-by-side is often the most intuitive check
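To make the IL1 idea concrete, here is a simplified sketch on toy numbers (not the tutorial dataset): the average absolute difference between original and anonymised values. Note that sdcMicro's actual IL1 additionally standardises by each variable's spread, so this is an illustration of the principle, not the exact formula.

```r
# Toy vectors standing in for a numeric key variable before/after anonymisation
income_orig <- c(31000, 45000, 52000, 60000)
income_anon <- c(32000, 44000, 53000, 60000)  # e.g. after mild perturbation

# Simplified IL1-style measure: mean absolute difference (lower = better)
il1_simple <- mean(abs(income_orig - income_anon))
il1_simple  # 750: on average, each value moved by 750
```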

Exercise: Evaluating Data Utility

Use sdc_micro (the microaggregation result from the previous exercise) and compare it to the original data.

  1. Print the utility summary of sdc_micro. What do IL1 and the eigen measure tell you?

  2. Compare the income distribution before and after anonymisation. Plot histograms of the original and anonymised income side-by-side.

  3. Re-run the analysis that motivated data collection: compute the mean political opinion scores (pol_immigration, pol_redistribution) by religion group on both the original and anonymised data. Are the group differences preserved?

  4. Reflect: based on your findings, is the anonymised dataset fit for purpose? Would you adjust any of the anonymisation steps?

Solution

Step 1 – Utility summary

sdc_micro
The input dataset consists of 200 rows and 13 variables.
  --> Categorical key variables: gender, age, education, plz
  --> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:

Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
 Key Variable Number of categories       Mean size Size of smallest (>0)
       gender               3    (3) 66.667 (66.667)              11 (11)
          age              66   (66)  3.030  (3.030)               1  (1)
    education               5    (5) 40.000 (40.000)               5  (5)
          plz             161  (161)  1.242  (1.242)               1  (1)
----------------------------------------------------------------------
Infos on 2/3-Anonymity:

Number of observations violating
  - 2-anonymity: 200 (100.000%)
  - 3-anonymity: 200 (100.000%)
  - 5-anonymity: 200 (100.000%)

----------------------------------------------------------------------
Numerical key variables: income, years_in_job

Disclosure risk is currently between [0.00%; 100.00%]

Current Information Loss:
  - IL1: 7.73
  - Difference of Eigenvalues: 0.010%
----------------------------------------------------------------------

An IL1 close to 0 indicates little numeric distortion. An eigenvalue difference close to 0% indicates that the covariance structure between the numeric variables is preserved. If you applied strong microaggregation (a large aggregation level, i.e. large groups), you will see a higher IL1 and a larger eigenvalue difference.
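The eigenvalue comparison can be sketched as follows on toy data (this illustrates the principle; sdcMicro's exact formula may differ): compute the eigenvalues of the covariance matrix of the numeric variables before and after anonymisation and compare them.

```r
# Toy numeric key variables (hypothetical values)
set.seed(42)
orig <- cbind(income = rnorm(100, 40000, 8000),
              years  = rnorm(100, 10, 4))
anon <- orig + matrix(rnorm(200, 0, 100), ncol = 2)  # mild perturbation

ev_orig <- eigen(cov(orig))$values
ev_anon <- eigen(cov(anon))$values

# Relative eigenvalue difference: close to 0 => covariance structure preserved
sum(abs(ev_orig - ev_anon)) / sum(ev_orig)
```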

Step 2 – Compare income distributions

library(ggplot2)
library(tibble)   # for tibble()

# Extract anonymised income from the sdc object
income_original   <- data_withoutdirectidentifiers$income
income_anonymised <- get.sdcMicroObj(sdc_micro, "manipNumVars")$income

income_compare <- tibble(
  income = c(income_original, income_anonymised),
  version = rep(c("Original", "Anonymised"), each = length(income_original))
)

ggplot(income_compare, aes(x = income, fill = version)) +
  geom_histogram(alpha = 0.6, bins = 30, position = "identity") +
  scale_fill_manual(values = c("Original" = "#2E4057", "Anonymised" = "#E07A5F")) +
  labs(
    title = "Income distribution: original vs. anonymised",
    x = "Annual income (€)", y = "Count", fill = NULL
  ) +
  theme_minimal()

The shapes should look similar. A heavily chunked distribution (tall bars at regular intervals) would indicate that microaggregation groups were too large.
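Chunking can also be checked numerically: microaggregation collapses values onto group means, so the number of distinct values drops. A minimal sketch on toy vectors (with the real data, compare income_original and income_anonymised from the previous chunk):

```r
# Toy vectors: original vs. microaggregated income (k = 2, hypothetical)
x <- c(31000, 33000, 45000, 47000, 52000, 54000)   # original
y <- c(32000, 32000, 46000, 46000, 53000, 53000)   # group means

length(unique(x))  # 6 distinct values
length(unique(y))  # 3 distinct values: each group of 2 shares one mean
```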

TODO: check for errors — the anonymised data appear virtually identical to the non-anonymised data.

Step 3 – Check whether the key analysis still works

library(dplyr)

# Analysis on original data
original_summary <- data_withoutdirectidentifiers %>%
  group_by(religion) %>%
  summarise(
    mean_immigration    = mean(pol_immigration,    na.rm = TRUE),
    mean_redistribution = mean(pol_redistribution, na.rm = TRUE),
    n = n()
  )

# Rebuild anonymised dataset with the manipulated variables
anon_data <- get.sdcMicroObj(sdc_micro, "manipKeyVars") %>%
  bind_cols(get.sdcMicroObj(sdc_micro, "manipNumVars")) %>%
  bind_cols(
    data_withoutdirectidentifiers %>% select(religion, starts_with("pol_"))
  )

anonymised_summary <- anon_data %>%
  group_by(religion) %>%
  summarise(
    mean_immigration    = mean(pol_immigration,    na.rm = TRUE),
    mean_redistribution = mean(pol_redistribution, na.rm = TRUE),
    n = n()
  )

original_summary
# A tibble: 7 × 4
  religion          mean_immigration mean_redistribution     n
  <chr>                        <dbl>               <dbl> <int>
1 Buddhism                      3                   4        1
2 Catholicism                   2.74                3.26    53
3 Eastern Orthodoxy             4.33                2.33     3
4 Islam                         2.62                3       13
5 Judaism                       2.8                 2.4      5
6 None                          2.87                3.21    86
7 Protestantism                 2.97                3.28    39
anonymised_summary
# A tibble: 7 × 4
  religion          mean_immigration mean_redistribution     n
  <chr>                        <dbl>               <dbl> <int>
1 Buddhism                      3                   4        1
2 Catholicism                   2.74                3.26    53
3 Eastern Orthodoxy             4.33                2.33     3
4 Islam                         2.62                3       13
5 Judaism                       2.8                 2.4      5
6 None                          2.87                3.21    86
7 Protestantism                 2.97                3.28    39

If the group means are close between both summaries, the anonymisation has preserved the analytical signal. Political opinion variables were not altered (they are neither key variables nor numeric variables in the sdcMicro object), so differences here would only arise if the grouping structure changed dramatically.
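One way to quantify "close" is the largest absolute deviation between the two sets of group means. The vectors below use the mean_immigration values from the two printed summaries, which are identical here:

```r
# Group means taken from original_summary and anonymised_summary above
means_orig <- c(3.00, 2.74, 4.33, 2.62, 2.80, 2.87, 2.97)
means_anon <- c(3.00, 2.74, 4.33, 2.62, 2.80, 2.87, 2.97)

# Maximum absolute deviation across religion groups
max(abs(means_orig - means_anon))  # 0: group means fully preserved
```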

Step 4 – Reflection

There is no single correct answer. Consider:

  • If income distributions look very similar and group means are preserved → the anonymisation is fit for purpose for this analysis
  • If you used k = 5 and the results still look good, you could keep it; lowering the aggregation level to k = 3 would reduce distortion while still providing some protection, but weigh this against the risk reduction you found in the earlier exercises
  • Variables like plz_region and age_group were generalised; if a future analysis requires finer geographic or age breakdowns, you would need to negotiate a different anonymisation strategy with the data controller

Learning Objective

  • After completing this part of the tutorial, you will be able to make informed decisions when balancing the risks and utility of the anonymized data.

Exercises

  • apply measurement of k-anonymity

  • apply measurement of utility


References

Carvalho, Tânia, Nuno Moniz, Pedro Faria, and Luís Antunes. 2023. “Survey on Privacy-Preserving Techniques for Microdata Publication.” ACM Computing Surveys 55 (14s): 1–42. https://doi.org/10.1145/3588765.
Jansen, Luisa, Nele Borgert, and Malte Elson. 2025. “On the Tension Between Open Data and Data Protection in Research.” Pre-published April 7. https://doi.org/10.31234/osf.io/5jt3s_v2.
Jansen, Luisa, Tim Ulmann, Robine Jordi, and Malte Elson. 2026. “Putting Privacy to the Test: Introducing Red Teaming for Research Data Anonymization.” arXiv:2601.19575. arXiv. https://doi.org/10.48550/arXiv.2601.19575.