Balancing Utility and Privacy

  • After completing this part of the tutorial, you will be able to make informed decisions when balancing the risks and utility of the anonymized data.

As we have seen, anonymization techniques are all about hiding or obscuring the data or links within the data. This, of course, comes at a cost: The data may lose some of its utility.

We therefore need to find a well-calibrated balance between utility and anonymization - the famous sweet spot where we share data “as open as possible, as closed as necessary.” To achieve this, we can draw on various measures, many of which we have already seen in the chapter on assessing privacy risks and in various sdcMicro outputs.

Measuring Privacy

The most straightforward way to measure privacy after anonymization is to compare k-anonymity before and after. In the chapter on assessing privacy risks, we found that nearly every participant in the original dataset had a unique combination of key variables. After applying non-perturbative techniques in the chapter on non-perturbative techniques, we achieved 3-anonymity - every record shares its key variable combination with at least two others. Printing the sdcObject gives you the current k-anonymity level directly:

sdc_nonpert
The input dataset consists of 200 rows and 12 variables.
  --> Categorical key variables: gender, age, education, plz
  --> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:

Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
 Key Variable Number of categories        Mean size         
       <char>               <char> <char>    <char>   <char>
       gender                    4    (3)    64.667 (66.667)
          age                    5   (52)    40.000  (3.846)
    education                    5    (5)    32.000 (40.000)
          plz                    3  (152)    99.000  (1.316)
 Size of smallest (>0)       
                <char> <char>
                     4   (10)
                    23    (1)
                     5    (9)
                    90    (1)
----------------------------------------------------------------------
Infos on 2/3-Anonymity:

Number of observations violating
  - 2-anonymity: 0 (0.000%) | in original data: 198 (99.000%)
  - 3-anonymity: 0 (0.000%) | in original data: 200 (100.000%)
  - 5-anonymity: 29 (14.500%) | in original data: 200 (100.000%)

----------------------------------------------------------------------
Numerical key variables: income, years_in_job

Disclosure risk (~100.00% in original data):
  modified data: [0.00%; 92.00%]

Current Information Loss in modified data (0.00% in original data):
  IL1: 495.71
  Difference of Eigenvalues: 6.200%
----------------------------------------------------------------------
Local suppression:
    KeyVar      | Suppressions (#)      | Suppressions (%)
    <char> <char>            <int> <char>           <char>
    gender      |                4      |            2.000
       age      |                0      |            0.000
 education      |               12      |            6.000
       plz      |                0      |            0.000
----------------------------------------------------------------------

The k-anonymity and disclosure risks are objective values, but always need to be considered in context: What variables did we define within the sdcObject? What external data exist? What is the sampling strategy and what is known about that?

In practice, it can be challenging to correctly assess these risks given the complexity of research projects and existing information. In such cases, red teaming the data anonymization can help (Jansen et al. 2026): Another person (e.g., another member of your lab) is instructed to play the role of an attacker, by, e.g., searching for external information or singling out the least protected individuals.

In general, it is worth re-checking the assumptions of the anonymization process every few years after publication of the data, especially if new techniques have emerged (see callout box below).

Note: Deanonymization Using AI

AI tools are increasingly being used to re-identify individuals from datasets that were considered anonymous. Recent research shows that large language models (LLMs) can match anonymous text profiles to named individuals across platforms - for example, linking an anonymous forum post to a person’s identified social media account (Lermen et al., n.d.). Separately, research has shown that individuals can be re-identified from behavioral traces alone, including web browsing history, geolocation data, and even face scans, by combining datasets that each seem innocuous on their own (Rocher et al. 2025).

For quantitative survey or experimental data of the kind covered in this tutorial, I do not expect AI tools to fundamentally change the re-identification risk - the attack surface is just not the same as for text or behavioral data. That said, AI can make an attacker more efficient: tasks like searching for external information, generating plausible attack combinations, or automating record linkage are all easier with modern AI tools.

Measuring Utility

How do we know whether our anonymized data is still useful? There are several approaches to measuring utility, ranging from simple comparisons to formal indices.

sdcMicro provides two built-in information loss measures for numeric variables. IL1 is the average absolute difference between original and anonymized values, relative to the original. A value of 0 means no change; higher values mean more distortion. The eigenvalue ratio compares the eigenvalues of the original and anonymized data; values close to 0 indicate minimal change. Both indicators are primarily useful for comparing different anonymization methods with each other.

Both values are part of the standard sdcMicro output after applying any anonymization technique.
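
To build intuition for what IL1 captures, here is a small self-contained sketch of an IL1-style measure - my own simplified illustration of the idea (scaled mean absolute differences), not sdcMicro's exact implementation:

```r
# Sketch of an IL1-style information loss measure: mean absolute
# difference per variable, scaled by that variable's standard deviation.
# (Simplified illustration; sdcMicro's IL1 uses a related but not
# identical formula.)
il1_sketch <- function(original, anonymized) {
  stopifnot(all(dim(original) == dim(anonymized)))
  mean(mapply(
    function(x, y) mean(abs(x - y), na.rm = TRUE) / sd(x, na.rm = TRUE),
    original, anonymized
  ))
}

set.seed(42)
orig  <- data.frame(income = rnorm(100, 50000, 10000))
noisy <- data.frame(income = orig$income + rnorm(100, 0, 1000))
il1_sketch(orig, orig)   # 0: no change at all
il1_sketch(orig, noisy)  # small positive value: little distortion
```

As with sdcMicro's IL1, the absolute number is hard to interpret on its own; its value lies in comparing candidate anonymization methods on the same data.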

Exercise: Evaluating Data Utility

Using sdcMicro’s Parameters

Let’s go back to the two sdcObjects we created in the chapter on perturbative techniques: sdc_micro (where we applied microaggregation) and sdc_noise (where we applied noise).

  1. Compare the utility with IL1 and the difference of eigenvalues. Which method do you prefer when looking at these utility indices?
  2. Make a decision for one of the methods and export the data for further use.

Simply inspect the sdcObjects to see the values of IL1 and the eigenvalue difference:

sdc_micro
The input dataset consists of 200 rows and 12 variables.
  --> Categorical key variables: gender, age, education, plz
  --> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:

Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
 Key Variable Number of categories        Mean size         
       <char>               <char> <char>    <char>   <char>
       gender                    4    (3)    64.667 (66.667)
          age                    5   (52)    40.000  (3.846)
    education                    5    (5)    32.000 (40.000)
          plz                    3  (152)    99.000  (1.316)
 Size of smallest (>0)       
                <char> <char>
                     4   (10)
                    23    (1)
                     5    (9)
                    90    (1)
----------------------------------------------------------------------
Infos on 2/3-Anonymity:

Number of observations violating
  - 2-anonymity: 0 (0.000%) | in original data: 198 (99.000%)
  - 3-anonymity: 0 (0.000%) | in original data: 200 (100.000%)
  - 5-anonymity: 29 (14.500%) | in original data: 200 (100.000%)

----------------------------------------------------------------------
Numerical key variables: income, years_in_job

Disclosure risk (~100.00% in original data):
  modified data: [0.00%; 77.50%]

Current Information Loss in modified data (0.00% in original data):
  IL1: 1029.80
  Difference of Eigenvalues: 6.100%
----------------------------------------------------------------------
Local suppression:
    KeyVar      | Suppressions (#)      | Suppressions (%)
    <char> <char>            <int> <char>           <char>
    gender      |                4      |            2.000
       age      |                0      |            0.000
 education      |               12      |            6.000
       plz      |                0      |            0.000
----------------------------------------------------------------------
sdc_noise
The input dataset consists of 200 rows and 12 variables.
  --> Categorical key variables: gender, age, education, plz
  --> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:

Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
 Key Variable Number of categories        Mean size         
       <char>               <char> <char>    <char>   <char>
       gender                    4    (3)    64.667 (66.667)
          age                    5   (52)    40.000  (3.846)
    education                    5    (5)    32.000 (40.000)
          plz                    3  (152)    99.000  (1.316)
 Size of smallest (>0)       
                <char> <char>
                     4   (10)
                    23    (1)
                     5    (9)
                    90    (1)
----------------------------------------------------------------------
Infos on 2/3-Anonymity:

Number of observations violating
  - 2-anonymity: 0 (0.000%) | in original data: 198 (99.000%)
  - 3-anonymity: 0 (0.000%) | in original data: 200 (100.000%)
  - 5-anonymity: 29 (14.500%) | in original data: 200 (100.000%)

----------------------------------------------------------------------
Numerical key variables: income, years_in_job

Disclosure risk (~100.00% in original data):
  modified data: [0.00%; 76.50%]

Current Information Loss in modified data (0.00% in original data):
  IL1: 893.79
  Difference of Eigenvalues: 6.460%
----------------------------------------------------------------------
Local suppression:
    KeyVar      | Suppressions (#)      | Suppressions (%)
    <char> <char>            <int> <char>           <char>
    gender      |                4      |            2.000
       age      |                0      |            0.000
 education      |               12      |            6.000
       plz      |                0      |            0.000
----------------------------------------------------------------------

The two indices disagree here: IL1 is smaller for the noise approach (893.79 vs. 1029.80), indicating that its values stay closer to the original data, while the difference of eigenvalues is slightly smaller for microaggregation (6.100% vs. 6.460%), indicating that microaggregation better preserves the multivariate structure. Since the relationships between variables matter most for our analyses, I opt for microaggregation.

Now I can extract the dataset from the sdc_micro object. Because sdcMicro stores recoded key variables as integer codes internally, I need to restore the labels manually using mutate:

data_anonymized <- extractManipData(sdc_micro)

# Restore labels for variables recoded inside sdcMicro
age_labels    <- c("18-29", "30-39", "40-49", "50-59", "60+")
plz_labels    <- c("8xxxx", "other")

data_anonymized <- data_anonymized %>%
  mutate(
    age = factor(age, levels = seq_along(age_labels), labels = age_labels),
    plz = factor(plz, levels = seq_along(plz_labels), labels = plz_labels)
  )

As a last anonymization step, I will now generalize rare religion categories using fct_lump_min() from forcats. Since religion is not a variable in the sdcObject, I apply this directly to the dataset as an additional layer, protecting individuals with rare religions.

# Inspect counts first
table(data_anonymized$religion)

         Buddhism       Catholicism Eastern Orthodoxy             Islam 
                3                56                 1                13 
          Judaism              None     Protestantism 
                2                75                50 
data_anonymized <- data_anonymized %>%
  mutate(religion = fct_lump_min(as.factor(religion), min = 10, other_level = "Other"))

table(data_anonymized$religion)

  Catholicism         Islam          None Protestantism         Other 
           56            13            75            50             6 

fct_lump_min(religion, min = 10) merges any category with fewer than 10 observations into "Other". Groups like “Judaism” or “Buddhism”, which have very few members in this dataset, are natural candidates for collapsing, since a person could potentially be identified through the combination of a rare religion and other attributes.

Using Statistics

For non-numeric variables that were recoded or suppressed, there is no single formal measure - but you can compare frequency tables before and after. For example, comparing the distribution of age bands or religion categories tells you whether the recoding preserved the overall composition of the sample.
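
A quick way to do this is to put the category proportions side by side. The helper below is a self-contained toy sketch (compare_props and the example vectors are my own, not part of the tutorial's data); in practice you would pass the education or age columns of the original and anonymized datasets:

```r
# Compare category proportions before and after recoding/suppression.
compare_props <- function(before, after) {
  list(
    before = round(prop.table(table(before, useNA = "ifany")), 3),
    after  = round(prop.table(table(after,  useNA = "ifany")), 3)
  )
}

# Toy example: two suppressed values become NA after anonymization
edu_before <- c(rep("high school", 10), rep("university", 6), rep("no degree", 2))
edu_after  <- c(rep("high school", 10), rep("university", 6), rep(NA, 2))
compare_props(edu_before, edu_after)
```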

Beyond these built-in measures, the most practically relevant utility check is simply to re-run the analysis that motivated data collection on both the original and anonymized datasets and compare the results. For our dataset, that means checking whether the association between religion and political opinion is similar to before anonymization. If the key result holds, the anonymized data is fit for purpose.
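
Schematically, that comparison looks like the following. Since the tutorial's datasets are not reproduced here, the sketch simulates a toy dataset with a known religion-opinion association and checks whether it survives the lumping step (all object names below are illustrative, not from the tutorial):

```r
library(forcats)

set.seed(1)
religion <- sample(c("A", "B", "C", "rare"), 200, replace = TRUE,
                   prob = c(0.4, 0.35, 0.2, 0.05))
opinion  <- 3 + 0.5 * (religion == "A") + rnorm(200, 0, 0.5)
d_before <- data.frame(religion = factor(religion), opinion = opinion)

# "Anonymize": lump rare categories, as done with fct_lump_min() above
d_after <- d_before
d_after$religion <- fct_lump_min(d_after$religion, min = 20, other_level = "Other")

# Re-run the substantive comparison on both versions
aggregate(opinion ~ religion, data = d_before, FUN = mean)
aggregate(opinion ~ religion, data = d_after,  FUN = mean)
```

If the group means (or model coefficients) of interest barely move, the anonymized data is fit for purpose; if they shift substantially, the anonymization was too aggressive for this analysis.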

  1. Compare descriptive statistics on the indirect identifiers age, gender, education, income, plz, and years in job before and after anonymization.
  2. Compare the relation between religion and the political opinion variables pol_immigration, pol_environment, pol_redistribution, and pol_eu_integration before and after anonymization.
  3. Reflect on these comparisons: Is utility well-enough preserved?

Descriptive statistics on indirect identifiers

Compare distributions before and after using summary() and frequency tables:

# Before anonymization
summary(data_withoutdirectidentifiers[, c("age", "income", "years_in_job")])
      age            income        years_in_job  
 Min.   :18.00   Min.   :  1508   Min.   : 0.00  
 1st Qu.:30.00   1st Qu.: 36964   1st Qu.: 2.00  
 Median :41.00   Median : 52219   Median : 4.00  
 Mean   :40.99   Mean   : 60789   Mean   : 5.34  
 3rd Qu.:50.00   3rd Qu.: 64582   3rd Qu.: 8.00  
 Max.   :70.00   Max.   :902234   Max.   :32.00  
table(data_withoutdirectidentifiers$gender, useNA = "always") # Make sure to show NAs as well

    female       male non-binary       <NA> 
       104         86         10          0 
table(data_withoutdirectidentifiers$education, useNA = "always")

doctoral title    high school      no degree   trade school     university 
             9            108             10             36             37 
          <NA> 
             0 
table(data_withoutdirectidentifiers$plz, useNA = "always")

 1445  1587  2763  3253  4416  4821  6132  6217  6847  7806  7985  8141  8324 
    1     1     1     2     1     1     1     1     1     1     1     1     1 
 9380  9575 13055 13059 14822 18225 19209 21789 22523 23863 24161 24784 24863 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
25845 25879 26121 26131 27333 28755 29365 29525 31096 31249 31637 32457 32832 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
33719 33758 35619 35764 36043 36391 37079 37449 39120 39122 39444 39599 39649 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
44145 44869 45479 45657 46395 47226 47249 49429 50735 50859 52224 53225 53545 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
53773 54290 54306 54413 54455 54538 56340 56424 57648 58708 63920 64385 64546 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
64673 65623 65931 66509 66589 67468 67482 69437 72149 72224 72348 72505 72535 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
72660 73734 74388 74906 76137 76229 76448 76530 76698 76770 77756 77815 79256 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
79585 79771 79793 79875 80331 80333 80539 80636 80799 82110 82269 84130 84140 
    1     1     1     1     9     7    13    13    10     1     1     1     1 
84524 85777 86153 86492 86567 86685 86688 86690 86701 86759 87730 88260 88319 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
90441 90613 91097 91284 91731 92289 92339 93093 94113 94239 94366 94522 94535 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
94538 96176 96277 97222 97505 97922 97944 99706 99830  <NA> 
    1     1     1     1     1     1     1     1     1     0 
# After anonymization
summary(data_anonymized[, c("income", "years_in_job")])
     income       years_in_job  
 Min.   : 3576   Min.   : 0.00  
 1st Qu.:37246   1st Qu.: 2.00  
 Median :52156   Median : 4.00  
 Mean   :51130   Mean   : 5.16  
 3rd Qu.:64451   3rd Qu.: 8.00  
 Max.   :89186   Max.   :15.00  
table(data_anonymized$age, useNA = "always")

18-29 30-39 40-49 50-59   60+  <NA> 
   47    43    58    29    23     0 
table(data_anonymized$gender, useNA = "always")

    female       male non-binary       <NA> 
       104         86          4          6 
table(data_anonymized$education, useNA = "always")

doctoral title    high school   trade school     university           <NA> 
             5            106             22             27             40 
table(data_anonymized$plz, useNA = "always")

8xxxx other  <NA> 
  108    90     2 

Age and postal code are now categorical bands, so the numeric summary is no longer meaningful - use frequency tables instead. Income and years in job remain numeric after microaggregation; their medians are very close to the original, but the mean of income in particular is now lower, because the extreme outliers were top-coded.

For gender, the data of 6 non-binary participants was suppressed to achieve the necessary k-anonymity.

For education, all individuals without a degree were recoded to NA; in each of the other categories, a few individuals’ values were suppressed as well. This might be a problem for data utility, especially if our goal was to show that we reached individuals from all educational levels. In that case, I would recommend going back to the local suppression step and prioritizing education so that it remains non-suppressed.

We lost, as intended, a lot of information on plz and age. For our use case, this is okay.

Religion and political opinions

# Before
data_withoutdirectidentifiers %>%
  group_by(religion) %>%
  summarise(across(c(pol_immigration, pol_environment, pol_redistribution, pol_eu_integration), mean, .names = "mean_{.col}"))
# A tibble: 7 × 5
  religion      mean_pol_immigration mean_pol_environment mean_pol_redistribut…¹
  <chr>                        <dbl>                <dbl>                  <dbl>
1 Buddhism                      4                    3.33                   3.33
2 Catholicism                   2.73                 3.07                   2.79
3 Eastern Orth…                 4                    2                      3   
4 Islam                         2.69                 2.92                   3.23
5 Judaism                       1.5                  3.5                    3   
6 None                          2.97                 3.03                   3.17
7 Protestantism                 3.04                 3.2                    3.12
# ℹ abbreviated name: ¹​mean_pol_redistribution
# ℹ 1 more variable: mean_pol_eu_integration <dbl>
# After
data_anonymized %>%
  group_by(religion) %>%
  summarise(across(c(pol_immigration, pol_environment, pol_redistribution, pol_eu_integration), mean, .names = "mean_{.col}"))
# A tibble: 5 × 5
  religion      mean_pol_immigration mean_pol_environment mean_pol_redistribut…¹
  <fct>                        <dbl>                <dbl>                  <dbl>
1 Catholicism                   2.73                 3.07                   2.79
2 Islam                         2.69                 2.92                   3.23
3 None                          2.97                 3.03                   3.17
4 Protestantism                 3.04                 3.2                    3.12
5 Other                         3.17                 3.17                   3.17
# ℹ abbreviated name: ¹​mean_pol_redistribution
# ℹ 1 more variable: mean_pol_eu_integration <dbl>

The group-level means are essentially unchanged between the original and anonymized datasets. The key research question - whether religion predicts political opinion - is still answerable. Note that some small religion categories have been merged into “Other”, which changes the granularity of the analysis but preserves the main patterns. Other than that, we did not change anything about these variables.

Reflection

For the purpose of this dataset - studying the relationship between religion and political opinion - utility is well-preserved. The main variable of interest (religion) is still present, just with fewer fine-grained categories. The political opinion variables are unchanged. Income and years in job show very similar distributions after microaggregation.

Striking the Right Balance

Balancing privacy and utility is not a one-shot decision - it is an iterative process. After measuring both, you may find that your current anonymization is too aggressive (data is well-protected but the key analysis no longer works) or too lenient (utility is preserved but too many records are still unique). In either case, you go back to the anonymization steps and adjust: changing recoding thresholds, relaxing or tightening the k-anonymity target, or choosing a different perturbative technique.

For our dataset, a reasonable workflow looks like this: start by checking the k-anonymity level and disclosure risk on sdc_nonpert (from the chapter on non-perturbative techniques), then check the information loss after applying perturbative techniques on sdc_micro or sdc_noise (from the chapter on perturbative techniques), and finally compare the religion-political opinion analysis on original and anonymized data. If the analysis still holds and k-anonymity is at least 3, the anonymization is likely sufficient for sharing.

There are no universal thresholds for “good enough” privacy or utility. A commonly used rule of thumb is that k-anonymity should be at least 3-5 for most research datasets, meaning every combination of indirect identifiers appears at least 3-5 times (see discussion above). For utility, IL1 values and eigenvalue differences close to 0 indicate low distortion. But ultimately, the right balance depends on your specific context: how sensitive the data is, who will have access, and which analyses need to be supported. When in doubt, consult your institution’s data protection officer (here at LMU), data steward (or a local research data management team, here at LMU), or open science team (here at LMU).

Let’s keep in mind that, while there is a tension between data protection and openness, in practice these principles enable each other: Sharing data builds trust in research findings, while protecting participants builds trust in researchers. When done well, anonymization is the bridge between these goals - it lets you share data openly while honoring the privacy expectations of the people who provided it (Jansen et al. 2025).


References

Jansen, Luisa, Nele Borgert, and Malte Elson. 2025. “On the Tension Between Open Data and Data Protection in Research.” Pre-published April 7. https://doi.org/10.31234/osf.io/5jt3s_v2.
Jansen, Luisa, Tim Ulmann, Robine Jordi, and Malte Elson. 2026. Putting Privacy to the Test: Introducing Red Teaming for Research Data Anonymization. arXiv:2601.19575. arXiv. https://doi.org/10.48550/arXiv.2601.19575.
Lermen, Simon, Daniel Paleka, Joshua Swanson, Michael Aerni, Nicholas Carlini, and Florian Tramèr. n.d. Large-Scale Online Deanonymization with LLMs. https://doi.org/10.48550/ARXIV.2602.16800.
Rocher, Luc, Julien M. Hendrickx, and Yves-Alexandre De Montjoye. 2025. “A Scaling Law to Model the Effectiveness of Identification Techniques.” Nature Communications 16 (1): 347. https://doi.org/10.1038/s41467-024-55296-6.