Balancing Utility and Privacy

  • After completing this part of the tutorial, you will be able to make informed decisions when balancing the risks and utility of the anonymized data.

As we have seen, anonymization techniques are all about hiding or obscuring the data or links within the data. This, of course, comes at a cost: The data may lose some of its utility.

We therefore need to find a well-calibrated balance between utility and anonymization - the famous sweet spot where we share data “as open as possible, as closed as necessary.” To achieve this, we can draw on various measures, many of which we have already seen in the chapter on assessing privacy risks and in various sdcMicro outputs.

Measuring Privacy

The most straightforward way to measure privacy after anonymization is to compare k-anonymity before and after. In the chapter on assessing privacy risks, we found that nearly every participant in the original dataset had a unique combination of key variables. After applying non-perturbative techniques in the chapter on non-perturbative techniques, we achieved 3-anonymity - every record shares its key variable combination with at least two others. Printing the sdcObject gives you the current k-anonymity level directly:

sdc_nonpert
The input dataset consists of 200 rows and 12 variables.
  --> Categorical key variables: gender, age, education, plz
  --> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:

Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
 Key Variable Number of categories        Mean size         
       <char>               <char> <char>    <char>   <char>
       gender                    4    (3)    64.667 (66.667)
          age                    5   (52)    40.000  (3.846)
    education                    5    (5)    32.000 (40.000)
          plz                    3  (152)    99.000  (1.316)
 Size of smallest (>0)       
                <char> <char>
                     4   (10)
                    23    (1)
                     5    (9)
                    90    (1)
----------------------------------------------------------------------
Infos on 2/3-Anonymity:

Number of observations violating
  - 2-anonymity: 0 (0.000%) | in original data: 198 (99.000%)
  - 3-anonymity: 0 (0.000%) | in original data: 200 (100.000%)
  - 5-anonymity: 29 (14.500%) | in original data: 200 (100.000%)

----------------------------------------------------------------------
Numerical key variables: income, years_in_job

Disclosure risk (~100.00% in original data):
  modified data: [0.00%; 92.00%]

Current Information Loss in modified data (0.00% in original data):
  IL1: 495.71
  Difference of Eigenvalues: 6.200%
----------------------------------------------------------------------
Local suppression:
    KeyVar      | Suppressions (#)      | Suppressions (%)
    <char> <char>            <int> <char>           <char>
    gender      |                4      |            2.000
       age      |                0      |            0.000
 education      |               12      |            6.000
       plz      |                0      |            0.000
----------------------------------------------------------------------

The k-anonymity and disclosure risks are objective values, but always need to be considered in context: What variables did we define within the sdcObject? What external data exist? What is the sampling strategy and what is known about that?

In practice, it can be challenging to correctly assess these risks given the complexity of research projects and existing information. In such cases, red teaming the data anonymization can help (Jansen et al. 2026): Another person (e.g., another member of your lab) is instructed to play the role of an attacker, by, e.g., searching for external information or singling out the least protected individuals.

In general, it is worth re-checking the assumptions of the anonymization process every few years after publication of the data, especially if new techniques have emerged (see callout box below).

Note: Deanonymization Using AI

AI tools are increasingly being used to re-identify individuals from datasets that were considered anonymous. Recent research shows that large language models (LLMs) can match anonymous text profiles to named individuals across platforms - for example, linking an anonymous forum post to a person’s identified social media account (Lermen et al., n.d.). Separately, research has shown that individuals can be re-identified from behavioral traces alone, including web browsing history, geolocation data, and even face scans, by combining datasets that each seem innocuous on their own (Rocher et al. 2025).

For quantitative survey or experimental data of the kind covered in this tutorial, I do not expect AI tools to fundamentally change the re-identification risk - the attack surface is just not the same as for text or behavioral data. That said, AI can make an attacker more efficient: tasks like searching for external information, generating plausible attack combinations, or automating record linkage are all easier with modern AI tools.

Measuring Utility

How do we know whether our anonymized data is still useful? There are several approaches to measuring utility, ranging from simple comparisons to formal indices.

sdcMicro provides two built-in information loss measures for numeric variables. IL1 is the average absolute difference between original and anonymized values, relative to the original. A value of 0 means no change; higher values mean more distortion. The eigenvalue ratio compares the eigenvalues of the original and anonymized data; values close to 0 indicate minimal change. Both indicators are primarily useful for comparing different anonymization methods with each other.

Both values are part of the standard sdcMicro output after applying any anonymization technique.
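
To build intuition for what IL1 captures, here is a small self-contained sketch of an IL1-style measure - my own simplified illustration of the idea (scaled mean absolute differences), not sdcMicro's exact implementation:

```r
# Sketch of an IL1-style information loss measure: mean absolute
# difference per variable, scaled by that variable's standard deviation.
# (Simplified illustration; sdcMicro's IL1 uses a related but not
# identical formula.)
il1_sketch <- function(original, anonymized) {
  stopifnot(all(dim(original) == dim(anonymized)))
  mean(mapply(
    function(x, y) mean(abs(x - y), na.rm = TRUE) / sd(x, na.rm = TRUE),
    original, anonymized
  ))
}

set.seed(42)
orig  <- data.frame(income = rnorm(100, 50000, 10000))
noisy <- data.frame(income = orig$income + rnorm(100, 0, 1000))
il1_sketch(orig, orig)   # 0: no change at all
il1_sketch(orig, noisy)  # small positive value: little distortion
```

As with sdcMicro's IL1, the absolute number is hard to interpret on its own; its value lies in comparing candidate anonymization methods on the same data.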

Exercise: Evaluating Data Utility

Using sdcMicro’s Parameters

Let’s go back to the two sdcObjects we created in the chapter on perturbative techniques: sdc_micro (where we applied microaggregation) and sdc_noise (where we applied noise).

  1. Compare the utility with IL1 and the difference of eigenvalues. Which method do you prefer when looking at these utility indices?
  2. Make a decision for one of the methods and export the data for further use.

Simply inspect the sdcObjects to see the values of IL1 and the eigenvalue difference:

sdc_micro
The input dataset consists of 200 rows and 12 variables.
  --> Categorical key variables: gender, age, education, plz
  --> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:

Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
 Key Variable Number of categories        Mean size         
       <char>               <char> <char>    <char>   <char>
       gender                    4    (3)    64.667 (66.667)
          age                    5   (52)    40.000  (3.846)
    education                    5    (5)    32.000 (40.000)
          plz                    3  (152)    99.000  (1.316)
 Size of smallest (>0)       
                <char> <char>
                     4   (10)
                    23    (1)
                     5    (9)
                    90    (1)
----------------------------------------------------------------------
Infos on 2/3-Anonymity:

Number of observations violating
  - 2-anonymity: 0 (0.000%) | in original data: 198 (99.000%)
  - 3-anonymity: 0 (0.000%) | in original data: 200 (100.000%)
  - 5-anonymity: 29 (14.500%) | in original data: 200 (100.000%)

----------------------------------------------------------------------
Numerical key variables: income, years_in_job

Disclosure risk (~100.00% in original data):
  modified data: [0.00%; 77.50%]

Current Information Loss in modified data (0.00% in original data):
  IL1: 1029.80
  Difference of Eigenvalues: 6.100%
----------------------------------------------------------------------
Local suppression:
    KeyVar      | Suppressions (#)      | Suppressions (%)
    <char> <char>            <int> <char>           <char>
    gender      |                4      |            2.000
       age      |                0      |            0.000
 education      |               12      |            6.000
       plz      |                0      |            0.000
----------------------------------------------------------------------
sdc_noise
The input dataset consists of 200 rows and 12 variables.
  --> Categorical key variables: gender, age, education, plz
  --> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:

Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
 Key Variable Number of categories        Mean size         
       <char>               <char> <char>    <char>   <char>
       gender                    4    (3)    64.667 (66.667)
          age                    5   (52)    40.000  (3.846)
    education                    5    (5)    32.000 (40.000)
          plz                    3  (152)    99.000  (1.316)
 Size of smallest (>0)       
                <char> <char>
                     4   (10)
                    23    (1)
                     5    (9)
                    90    (1)
----------------------------------------------------------------------
Infos on 2/3-Anonymity:

Number of observations violating
  - 2-anonymity: 0 (0.000%) | in original data: 198 (99.000%)
  - 3-anonymity: 0 (0.000%) | in original data: 200 (100.000%)
  - 5-anonymity: 29 (14.500%) | in original data: 200 (100.000%)

----------------------------------------------------------------------
Numerical key variables: income, years_in_job

Disclosure risk (~100.00% in original data):
  modified data: [0.00%; 76.50%]

Current Information Loss in modified data (0.00% in original data):
  IL1: 893.79
  Difference of Eigenvalues: 6.460%
----------------------------------------------------------------------
Local suppression:
    KeyVar      | Suppressions (#)      | Suppressions (%)
    <char> <char>            <int> <char>           <char>
    gender      |                4      |            2.000
       age      |                0      |            0.000
 education      |               12      |            6.000
       plz      |                0      |            0.000
----------------------------------------------------------------------

The two indices disagree here: IL1 is smaller for the noise approach (893.79 vs. 1029.80), indicating that its values stay closer to the original data, while the difference of eigenvalues is slightly smaller for microaggregation (6.100% vs. 6.460%), indicating that microaggregation better preserves the multivariate structure. Since the relationships between variables matter most for our analyses, I opt for microaggregation.

Now I can extract the dataset from the sdc_micro object. Because sdcMicro stores recoded key variables as integer codes internally, I need to restore the labels manually using mutate:

data_anonymized <- extractManipData(sdc_micro)

# Restore labels for variables recoded inside sdcMicro
age_labels    <- c("18-29", "30-39", "40-49", "50-59", "60+")
plz_labels    <- c("8xxxx", "other")

data_anonymized <- data_anonymized %>%
  mutate(
    age = factor(age, levels = seq_along(age_labels), labels = age_labels),
    plz = factor(plz, levels = seq_along(plz_labels), labels = plz_labels)
  )

As a last anonymization step, I will now generalize rare religion categories using fct_lump_min() from forcats. Since religion is not a variable in the sdcObject, I apply this directly to the dataset as an additional layer, protecting individuals with rare religions.

# Inspect counts first
table(data_anonymized$religion)

         Buddhism       Catholicism Eastern Orthodoxy             Islam 
                3                56                 1                13 
          Judaism              None     Protestantism 
                2                75                50 
data_anonymized <- data_anonymized %>%
  mutate(religion = fct_lump_min(as.factor(religion), min = 10, other_level = "Other"))

table(data_anonymized$religion)

  Catholicism         Islam          None Protestantism         Other 
           56            13            75            50             6 

fct_lump_min(religion, min = 10) merges any category with fewer than 10 observations into "Other". Groups like “Judaism” or “Buddhism”, which have very few members in this dataset, are natural candidates for collapsing, since a person could potentially be identified through the combination of a rare religion and other attributes.

Using Statistics

For non-numeric variables that were recoded or suppressed, there is no single formal measure - but you can compare frequency tables before and after. For example, comparing the distribution of age bands or religion categories tells you whether the recoding preserved the overall composition of the sample.
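
A quick way to do this is to put the category proportions side by side. The helper below is a self-contained toy sketch (compare_props and the example vectors are my own, not part of the tutorial's data); in practice you would pass the education or age columns of the original and anonymized datasets:

```r
# Compare category proportions before and after recoding/suppression.
compare_props <- function(before, after) {
  list(
    before = round(prop.table(table(before, useNA = "ifany")), 3),
    after  = round(prop.table(table(after,  useNA = "ifany")), 3)
  )
}

# Toy example: two suppressed values become NA after anonymization
edu_before <- c(rep("high school", 10), rep("university", 6), rep("no degree", 2))
edu_after  <- c(rep("high school", 10), rep("university", 6), rep(NA, 2))
compare_props(edu_before, edu_after)
```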

Beyond these built-in measures, the most practically relevant utility check is simply to re-run the analysis that motivated data collection on both the original and anonymized datasets and compare the results. For our dataset, that means checking whether the association between religion and political opinion is similar to before anonymization. If the key result holds, the anonymized data is fit for purpose.
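
Schematically, that comparison looks like the following. Since the tutorial's datasets are not reproduced here, the sketch simulates a toy dataset with a known religion-opinion association and checks whether it survives the lumping step (all object names below are illustrative, not from the tutorial):

```r
library(forcats)

set.seed(1)
religion <- sample(c("A", "B", "C", "rare"), 200, replace = TRUE,
                   prob = c(0.4, 0.35, 0.2, 0.05))
opinion  <- 3 + 0.5 * (religion == "A") + rnorm(200, 0, 0.5)
d_before <- data.frame(religion = factor(religion), opinion = opinion)

# "Anonymize": lump rare categories, as done with fct_lump_min() above
d_after <- d_before
d_after$religion <- fct_lump_min(d_after$religion, min = 20, other_level = "Other")

# Re-run the substantive comparison on both versions
aggregate(opinion ~ religion, data = d_before, FUN = mean)
aggregate(opinion ~ religion, data = d_after,  FUN = mean)
```

If the group means (or model coefficients) of interest barely move, the anonymized data is fit for purpose; if they shift substantially, the anonymization was too aggressive for this analysis.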

  1. Compare descriptive statistics on the indirect identifiers age, gender, education, income, plz, and years in job before and after anonymization.
  2. Compare the relation between religion and the political opinion variables pol_immigration, pol_environment, pol_redistribution, and pol_eu_integration before and after anonymization.
  3. Reflect on these comparisons: Is utility well-enough preserved?

Descriptive statistics on indirect identifiers

Compare distributions before and after using summary() and frequency tables:

# Before anonymization
summary(data_withoutdirectidentifiers[, c("age", "income", "years_in_job")])
      age            income        years_in_job  
 Min.   :18.00   Min.   :  1508   Min.   : 0.00  
 1st Qu.:30.00   1st Qu.: 36964   1st Qu.: 2.00  
 Median :41.00   Median : 52219   Median : 4.00  
 Mean   :40.99   Mean   : 60789   Mean   : 5.34  
 3rd Qu.:50.00   3rd Qu.: 64582   3rd Qu.: 8.00  
 Max.   :70.00   Max.   :902234   Max.   :32.00  
table(data_withoutdirectidentifiers$gender, useNA = "always") # Make sure to show NAs as well

    female       male non-binary       <NA> 
       104         86         10          0 
table(data_withoutdirectidentifiers$education, useNA = "always")

doctoral title    high school      no degree   trade school     university 
             9            108             10             36             37 
          <NA> 
             0 
table(data_withoutdirectidentifiers$plz, useNA = "always")

 1445  1587  2763  3253  4416  4821  6132  6217  6847  7806  7985  8141  8324 
    1     1     1     2     1     1     1     1     1     1     1     1     1 
 9380  9575 13055 13059 14822 18225 19209 21789 22523 23863 24161 24784 24863 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
25845 25879 26121 26131 27333 28755 29365 29525 31096 31249 31637 32457 32832 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
33719 33758 35619 35764 36043 36391 37079 37449 39120 39122 39444 39599 39649 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
44145 44869 45479 45657 46395 47226 47249 49429 50735 50859 52224 53225 53545 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
53773 54290 54306 54413 54455 54538 56340 56424 57648 58708 63920 64385 64546 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
64673 65623 65931 66509 66589 67468 67482 69437 72149 72224 72348 72505 72535 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
72660 73734 74388 74906 76137 76229 76448 76530 76698 76770 77756 77815 79256 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
79585 79771 79793 79875 80331 80333 80539 80636 80799 82110 82269 84130 84140 
    1     1     1     1     9     7    13    13    10     1     1     1     1 
84524 85777 86153 86492 86567 86685 86688 86690 86701 86759 87730 88260 88319 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
90441 90613 91097 91284 91731 92289 92339 93093 94113 94239 94366 94522 94535 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
94538 96176 96277 97222 97505 97922 97944 99706 99830  <NA> 
    1     1     1     1     1     1     1     1     1     0 
# After anonymization
summary(data_anonymized[, c("income", "years_in_job")])
     income       years_in_job  
 Min.   : 3576   Min.   : 0.00  
 1st Qu.:37246   1st Qu.: 2.00  
 Median :52156   Median : 4.00  
 Mean   :51130   Mean   : 5.16  
 3rd Qu.:64451   3rd Qu.: 8.00  
 Max.   :89186   Max.   :15.00  
table(data_anonymized$age, useNA = "always")

18-29 30-39 40-49 50-59   60+  <NA> 
   47    43    58    29    23     0 
table(data_anonymized$gender, useNA = "always")

    female       male non-binary       <NA> 
       104         86          4          6 
table(data_anonymized$education, useNA = "always")

doctoral title    high school   trade school     university           <NA> 
             5            106             22             27             40 
table(data_anonymized$plz, useNA = "always")

8xxxx other  <NA> 
  108    90     2 

Age and postal code are now categorical bands, so the numeric summary is no longer meaningful - use frequency tables instead. Income and years in job remain numeric after microaggregation; their medians are very close to the original, but the mean of income in particular is now lower, because the extreme outliers were top-coded.

For gender, the data of 6 non-binary participants was suppressed to achieve the necessary k-anonymity.

For education, all individuals without a degree were recoded to NA; in each of the other categories, a few individuals’ values were suppressed as well. This might be a problem for data utility, especially if our goal was to show that we reached individuals from all educational levels. In that case, I would recommend going back to the local suppression step and prioritizing education so that it remains non-suppressed.

We lost, as intended, a lot of information on plz and age. For our use case, this is okay.

Religion and political opinions

# Before
data_withoutdirectidentifiers %>%
  group_by(religion) %>%
  summarise(across(c(pol_immigration, pol_environment, pol_redistribution, pol_eu_integration), mean, .names = "mean_{.col}"))
# A tibble: 7 × 5
  religion      mean_pol_immigration mean_pol_environment mean_pol_redistribut…¹
  <chr>                        <dbl>                <dbl>                  <dbl>
1 Buddhism                      4                    3.33                   3.33
2 Catholicism                   2.73                 3.07                   2.79
3 Eastern Orth…                 4                    2                      3   
4 Islam                         2.69                 2.92                   3.23
5 Judaism                       1.5                  3.5                    3   
6 None                          2.97                 3.03                   3.17
7 Protestantism                 3.04                 3.2                    3.12
# ℹ abbreviated name: ¹​mean_pol_redistribution
# ℹ 1 more variable: mean_pol_eu_integration <dbl>
# After
data_anonymized %>%
  group_by(religion) %>%
  summarise(across(c(pol_immigration, pol_environment, pol_redistribution, pol_eu_integration), mean, .names = "mean_{.col}"))
# A tibble: 5 × 5
  religion      mean_pol_immigration mean_pol_environment mean_pol_redistribut…¹
  <fct>                        <dbl>                <dbl>                  <dbl>
1 Catholicism                   2.73                 3.07                   2.79
2 Islam                         2.69                 2.92                   3.23
3 None                          2.97                 3.03                   3.17
4 Protestantism                 3.04                 3.2                    3.12
5 Other                         3.17                 3.17                   3.17
# ℹ abbreviated name: ¹​mean_pol_redistribution
# ℹ 1 more variable: mean_pol_eu_integration <dbl>

The group-level means are essentially unchanged between the original and anonymized datasets. The key research question - whether religion predicts political opinion - is still answerable. Note that some small religion categories have been merged into “Other”, which changes the granularity of the analysis but preserves the main patterns. Other than that, we did not change anything about these variables.

Reflection

For the purpose of this dataset - studying the relationship between religion and political opinion - utility is well-preserved. The main variable of interest (religion) is still present, just with fewer fine-grained categories. The political opinion variables are unchanged. Income and years in job show very similar distributions after microaggregation.

Striking the Right Balance

Balancing privacy and utility is not a one-shot decision - it is an iterative process. After measuring both, you may find that your current anonymization is too aggressive (data is well-protected but the key analysis no longer works) or too lenient (utility is preserved but too many records are still unique). In either case, you go back to the anonymization steps and adjust: changing recoding thresholds, relaxing or tightening the k-anonymity target, or choosing a different perturbative technique.

For our dataset, a reasonable workflow looks like this: start by checking the k-anonymity level and disclosure risk on sdc_nonpert (from the chapter on non-perturbative techniques), then check the information loss after applying perturbative techniques on sdc_micro or sdc_noise (from the chapter on perturbative techniques), and finally compare the religion-political opinion analysis on original and anonymized data. If the analysis still holds and k-anonymity is at least 3, the anonymization is likely sufficient for sharing.

There are no universal thresholds for “good enough” privacy or utility. A commonly used rule of thumb is that k-anonymity should be at least 3-5 for most research datasets, meaning every combination of indirect identifiers appears at least 3-5 times (see discussion above). For utility, IL1 values and eigenvalue differences close to 0 indicate low distortion. But ultimately, the right balance depends on your specific context: how sensitive the data is, who will have access, and which analyses need to be supported. When in doubt, consult your institution’s data protection officer (here at LMU), data steward (or a local research data management team, here at LMU), or open science team (here at LMU).

Let’s keep in mind that, while there is a tension between data protection and openness, in practice these principles enable each other: Sharing data builds trust in research findings, while protecting participants builds trust in researchers. When done well, anonymization is the bridge between these goals - it lets you share data openly while honoring the privacy expectations of the people who provided it (Jansen et al. 2025).


References

Jansen, Luisa, Nele Borgert, and Malte Elson. 2025. “On the Tension Between Open Data and Data Protection in Research.” Pre-published April 7. https://doi.org/10.31234/osf.io/5jt3s_v2.
Jansen, Luisa, Tim Ulmann, Robine Jordi, and Malte Elson. 2026. Putting Privacy to the Test: Introducing Red Teaming for Research Data Anonymization. arXiv:2601.19575. arXiv. https://doi.org/10.48550/arXiv.2601.19575.
Lermen, Simon, Daniel Paleka, Joshua Swanson, Michael Aerni, Nicholas Carlini, and Florian Tramèr. n.d. Large-Scale Online Deanonymization with LLMs. https://doi.org/10.48550/ARXIV.2602.16800.
Rocher, Luc, Julien M. Hendrickx, and Yves-Alexandre De Montjoye. 2025. “A Scaling Law to Model the Effectiveness of Identification Techniques.” Nature Communications 16 (1): 347. https://doi.org/10.1038/s41467-024-55296-6.