Non-Perturbative Techniques

Non-perturbative techniques do not change the values of the data; instead, they conceal certain values. This includes all techniques referred to as data masking, suppression, and deletion.

Depending on the specific data type and risk, different techniques are possible (Carvalho et al. 2023):

Examples of Non-Perturbative Techniques


Global Recoding

Also known as generalization

Generalize all values of an attribute into broader categories.

Example: Replace specific countries with world regions (e.g., “Germany”, “France” → “Western Europe”); recode age in years into age in decades.
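
As a minimal base-R sketch, global recoding of age into decade bands can be done with `cut()`. The `age` vector below is hypothetical example data, not part of the tutorial dataset:

```r
# Hypothetical exact ages of five participants
age <- c(23, 37, 41, 58, 62)

# Global recoding: replace every exact age with its decade band
age_decade <- cut(age,
                  breaks = c(20, 30, 40, 50, 60, 70),
                  labels = c("20-29", "30-39", "40-49", "50-59", "60-69"),
                  right  = FALSE)  # intervals are [20, 30), [30, 40), ...

as.character(age_decade)
# "20-29" "30-39" "40-49" "50-59" "60-69"
```

Note that the recoding is applied to every value of the variable, which is what distinguishes global from local recoding.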

Local Recoding

Generalize values of a variable only when needed, not across the entire dataset.

Example: If only a few participants are from Asia, you may recode only their values (e.g., “China”, “India”, “Nepal” → “Asia”), keeping all other values unchanged.

This means that the values of a variable end up on different levels of granularity (in this example: country vs. continent).
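
A minimal base-R sketch of local recoding, using a hypothetical `country` vector:

```r
# Hypothetical country variable with only a few Asian participants
country <- c("Germany", "France", "China", "Germany", "Nepal", "India")

# Generalize only the rare values; keep all other values unchanged
asia <- c("China", "India", "Nepal")
country_recoded <- ifelse(country %in% asia, "Asia", country)

country_recoded
# "Germany" "France" "Asia" "Germany" "Asia" "Asia"
```

The frequent values stay at the country level, while only the rare ones are lifted to the continent level.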

Top-and-Bottom Coding

Applies only to numerical (at least ordinally scaled) variables

Generalize rare extreme values that would allow for the identification of individuals.

Example: Instead of reporting “6 children”, report “4+ children” for anyone above this threshold; all participants below the age of 20 years are coded as “<20 years”
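
Both examples can be sketched in base R with `ifelse()`; the `children` and `age` vectors are hypothetical example data:

```r
# Hypothetical example values
children <- c(0, 1, 2, 6, 3, 8)
age      <- c(18, 19, 25, 34, 17, 52)

# Top coding: report "4+" for anyone with four or more children
children_coded <- ifelse(children >= 4, "4+", as.character(children))

# Bottom coding: everyone below 20 years becomes "<20"
age_coded <- ifelse(age < 20, "<20", as.character(age))

children_coded
# "0" "1" "2" "4+" "3" "4+"
age_coded
# "<20" "<20" "25" "34" "<20" "52"
```

Note that coded values become character strings; for purely numeric capping, `pmin()` and `pmax()` keep the variable numeric.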

How to determine the threshold is not trivial: it can be derived from the data, e.g., by inspecting the distribution and choosing a cutoff (such as an upper quantile) above which values become rare, but there is no universal rule.

Suppression

Remove information entirely (using NA/NaN/*) when it creates risk (also known as nulling). This can occur on different levels:

  • Cell suppression: Remove single risky values (e.g., unique job title).

  • Record suppression: Remove data of one participant (e.g., in case of a very specific combination of demographics).

  • Variable suppression: Drop a variable from the dataset entirely.
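
The three suppression levels can be sketched in base R; the data frame below is hypothetical example data:

```r
# Hypothetical mini dataset
df <- data.frame(
  id        = 1:4,
  job_title = c("Teacher", "Astronaut", "Nurse", "Teacher"),
  age       = c(34, 41, 29, 52),
  plz       = c("80331", "80333", "80335", "80331"),
  stringsAsFactors = FALSE
)

# Cell suppression: null a single risky value (the unique job title)
df$job_title[df$job_title == "Astronaut"] <- NA

# Record suppression: drop one participant entirely
df <- df[df$id != 3, ]

# Variable suppression: drop a whole column
df$plz <- NULL
```

Each level removes more information than the previous one, so cell suppression is usually tried first.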

It can also happen as a result of a condition: Conditional nulling means replacing data entries with null values based on a condition. For example, entries in the column “Customer Feedback” could be nullified if the customer feedback was negative (Raghunathan 2013).
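
A minimal sketch of conditional nulling in base R, using a hypothetical feedback table:

```r
# Hypothetical customer feedback data
feedback <- data.frame(
  customer  = c("A", "B", "C"),
  sentiment = c("positive", "negative", "positive"),
  comment   = c("Great service", "Terrible experience", "Happy overall"),
  stringsAsFactors = FALSE
)

# Conditional nulling: null the comment wherever the sentiment is negative
feedback$comment[feedback$sentiment == "negative"] <- NA
```

Only the entries matching the condition are removed; all other values stay untouched.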

Another example is the replacement of credit card numbers with X characters. Here, character masking can be applied partially: only the first 12 digits are replaced, while the last four stay visible. Further, the overall number of characters stays the same, hence: XXXX XXXX XXXX 1234.
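
Partial character masking of this kind can be sketched with a regular expression in base R; the card number is hypothetical example data:

```r
# Hypothetical 16-digit card number (digits grouped in fours)
card <- "5432 1098 7654 1234"

# Partial character masking: replace the first twelve digits with X,
# keeping the last four digits and the overall length unchanged
masked <- sub("^\\d{4} \\d{4} \\d{4}", "XXXX XXXX XXXX", card)

masked          # "XXXX XXXX XXXX 1234"
nchar(masked)   # same length as the original
```

Keeping the length and format intact is useful when downstream systems validate the field's structure.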

Sampling

  • also known as subsampling

  • especially used when a dataset covers the whole population (e.g., all first-year students at LMU)

  • release a sample of the original dataset

  • sample can be randomly selected, but also based on conditions (e.g., quotas for study subjects)
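
A minimal sketch of random subsampling in base R; the `population` data frame is hypothetical example data:

```r
set.seed(42)  # make the sampling reproducible

# Hypothetical full population of 200 participants
population <- data.frame(
  id      = 1:200,
  subject = sample(c("Psychology", "Biology", "Physics"), 200, replace = TRUE)
)

# Release a random 25% subsample instead of the full dataset
released <- population[sample(nrow(population), size = 0.25 * nrow(population)), ]

nrow(released)  # 50
```

For condition-based sampling (e.g., quotas per study subject), the rows would instead be drawn within each group.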


Pro and Contra Using Non-Perturbative Techniques

Pro:

  • Values that remain in the data are never distorted; analyses on them stay truthful.
  • The techniques are simple to apply, transparent, and easy to document.

Contra:

  • Information loss: broader categories and missing values limit possible analyses.
  • Strong protection may require generalizing or suppressing a large share of values.

Applying Non-Perturbative Techniques With R

  • basic data wrangling operations in R or functions in sdcMicro (better than manual changes)

  • but: when publishing the anonymization script, the script itself needs to be anonymous as well (e.g., it must not name specific countries; see Chapter DOCUMENTATION)

Exercise: Applying Non-Perturbative Techniques


Apply non-perturbative techniques to the data to reduce re-identification risk.

The following sdcMicro functions are available to you:

  • globalRecode() — recodes a key variable into broader groups (e.g., exact age → age band)
  • topBotCoding() — caps extreme values at a given threshold
  • localSuppression() — suppresses individual cells in records that fall below a chosen k-anonymity threshold. We advise against using this function, since it does not report its changes transparently. Instead, you can recode data manually, e.g., using mutate(), or with a function such as fct_lump_min() from forcats, which lumps all categories below a certain group size together.

Look at the key variables in sdc_data (gender, age, education, plz) and decide which technique is appropriate for each. There is no single correct solution — choose thresholds and groupings that make sense given the data, and be prepared to justify your choices. After applying your techniques, compare the new risk summary to the one from the previous exercise.

Tip

The sdcMicro functions operate directly on the sdc object and update risk estimates automatically. Check ?globalRecode, ?topBotCoding, and ?localSuppression for argument details.

The forcats function fct_lump_min() is applied to the dataset itself; the sdc object needs to be rebuilt afterwards.

Solution

One possible approach:

# Recode age into bands
sdc_nonpert <- globalRecode(
  obj    = sdc_data,
  column = "age",
  breaks = c(17, 29, 44, 59, 74, Inf),
  labels = c("18-29", "30-44", "45-59", "60-74", "75+")
)  

print(sdc_nonpert, type = "risk") # check risk after making the change
Risk measures:

Number of observations with higher risk than the main part of the data: 
  in modified data: 0
  in original data: 0
Expected number of re-identifications: 
  in modified data: 194.00 (97.00 %)
  in original data: 200.00 (100.00 %)
# Generalize PLZ (does not work due to variable type)
# sdc_nonpert <- globalRecode(
#   obj    = sdc_data,
#   column = "plz",
#   breaks = c(0, 19999, 39999, 59999, 79999, 99999),
#   labels = c("00–19", "20–39", "40–59", "60–79", "80–99")
# )

# `globalRecode()` updates the key variable inside the `sdc` object directly and recalculates risk. The five regions map loosely to north, north-central, central, south-central, and south Germany.

# Top-code income
sdc_nonpert <- topBotCoding(
  obj         = sdc_nonpert,
  column      = "income",
  value       = quantile(data_withoutdirectidentifiers$income, 0.95),
  replacement = quantile(data_withoutdirectidentifiers$income, 0.95),
  kind        = "top"
)

print(sdc_nonpert, type = "risk") # check risk after making the change
Risk measures:

Number of observations with higher risk than the main part of the data: 
  in modified data: 0
  in original data: 0
Expected number of re-identifications: 
  in modified data: 194.00 (97.00 %)
  in original data: 200.00 (100.00 %)
# Top code years in job
# Inspect the distribution first to pick a sensible threshold
summary(data_withoutdirectidentifiers$years_in_job)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    2.00    5.00    7.34   11.00   38.00 
sdc_nonpert <- topBotCoding(
  obj         = sdc_nonpert,
  column      = "years_in_job",
  value       = 20,        # values above 20 years get capped
  replacement = 20,
  kind        = "top"
)

print(sdc_nonpert, type = "risk")
Risk measures:

Number of observations with higher risk than the main part of the data: 
  in modified data: 0
  in original data: 0
Expected number of re-identifications: 
  in modified data: 194.00 (97.00 %)
  in original data: 200.00 (100.00 %)
# `years_in_job` follows an exponential distribution, so a small number of people will have very high values. Capping at 20 years is a reasonable choice — adjust based on what the actual distribution looks like in your dataset.

# Apply local suppression to reach k = 3
sdc_nonpert <- localSuppression(sdc_nonpert, k = 3)  # Compare risk before and after

print(sdc_nonpert, type = "risk")
Risk measures:

Number of observations with higher risk than the main part of the data: 
  in modified data: 26
  in original data: 0
Expected number of re-identifications: 
  in modified data: 23.94 (11.97 %)
  in original data: 200.00 (100.00 %)
# Collapse rare religions
library(forcats)

# Inspect counts first
table(data_withoutdirectidentifiers$religion)

         Buddhism       Catholicism Eastern Orthodoxy             Islam 
                1                53                 3                13 
          Judaism              None     Protestantism 
                5                86                39 
data_religion_recoded <- data_withoutdirectidentifiers %>%
  mutate(religion = fct_lump_min(religion, min = 10, other_level = "Other"))

table(data_religion_recoded$religion)

  Catholicism         Islam          None Protestantism         Other 
           53            13            86            39             9 
# Rebuild the sdc object with the recoded religion column
sdc_nonpert <- createSdcObj(
  dat     = data_religion_recoded,
  keyVars = c("gender", "age", "education", "plz"),
  numVars = c("income", "years_in_job")
)

# Now apply the previous steps again on the fresh object, then check
print(sdc_nonpert, type = "risk")
Risk measures:

Number of observations with higher risk than the main part of the data: 0
Expected number of re-identifications: 200.00 (100.00 %)

fct_lump_min(religion, min = 10) merges any category with fewer than 10 observations into "Other". This is transparent: you can see exactly which categories were affected by comparing the table() outputs before and after. Groups like “Judaism” or “Buddhism”, which have very few members in this dataset, are natural candidates for collapsing, since a person could potentially be identified through the combination of a rare religion with other attributes. Compare this manual approach to the localSuppression() result above.

Save the sdc object

saveRDS(sdc_nonpert, "../sdc_nonpert.rds")

Learning Objective

After completing this part of the tutorial, you will be able to

  • choose an appropriate non-perturbative technique
  • apply simple non-perturbative techniques

Exercises

  • Choosing and applying non-perturbative techniques: given a dataset with some demographics, anonymize age, gender, country of residence, e-mail addresses, etc. by choosing appropriate techniques. The example solution explains the decisions made and emphasizes that there is no single right answer.

References

Carvalho, Tânia, Nuno Moniz, Pedro Faria, and Luís Antunes. 2023. “Survey on Privacy-Preserving Techniques for Microdata Publication.” ACM Computing Surveys 55 (14s): 1–42. https://doi.org/10.1145/3588765.
Raghunathan, Balaji. 2013. The Complete Book of Data Anonymization: From Planning to Implementation. CRC Press.