Non-perturbative techniques are those that do not change the values of data, but rather conceal certain values. This includes all techniques referred to as data masking, suppression, and deletion.
Depending on the specific data type and risk, different techniques are possible (Carvalho et al. 2023):
Examples of Non-Perturbative Techniques
explain each technique, use research-related terms, and add visualizations
Global Recoding
Also known as generalization
Generalize all values of an attribute into broader categories.
Example: Replace specific countries with world regions (e.g., “Germany”, “France” → “Western Europe”), or recode age in years into age in decades.
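A minimal sketch of global recoding in base R, using invented toy data. The data frame `participants` and the lookup table `region_lookup` are hypothetical and only serve to illustrate the idea; in practice the mapping would come from an established classification.

```r
# Toy data; all values are invented for illustration
participants <- data.frame(
  country = c("Germany", "France", "Austria", "China", "India", "Brazil"),
  age     = c(23, 37, 41, 58, 64, 19),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table mapping every country to a broader region
region_lookup <- c(
  Germany = "Western Europe",
  France  = "Western Europe",
  Austria = "Western Europe",
  China   = "Asia",
  India   = "Asia",
  Brazil  = "South America"
)

# Global recoding: every value of the attribute is replaced by its region
participants$region <- unname(region_lookup[as.character(participants$country)])

# Recode age in years into decades; right = FALSE gives intervals like [20, 30)
participants$age_decade <- cut(
  participants$age,
  breaks = seq(10, 70, by = 10),
  right  = FALSE,
  labels = c("10-19", "20-29", "30-39", "40-49", "50-59", "60-69")
)

# Keep only the generalized variables for release
released <- participants[, c("region", "age_decade")]
print(released)
```

Because the recoding is applied to all records, every value in the released variable sits on the same level of detail.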
Local Recoding
Generalize values of a variable only when needed, not across the entire dataset.
Example: If only few participant are from Asia, you may recode only their values (e.g., “China”, “India”, “Nepal”) to “Asia”, keeping all others unchanged.
This means, that values are on different level (in this example: country vs. continent).
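As a rough sketch, local recoding can be written in base R by generalizing only those categories that fall below a minimum group size. The vector `country`, the lookup `continent_lookup`, and the cutoff `min_count` are assumptions made for this illustration.

```r
# Toy data: many participants from two countries, very few from three others
country <- c(rep("Germany", 12), rep("France", 9), "China", "India", "Nepal")

# Hypothetical lookup from country to continent
continent_lookup <- c(
  China = "Asia", India = "Asia", Nepal = "Asia",
  Germany = "Europe", France = "Europe"
)

min_count <- 5  # assumed minimum group size; choose this based on your own data

# Identify the countries that occur fewer than min_count times
counts <- table(country)
rare   <- names(counts)[counts < min_count]

# Local recoding: only rare countries are replaced by their continent
country_recoded <- ifelse(
  country %in% rare,
  unname(continent_lookup[country]),
  country
)

table(country_recoded)
```

Note that the merged category can still be small (here, “Asia” has only three members), so the group sizes should be checked again after recoding.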
Top-and-Bottom Coding
Applies only to numerical (at least ordinally scaled) variables
Generalize rare extreme values that would allow for the identification of individuals.
Example: Instead of reporting “6 children”, report “4+ children” for anyone at or above this threshold (top coding); code all participants below the age of 20 years as “<20 years” (bottom coding).
The threshold can be derived from the data, but choosing it is not trivial.
Explain how to find the threshold
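One simple heuristic, shown here only as an illustration and not as the recommended method, is to pick the highest threshold for which the open-ended top category still contains at least `k` people. The helper `find_top_threshold()` and the value `k = 3` are hypothetical.

```r
# Toy data: number of children per participant (invented values)
children <- c(0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 5, 6)
k <- 3  # assumed minimum number of people in the top category

# Hypothetical helper: highest value v such that at least k observations are >= v
find_top_threshold <- function(x, k) {
  candidates <- sort(unique(x), decreasing = TRUE)
  for (cand in candidates) {
    if (sum(x >= cand) >= k) {
      return(cand)
    }
  }
  min(x)  # fallback if the data contain fewer than k observations
}

threshold <- find_top_threshold(children, k)

# Top coding: everything at or above the threshold becomes one open category
children_coded <- ifelse(
  children >= threshold,
  paste0(threshold, "+"),
  as.character(children)
)
table(children_coded)

# Bottom coding works the same way at the lower end of the scale
age <- c(18, 19, 19, 22, 25, 31, 40)
age_coded <- ifelse(age < 20, "<20", as.character(age))
table(age_coded)
```

A count-based rule like this looks at one variable only. Whether the resulting category is safe still depends on how it combines with the other key variables.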
Suppression
Remove information entirely (using NA/NaN/*) when it creates risk (also known as nulling). This can occur on different levels:
Cell suppression: Remove single risky values (e.g., unique job title).
Record suppression: Remove data of one participant (e.g., in case of a very specific combination of demographics).
Variable suppression: Drop a variable from the dataset entirely.
Suppression can also be tied to a condition: conditional nulling means replacing data entries with zeros based on a condition. For example, entries in the column “Customer Feedback” could be nulled if the customer feedback was negative (Raghunathan 2013).
Another example is character masking, such as replacing the digits of a credit card number with X. The masking can be partial, so that only the first 12 digits are replaced while the overall number of characters stays the same: XXXX XXXX XXXX 1234.
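The three suppression levels and partial character masking can be sketched in base R as follows. The data frame `survey`, the age cutoff used for record suppression, and the helper `mask_card()` are invented for this illustration.

```r
# Toy data; all values are invented for illustration
survey <- data.frame(
  id        = 1:6,
  job_title = c("Teacher", "Nurse", "Teacher", "Astronaut", "Nurse", "Teacher"),
  age       = c(34, 29, 41, 52, 38, 97),
  email     = paste0("person", 1:6, "@example.org"),
  stringsAsFactors = FALSE
)

# Cell suppression: set job titles that occur only once to NA
title_counts  <- table(survey$job_title)
unique_titles <- names(title_counts)[title_counts == 1]
survey$job_title[survey$job_title %in% unique_titles] <- NA

# Record suppression: remove a participant with a very unusual profile
# (assumed rule for this sketch: age of 90 or above)
survey <- survey[survey$age < 90, ]

# Variable suppression: drop a column entirely
survey$email <- NULL

print(survey)

# Partial character masking: replace every digit that is followed by at least
# four more digits, so only the last four digits remain readable
mask_card <- function(card_number) {
  gsub("[0-9](?=(?:[ ]*[0-9]){4})", "X", card_number, perl = TRUE)
}

mask_card("1234 5678 9012 1234")
```

Using `NA` for suppressed cells keeps the value recognizable as missing in later analyses, whereas a placeholder such as `*` turns a numeric column into text.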
Sampling
also known as subsampling
especially used when a dataset covers the whole population (e.g., all first-year students at LMU)
release a sample of the original dataset
sample can be randomly selected, but also based on conditions (e.g., quotas for study subjects)
Include its own exercise for this technique?
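A short sketch of both sampling variants in base R. The data frame `population`, the sampling fraction of 25 %, and the quota of 10 records per faculty are assumptions chosen for the example.

```r
set.seed(42)  # fixed seed so that the draw is reproducible

# Toy "full population": 200 students in four faculties (invented data)
population <- data.frame(
  id      = 1:200,
  faculty = rep(c("Psychology", "Physics", "Law", "Medicine"), each = 50),
  stringsAsFactors = FALSE
)

# Simple random sample: release 25 % of all records
srs_rows      <- sample(nrow(population), size = 0.25 * nrow(population))
simple_sample <- population[srs_rows, ]

# Quota-based sample: draw 10 records within each faculty
by_faculty   <- split(population, population$faculty)
quota_sample <- do.call(rbind, lapply(by_faculty, function(group) {
  group[sample(nrow(group), size = 10), ]
}))

nrow(simple_sample)
table(quota_sample$faculty)
```

Releasing only a sample adds uncertainty about whether a given person is in the published data at all, which is the protective effect this technique relies on.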
Pro and Contra Using Non-Perturbative Techniques
Add pro and con list
Applying Non-Perturbative Techniques With R
basic data wrangling operations in R or functions in sdcMicro (better than manual changes)
but: when publishing the anonymization script, the script itself must not reveal identifying details (e.g., it should not name specific countries; see Chapter DOCUMENTATION)
Exercise: Applying Non-Perturbative Techniques
TODO Decide on variables
Apply non-perturbative techniques to the data to reduce re-identification risk.
The following sdcMicro functions are available to you:
globalRecode() — recodes a key variable into broader groups (e.g., exact age → age band)
topBotCoding() — caps extreme values at a given threshold
localSuppression() — suppresses individual cells in records that fall below a chosen k-anonymity threshold. I would advise against using this function since it does not report changes transparently. Instead, you can recode data manually, e.g., by using mutate, or by using a function such as fct_lump_min() from forcats. This function lumps all categories below a certain group size together.
Look at the key variables in sdc_data (gender, age, education, plz) and decide which technique is appropriate for each. There is no single correct solution — choose thresholds and groupings that make sense given the data, and be prepared to justify your choices. After applying your techniques, compare the new risk summary to the one from the previous exercise.
Tip
The sdcMicro functions operate directly on the sdc object and update risk estimates automatically. Check ?globalRecode, ?topBotCoding, and ?localSuppression for argument details.
The forcats function fct_lump_min() is applied to the dataset itself, so the sdc object needs to be rebuilt afterwards.
Solution
One possible approach:
# Recode age into bands
sdc_nonpert <- globalRecode(
  obj = sdc_data,
  column = "age",
  breaks = c(17, 29, 44, 59, 74, Inf),
  labels = c("18-29", "30-44", "45-59", "60-74", "75+")
)
print(sdc_nonpert, type = "risk") # check risk after making the change
Risk measures:
Number of observations with higher risk than the main part of the data:
in modified data: 0
in original data: 0
Expected number of re-identifications:
in modified data: 194.00 (97.00 %)
in original data: 200.00 (100.00 %)
# Generalize PLZ (does not work due to variable type)
# sdc_nonpert <- globalRecode(
#   obj = sdc_data,
#   column = "plz",
#   breaks = c(0, 19999, 39999, 59999, 79999, 99999),
#   labels = c("00–19", "20–39", "40–59", "60–79", "80–99")
# )
# `globalRecode()` updates the key variable inside the `sdc` object directly and recalculates risk.
# The five regions map loosely to north, north-central, central, south-central, and south Germany.

# Top-code income
sdc_nonpert <- topBotCoding(
  obj = sdc_nonpert,
  column = "income",
  value = quantile(data_withoutdirectidentifiers$income, 0.95),
  replacement = quantile(data_withoutdirectidentifiers$income, 0.95),
  kind = "top"
)
print(sdc_nonpert, type = "risk") # check risk after making the change
Risk measures:
Number of observations with higher risk than the main part of the data:
in modified data: 0
in original data: 0
Expected number of re-identifications:
in modified data: 194.00 (97.00 %)
in original data: 200.00 (100.00 %)
# Top-code years in job
# Inspect the distribution first to pick a sensible threshold
summary(data_withoutdirectidentifiers$years_in_job)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 2.00 5.00 7.34 11.00 38.00
sdc_nonpert <- topBotCoding(
  obj = sdc_nonpert,
  column = "years_in_job",
  value = 20, # values above 20 years get capped
  replacement = 20,
  kind = "top"
)
print(sdc_nonpert, type = "risk")
Risk measures:
Number of observations with higher risk than the main part of the data:
in modified data: 0
in original data: 0
Expected number of re-identifications:
in modified data: 194.00 (97.00 %)
in original data: 200.00 (100.00 %)
# `years_in_job` follows an exponential distribution, so a small number of people will have very high values.
# Capping at 20 years is a reasonable choice; adjust based on what the actual distribution looks like in your dataset.

# Apply local suppression to reach k = 3
sdc_nonpert <- localSuppression(sdc_nonpert, k = 3)

# Compare risk before and after
print(sdc_nonpert, type = "risk")
Risk measures:
Number of observations with higher risk than the main part of the data:
in modified data: 26
in original data: 0
Expected number of re-identifications:
in modified data: 23.94 (11.97 %)
in original data: 200.00 (100.00 %)
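# mutate() and %>% come from dplyr, fct_lump_min() from forcats
library(dplyr)
library(forcats)

# Lump all religion categories with fewer than 10 observations into "Other"
data_religion_recoded <- data_withoutdirectidentifiers %>%
  mutate(religion = fct_lump_min(religion, min = 10, other_level = "Other"))
table(data_religion_recoded$religion)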
Catholicism Islam None Protestantism Other
53 13 86 39 9
# Rebuild the sdc object with the recoded religion column
sdc_nonpert <- createSdcObj(
  dat = data_religion_recoded,
  keyVars = c("gender", "age", "education", "plz"),
  numVars = c("income", "years_in_job")
)
# Now apply the previous steps again on the fresh object, then check
print(sdc_nonpert, type = "risk")
Risk measures:
Number of observations with higher risk than the main part of the data: 0
Expected number of re-identifications: 200.00 (100.00 %)
fct_lump_min(religion, min = 10) merges any category with fewer than 10 observations into "Other". This is transparent: you can see exactly which categories were affected by comparing the table() outputs before and after. Groups like “Judaism” or “Buddhism”, which have very few members in this dataset, are natural candidates for collapsing, since a person could potentially be identified through the combination of a rare religion and other attributes.
Compare to local suppression using function
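The before/after comparison can be sketched on toy data as follows. The factor `religion` and its counts are invented and do not correspond to the exercise dataset; only `fct_lump_min()` from forcats is taken from the text above.

```r
library(forcats)

# Toy factor with two very small categories (invented counts)
religion <- factor(c(
  rep("None", 40), rep("Catholicism", 25), rep("Protestantism", 20),
  rep("Islam", 12), rep("Judaism", 2), rep("Buddhism", 1)
))

# Lump every category with fewer than 10 observations into "Other"
religion_lumped <- fct_lump_min(religion, min = 10, other_level = "Other")

# Compare the frequency tables before and after
table(religion)
table(religion_lumped)

# Categories that were merged into "Other"
lumped_levels <- setdiff(levels(religion), levels(religion_lumped))
lumped_levels
```

Listing the merged categories is useful for your own records, but keep in mind that publishing such a list would reveal exactly the rare values the recoding is meant to hide.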
Save sdcObject
saveRDS(sdc_nonpert, "../sdc_nonpert.rds")
Learning Objectives
After completing this part of the tutorial, you will be able to:
choose an appropriate non-perturbative technique
apply simple non-perturbative techniques
Exercises
choosing and applying non-perturbative techniques (data set with some demographics, task: anonymize age, gender, country of residence, e-mail addresses, etc., choose appropriate techniques); show example solution (with explanation for decisions and emphasis that there is no one right answer)