Non-Perturbative Techniques
Non-perturbative techniques do not distort the data; instead, they reduce the level of detail or conceal certain values. This includes all techniques referred to as data masking, suppression, and deletion.
Add pro and con list?
Depending on the specific data type and risk, different techniques are possible (Carvalho et al. 2023):
Global Recoding
Also known as generalization
Generalize all values of an attribute into broader categories.
Example: Replace specific countries with world regions (e.g., “Germany”, “France” → “Western Europe”), or recode age in years to age in decades.
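A minimal base R sketch of global recoding, assuming a toy data frame with made-up values. The column names, the age bands, and the country-to-region mapping are illustrative assumptions, not a standard classification.

```r
# Toy data; all values are made up for illustration.
survey <- data.frame(
  age     = c(23, 37, 41, 58, 64),
  country = c("Germany", "France", "Germany", "Japan", "France"),
  stringsAsFactors = FALSE
)

# Age in years -> age in decades: [20,30), [30,40), ...
survey$age_decade <- cut(
  survey$age,
  breaks = seq(20, 70, by = 10),
  right  = FALSE,  # intervals are closed on the left, open on the right
  labels = c("20-29", "30-39", "40-49", "50-59", "60-69")
)

# Country -> region via a named lookup vector (illustrative mapping).
region_lookup <- c(Germany = "Western Europe",
                   France  = "Western Europe",
                   Japan   = "East Asia")
survey$region <- unname(region_lookup[survey$country])

# Release only the generalized columns.
released <- survey[, c("age_decade", "region")]
```

Every value of the attribute is recoded here, whether or not it is risky on its own; that is what makes the recoding global.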
Local Recoding
Generalize values of a variable only when needed, not across the entire dataset.
Example: If only a few participants are from Asia, you may recode only their values (e.g., “China”, “India”, “Nepal”) to “Asia”, keeping all others unchanged.
This means that values within the same variable end up at different levels of detail (in this example: country vs. continent).
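Local recoding could be sketched in base R as follows. The data, the `min_count` threshold, and the `continent_lookup` mapping are assumptions chosen for illustration; what counts as "rare" depends on the dataset and the risk assessment.

```r
# Toy data; all values are made up for illustration.
survey <- data.frame(
  country = c("Germany", "Germany", "France", "France", "France",
              "China", "India", "Nepal"),
  stringsAsFactors = FALSE
)

# Hypothetical mapping for the countries that may need recoding.
continent_lookup <- c(China = "Asia", India = "Asia", Nepal = "Asia")

# Assumed threshold: values seen fewer than 2 times count as rare.
min_count <- 2

# For each row, how often does its own country value occur?
counts <- table(survey$country)
n_same <- as.vector(counts[survey$country])

# Recode only rare values that have an entry in the lookup.
is_rare <- n_same < min_count & survey$country %in% names(continent_lookup)
survey$country_recoded <- ifelse(
  is_rare,
  unname(continent_lookup[survey$country]),
  survey$country
)
```

The resulting column mixes country and continent labels, which should be documented for anyone who analyzes the released data.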
Top-and-Bottom Coding
Applies only to variables that are at least ordinally scaled (e.g., numerical variables)
Generalize rare extreme values that would allow for the identification of individuals.
Example: Instead of reporting “6 children”, report “4+ children” for anyone at or above this threshold (top coding); code all participants below the age of 20 years as “<20 years” (bottom coding).
The threshold can be calculated, but this is not trivial; give a recommendation instead?
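A short base R sketch of top and bottom coding, using the thresholds from the example above (4 children, 20 years). The data frame and column names are made up, and the thresholds are taken as given rather than derived.

```r
# Toy data; all values are made up for illustration.
survey <- data.frame(
  children = c(0, 1, 2, 3, 4, 6),
  age      = c(18, 19, 25, 34, 47, 52)
)

# Top coding: everyone with 4 or more children is reported as "4+".
survey$children_coded <- ifelse(survey$children >= 4, "4+",
                                as.character(survey$children))

# Bottom coding: everyone younger than 20 is reported as "<20".
survey$age_coded <- ifelse(survey$age < 20, "<20",
                           as.character(survey$age))
```

Note that the coded columns become character vectors, so they need to be treated as ordered categories in later analyses.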
Suppression
Remove information entirely (e.g., replace it with NA, NaN, or *) when it creates a disclosure risk; this is also known as nulling. Suppression can occur at different levels:
Cell suppression: Remove single risky values (e.g., unique job title).
Record suppression: Remove data of one participant (e.g., in case of a very specific combination of demographics).
Variable suppression: Drop a variable from the dataset entirely.
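The three levels of suppression might look like this in base R. The data are made up, and the choice of which record to drop (the participant with `id` 6) is a hypothetical outcome of a risk assessment, not something the code determines.

```r
# Toy data; all values are made up for illustration.
survey <- data.frame(
  id        = 1:6,
  job_title = c("Teacher", "Nurse", "Lighthouse keeper",
                "Teacher", "Nurse", "Teacher"),
  age       = c(34, 45, 29, 51, 38, 91),
  email     = paste0("person", 1:6, "@example.org"),
  stringsAsFactors = FALSE
)

# Cell suppression: set job titles that occur only once to NA.
title_counts <- table(survey$job_title)
is_unique    <- as.vector(title_counts[survey$job_title]) == 1
survey$job_title[is_unique] <- NA

# Record suppression: drop one participant judged too identifiable
# (hypothetically the participant with id 6).
survey <- survey[survey$id != 6, ]

# Variable suppression: drop a direct identifier entirely.
survey$email <- NULL
```

After record suppression it is worth re-checking frequencies, because removing a row can turn a previously common value into a unique one.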
Suppression can also be applied conditionally: conditional nulling means replacing data entries with null values based on a condition. For example, entries in a column “Customer Feedback” could be nulled if the customer feedback was negative (Raghunathan 2013).
A related technique is character masking, for example replacing the digits of a credit card number with X. Masking can be partial, so that only the first 12 digits are replaced while the overall number of characters stays the same: XXXX XXXX XXXX 1234.
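Conditional nulling and partial character masking can be sketched in base R as below. The data, the column names, and the helper `mask_card()` are hypothetical; the helper assumes the last four characters of each number are digits that should stay visible.

```r
# Toy data; all values are made up for illustration.
feedback <- data.frame(
  rating  = c("positive", "negative", "positive"),
  comment = c("Great service", "Order arrived broken", "Fast delivery"),
  card_no = c("1111 2222 3333 1234", "5555 6666 7777 8888",
              "9999 0000 1111 4321"),
  stringsAsFactors = FALSE
)

# Conditional nulling: set the comment to NA wherever the rating is negative.
feedback$comment[feedback$rating == "negative"] <- NA

# Partial character masking (hypothetical helper): replace every digit with
# "X" except in the last `keep` characters; spaces and length are preserved.
mask_card <- function(x, keep = 4) {
  n     <- nchar(x)
  front <- substr(x, 1, n - keep)
  back  <- substr(x, n - keep + 1, n)
  paste0(gsub("[0-9]", "X", front), back)
}

feedback$card_masked <- mask_card(feedback$card_no)
feedback$card_no     <- NULL  # do not release the unmasked number
```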
Sampling
Also known as subsampling
Especially used when a dataset covers the whole population (e.g., all first-year students at LMU)
Release only a sample of the original dataset
The sample can be selected randomly, but also based on conditions (e.g., quotas for study subjects)
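Both variants of sampling can be sketched with base R's `sample()`. The population, the sample size of 20, and the quota of 10 per group are assumptions for illustration.

```r
set.seed(42)  # fixed seed so the draw is reproducible; the value is arbitrary

# Toy population of 100 records in two groups of unequal size.
population <- data.frame(
  id    = 1:100,
  group = rep(c("A", "B"), times = c(70, 30)),
  stringsAsFactors = FALSE
)

# Simple random sample: release 20 of the 100 records.
random_sample <- population[sample(nrow(population), size = 20), ]

# Quota-based sample: draw a fixed number of records from each group.
quota    <- 10  # assumed quota per group
by_group <- split(population, population$group)
quota_sample <- do.call(rbind, lapply(by_group, function(g) {
  g[sample(nrow(g), size = quota), ]
}))
```

Sampling without replacement (the default of `sample()`) ensures that no record appears twice in the released data.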
Include its own exercise for this technique?
Applying Non-Perturbative Techniques With R
Use basic data wrangling operations in R or functions from the sdcMicro package (better than manual changes).
Caution: if the anonymization script is published, it needs to be anonymous as well (e.g., it must not name the specific countries that were recoded).
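One way to keep a published script free of identifying values is to read the recoding rules from a separate file that is not released. The sketch below uses base R only rather than sdcMicro; the file layout (columns `original` and `recoded`) and all values are assumptions, and the lookup file is written to a temporary path here only so that the example runs on its own.

```r
# Setup for this sketch only: in practice the lookup file would be created
# separately and kept out of the published materials.
lookup_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(original = c("China", "India", "Nepal"), recoded = "Asia"),
  lookup_path,
  row.names = FALSE
)

# Toy data; all values are made up for illustration.
survey <- data.frame(
  country = c("Germany", "China", "France", "Nepal"),
  stringsAsFactors = FALSE
)

# The published part of the script: no country name appears in the code.
lookup <- read.csv(lookup_path, stringsAsFactors = FALSE)
hit    <- match(survey$country, lookup$original)
survey$country <- ifelse(is.na(hit), survey$country, lookup$recoded[hit])
```

The same idea applies to thresholds or record IDs: anything that would reveal a suppressed value belongs in the private input, not in the script.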
Insert exercise here
Learning Objectives
- After completing this part of the tutorial, you will be able to choose an appropriate non-perturbative technique
- After completing this part of the tutorial, you will be able to apply simple non-perturbative techniques
Exercises
First exercise: choose and apply non-perturbative techniques. Given a data set with some demographics, anonymize age, gender, country of residence, e-mail addresses, etc., choosing an appropriate technique for each variable. Show an example solution that explains the decisions and emphasizes that there is no one right answer.
Second exercise: anonymize the script (potentially an exercise for the documentation chapter)