After completing this part of the tutorial, you will be able to choose an appropriate non-perturbative technique
After completing this part of the tutorial, you will be able to apply simple non-perturbative techniques
Non-perturbative techniques reduce re-identification risk without distorting the underlying values - they either remove information or make it less specific. What stays in the dataset is still true; there is just less of it. This is what sets them apart from perturbative techniques, which add noise or swap values to obscure the original data.
Non-perturbative techniques include the techniques commonly referred to as data masking, suppression, and deletion.
Non-perturbative techniques are usually the right starting point. They are easy to explain, easy to document, and easy to verify. For many research datasets, they are enough on their own.
Depending on the specific data type and risk, different techniques are possible (Carvalho et al. 2023):
Examples of Non-Perturbative Techniques
Global Recoding
Global recoding (also called generalization) replaces the exact values of a variable with broader categories - applied to every row in the dataset.
Examples
Replacing specific countries with world regions (e.g., “Germany”, “France” → “Western Europe”).
Recoding age in years to age in decades.
The key decision is how broadly to generalize. Wider categories reduce risk more, but also lose more information. The right granularity depends on how many people share each combination - which is exactly what k-anonymity measures. You will practice this trade-off in the exercise below.
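As a minimal sketch (with made-up ages, not the tutorial dataset), global recoding into decade-wide bands can be done with base R's `cut()`:

```r
# Hypothetical ages - not the tutorial dataset
age <- c(18, 23, 27, 31, 34, 45, 52, 58, 61, 67)

# Global recoding: replace the exact age with a decade-wide band in every row
age_band <- cut(
  age,
  breaks = c(17, 29, 39, 49, 59, Inf),
  labels = c("18-29", "30-39", "40-49", "50-59", "60+")
)

# The group sizes show how many people share each band
table(age_band)
```

Checking the group sizes after recoding is exactly the k-anonymity question in miniature: the smallest band tells you how exposed the rarest combination still is.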
Local Recoding
Local recoding generalizes values only where needed, rather than across the whole dataset. This is useful when only a small subset of values is rare enough to create risk.
Example: If only a few participants are from Asia, you may recode only their values (e.g., “China”, “India”, “Nepal”) to “Asia”, keeping all others unchanged.
This means that values end up at different levels of generality (in this example: country vs. continent). Compared to global recoding, local recoding is more targeted and loses less information, but it can be harder to justify and document consistently - you need to be transparent about which categories were collapsed and why.
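A sketch of local recoding in base R, using hypothetical country values: only the rare Asian countries are generalized, while all other values stay at the country level.

```r
# Hypothetical country values - not the tutorial dataset
country <- c("Germany", "Germany", "France", "France", "Spain",
             "China", "India", "Nepal")

# Local recoding: generalize only the rare values, keep all others unchanged
country_recoded <- ifelse(
  country %in% c("China", "India", "Nepal"),
  "Asia",
  country
)
table(country_recoded)
```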
Top-and-Bottom Coding
Top-and-bottom coding is a special case of recoding that applies to numerical variables. Instead of grouping all values into bands, you cap the extreme ends of the distribution - replacing any value above a threshold with that threshold (top coding), or below it (bottom coding). This is useful when the bulk of your data is unremarkable, but a few unusual values are risky because they belong to a small number of people. In our dataset, very high incomes might apply to only one or two participants, making them easy to single out. Capping income at the 95th percentile, for example, means the highest earners all share the same reported value.
Examples
Instead of reporting “6 children”, report “4+ children” for anyone with more than four children.
All participants below the age of 20 years are coded as “<20 years”.
How to determine the threshold: A common approach is to look at the distribution of the variable and choose a threshold where the remaining values above (or below) it are too few to guarantee anonymity. For example, if only 3 people in your dataset are older than 80, you might top-code at 80. You can also use percentiles - capping at the 95th or 99th percentile is a common starting point. The right threshold depends on your data and the level of k-anonymity you want to achieve.
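A minimal sketch of top coding at the 95th percentile in base R, using made-up incomes with two extreme outliers:

```r
# Hypothetical incomes with two extreme outliers
income <- c(25000, 30000, 32000, 35000, 38000, 40000, 42000, 45000,
            48000, 50000, 52000, 55000, 58000, 60000, 65000, 70000,
            75000, 80000, 250000, 400000)

# Threshold: the 95th percentile of the observed distribution
cap <- unname(quantile(income, probs = 0.95))

# Top coding: every value above the cap is replaced by the cap itself
income_top <- pmin(income, cap)

# The highest earners now all share the same reported value
max(income_top)
```

`pmin()` caps only the values above the threshold; everything below it stays untouched, which is what distinguishes top coding from recoding the whole variable into bands.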
Suppression
Suppression means removing information entirely - replacing a value with a missing value (NA) or dropping it altogether. This is also known as nulling. It is the most straightforward way to handle data that is too risky to release, but it also loses the most information.
Suppression can occur on different levels:
Cell suppression: Remove a single risky value while keeping the rest of the record. For example, if one participant has a unique job title that makes them identifiable, you can set only that cell to NA.
Record suppression: Remove an entire participant’s data if their combination of attributes is so unique that no other technique can adequately protect them.
Variable suppression: Drop a whole variable from the released dataset if it is too risky to include at all. Direct identifiers like name and email address are a clear case - we already did this in the chapter on personal data.
Example: Replacing credit card numbers with X characters. Character masking can be partial: in “XXXX XXXX XXXX 1234”, only the first 12 digits are replaced, and the overall number of characters stays the same.
Suppression is often used as a last resort after recoding: if recoding alone is not enough to bring all records up to the required k-anonymity threshold, targeted cell suppression can close the remaining gaps.
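Targeted cell suppression can be sketched in base R with toy data: any job title that occurs only once is set to NA, while the rest of each record is kept.

```r
# Toy data - a unique job title makes one record identifiable
df <- data.frame(
  id  = 1:5,
  job = c("Teacher", "Teacher", "Astronaut", "Nurse", "Nurse")
)

# Count how often each job title occurs
n_per_job <- table(df$job)

# Cell suppression: set values that occur fewer than 2 times to NA
rare_jobs <- names(n_per_job)[n_per_job < 2]
df$job[df$job %in% rare_jobs] <- NA
df
```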
Sampling
If your dataset covers an entire population - for example, all students enrolled in a particular program, or all employees of a specific company - then releasing it in full means that anyone who knows a person was part of that group can try to find their record. Participation knowledge would then be a given if the recruitment strategy is known. Releasing only a random sample of the data breaks this assumption: an attacker can no longer be certain whether a given individual is even in the released dataset.
A sample can be randomly selected, but also based on conditions (e.g., quotas for study subjects).
Sampling is most relevant for administrative datasets and registers. For typical survey-based research, where participation is voluntary and not exhaustively documented, it is less commonly needed. If you do use it, keep in mind that the sample size affects both privacy and the statistical representativeness of the data.
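Releasing a random subset instead of the full population can be sketched in base R:

```r
set.seed(42)  # make the random draw reproducible

# Toy "full population" data
full_data <- data.frame(id = 1:200, value = rnorm(200))

# Release only an 80% simple random sample: an attacker can no longer be
# certain that any given individual is in the released file
released <- full_data[sample(nrow(full_data), size = 0.8 * nrow(full_data)), ]
nrow(released)
```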
Pros and Cons of Using Non-Perturbative Techniques
Pros
original data values are preserved (no distortion of statistics within remaining cells)
straightforward to implement and explain
easy to document
Cons
information is lost (removed or coarsened)
heavy suppression can make the dataset hard to use
may not be sufficient on its own for high-risk data
Exercise: Applying Non-Perturbative Techniques
In this exercise, you will apply non-perturbative techniques to the data to reduce re-identification risk.
The following sdcMicro functions are available to you:
globalRecode() recodes a key variable into broader groups (e.g., exact age → age band)
topBotCoding() caps extreme values at a given threshold
localSuppression() suppresses individual cells in records that fall below a chosen k-anonymity threshold. Use this function with care: it does a good job of finding records that are still unique, but it protects them by deleting data points (cell suppression), so it is more of a last resort. A summary of the last suppression step appears in the summary of the sdcObject.
In addition, you can recode data manually, e.g., by using mutate(), or with a function such as fct_lump_min() from forcats, which lumps together all categories below a certain group size.
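A small example of fct_lump_min() on a hypothetical factor (this requires the forcats package):

```r
library(forcats)

# Hypothetical factor with two rare categories
country <- factor(c("Germany", "Germany", "Germany",
                    "France", "France", "China", "Nepal"))

# Lump every category with fewer than 2 observations into "Other"
fct_lump_min(country, min = 2, other_level = "Other")
```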
Look at the key variables in sdc_data (gender, age, education, plz) and decide which technique is appropriate for each. There is no single correct solution - choose thresholds and groupings that make sense given the data, and be prepared to justify your choices.
Apply these techniques using the functions mentioned above.
Compare the new risk summary to the one from the previous exercise.
Tip
The sdcMicro functions operate directly on the sdc object and update risk estimates automatically. Check ?globalRecode, ?topBotCoding, and ?localSuppression for argument details.
Manual changes, such as those made with forcats functions like fct_lump_min(), are applied to the dataset itself; the sdcObject needs to be updated afterward (e.g., with calcRisks()).
Solution
Keep in mind that this is just an example solution - there are many ways to achieve similar levels of privacy.
I started by applying global recoding to age:
sdc_nonpert <- globalRecode(
  obj = sdc_data,
  column = "age",
  breaks = c(17, 29, 39, 49, 59, Inf),  # define breaks
  labels = c("18-29", "30-39", "40-49", "50-59", "60+")
)
sdc_nonpert  # check summary after making the change
The input dataset consists of 200 rows and 12 variables.
--> Categorical key variables: gender, age, education, plz
--> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:
Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
Key Variable Number of categories Mean size
<char> <char> <char> <char> <char>
gender 3 (3) 66.667 (66.667)
age 5 (52) 40.000 (3.846)
education 5 (5) 40.000 (40.000)
plz 152 (152) 1.316 (1.316)
Size of smallest (>0)
<char> <char>
10 (10)
23 (1)
9 (9)
1 (1)
Infos on 2/3-Anonymity:
Number of observations violating
- 2-anonymity: 180 (90.000%) | in original data: 198 (99.000%)
- 3-anonymity: 200 (100.000%) | in original data: 200 (100.000%)
- 5-anonymity: 200 (100.000%) | in original data: 200 (100.000%)
----------------------------------------------------------------------
Numerical key variables: income, years_in_job
Disclosure risk (~100.00% in original data):
modified data: [0.00%; 100.00%]
Current Information Loss in modified data (0.00% in original data):
IL1: 0.00
Difference of Eigenvalues: 0.000%
----------------------------------------------------------------------
print(sdc_nonpert, type = "risk")  # check risk after making the change
Risk measures:
Number of observations with higher risk than the main part of the data:
in modified data: 0
in original data: 0
Expected number of re-identifications:
in modified data: 190.00 (95.00 %)
in original data: 199.00 (99.50 %)
I divided age into five bands of roughly ten years each. This yields a mean of 40 persons per band, while the smallest band still contains 23. These breaks can be changed later if more privacy or more utility is needed.
We see that the k-anonymity values have barely improved through the recoding of age so far - this is primarily due to the plz variable, which is unique for most participants. I therefore decided to generalize the postal code.
As a first step, I used globalRecode() to group postal codes into five broad regions:
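The call itself is not shown in the rendered output; a plausible reconstruction - assuming plz is stored as a five-digit number from 0 to 99999, which is an assumption on my part - would be:

```r
# Reconstruction (assumption): group postal codes into five broad regions
sdc_nonpert <- globalRecode(
  obj = sdc_nonpert,
  column = "plz",
  breaks = c(-Inf, 19999, 39999, 59999, 79999, 99999),
  labels = c("00-19", "20-39", "40-59", "60-79", "80-99")
)
sdc_nonpert  # check summary after making the change
```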
The input dataset consists of 200 rows and 12 variables.
--> Categorical key variables: gender, age, education, plz
--> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:
Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
Key Variable Number of categories Mean size
<char> <char> <char> <char> <char>
gender 3 (3) 66.667 (66.667)
age 5 (52) 40.000 (3.846)
education 5 (5) 40.000 (40.000)
plz 5 (152) 40.000 (1.316)
Size of smallest (>0)
<char> <char>
10 (10)
23 (1)
9 (9)
21 (1)
Infos on 2/3-Anonymity:
Number of observations violating
- 2-anonymity: 65 (32.500%) | in original data: 198 (99.000%)
- 3-anonymity: 105 (52.500%) | in original data: 200 (100.000%)
- 5-anonymity: 149 (74.500%) | in original data: 200 (100.000%)
----------------------------------------------------------------------
Numerical key variables: income, years_in_job
Disclosure risk (~100.00% in original data):
modified data: [0.00%; 100.00%]
Current Information Loss in modified data (0.00% in original data):
IL1: 0.00
Difference of Eigenvalues: 0.000%
----------------------------------------------------------------------
print(sdc_nonpert, type = "risk")
Risk measures:
Number of observations with higher risk than the main part of the data:
in modified data: 0
in original data: 0
Expected number of re-identifications:
in modified data: 107.00 (53.50 %)
in original data: 199.00 (99.50 %)
The new categories map loosely to north-east (00-19), north-central (20-39), west (40-59), south-west (60-79), and south-east (80-99) Germany.
Looking at k-anonymity now, we see that there are still many unique observations. To reduce risk further, I will generalize plz even more. For this dataset, I mostly want to use plz to distinguish whether participants live in the Munich area or elsewhere in Germany. Postal codes starting with 8 cover roughly the areas surrounding Munich, so I will collapse plz into just two categories: "8xxxx" and "other".
globalRecode() cannot do this directly since cut() only handles continuous intervals - and “other” spans two separate ranges (00000-79999 and 90000-99999). Instead, I directly modify the data stored in the @manipKeyVars slot of the sdcObject and recalculate risks:
library(forcats)
sdc_nonpert@manipKeyVars$plz <- fct_collapse(
  sdc_nonpert@manipKeyVars$plz,
  "8xxxx" = c("80-99"),
  "other" = c("00-19", "20-39", "40-59", "60-79")
)
# After these manual changes, you have to update the risks on the sdcObject
sdc_nonpert <- calcRisks(sdc_nonpert)
sdc_nonpert
The input dataset consists of 200 rows and 12 variables.
--> Categorical key variables: gender, age, education, plz
--> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:
Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
Key Variable Number of categories Mean size
<char> <char> <char> <char> <char>
gender 3 (3) 66.667 (66.667)
age 5 (52) 40.000 (3.846)
education 5 (5) 40.000 (40.000)
plz 2 (152) 100.000 (1.316)
Size of smallest (>0)
<char> <char>
10 (10)
23 (1)
9 (9)
91 (1)
Infos on 2/3-Anonymity:
Number of observations violating
- 2-anonymity: 32 (16.000%) | in original data: 198 (99.000%)
- 3-anonymity: 60 (30.000%) | in original data: 200 (100.000%)
- 5-anonymity: 105 (52.500%) | in original data: 200 (100.000%)
----------------------------------------------------------------------
Numerical key variables: income, years_in_job
Disclosure risk (~100.00% in original data):
modified data: [0.00%; 100.00%]
Current Information Loss in modified data (0.00% in original data):
IL1: 0.00
Difference of Eigenvalues: 0.000%
----------------------------------------------------------------------
print(sdc_nonpert, type = "risk")
Risk measures:
Number of observations with higher risk than the main part of the data:
in modified data: 32
in original data: 0
Expected number of re-identifications:
in modified data: 74.00 (37.00 %)
in original data: 199.00 (99.50 %)
Currently, we are down to only 32 unique persons in our dataset and 60 persons that violate 3-anonymity. I don’t want to generalize the keyVars any further, and for that reason decide to apply local suppression.
sdc_nonpert <- localSuppression(
  sdc_nonpert,
  k = 2,  # desired k-anonymity level; k = 2 is the default
  importance = c(2, 1, 4, 3)  # order of importance; the algorithm begins suppressing on the variable marked as 4
)
sdc_nonpert
The input dataset consists of 200 rows and 12 variables.
--> Categorical key variables: gender, age, education, plz
--> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:
Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
Key Variable Number of categories Mean size
<char> <char> <char> <char> <char>
gender 4 (3) 66.000 (66.667)
age 5 (52) 40.000 (3.846)
education 6 (5) 34.400 (40.000)
plz 3 (152) 99.000 (1.316)
Size of smallest (>0)
<char> <char>
8 (10)
23 (1)
2 (9)
90 (1)
Infos on 2/3-Anonymity:
Number of observations violating
- 2-anonymity: 0 (0.000%) | in original data: 198 (99.000%)
- 3-anonymity: 16 (8.000%) | in original data: 200 (100.000%)
- 5-anonymity: 59 (29.500%) | in original data: 200 (100.000%)
----------------------------------------------------------------------
Numerical key variables: income, years_in_job
Disclosure risk (~100.00% in original data):
modified data: [0.00%; 100.00%]
Current Information Loss in modified data (0.00% in original data):
IL1: 0.00
Difference of Eigenvalues: 0.000%
----------------------------------------------------------------------
Risk measures:
Number of observations with higher risk than the main part of the data:
in modified data: 16
in original data: 0
Expected number of re-identifications:
in modified data: 40.23 (20.12 %)
in original data: 199.00 (99.50 %)
The summary of the sdcObject shows us what the last local suppression step has achieved: It suppressed 28 values on the education variable and 2 each on gender and postal code. Now, 2-anonymity is reached for all participants. 16 entries violate 3-anonymity; I will try local suppression at k = 3 to see how much more suppression would be necessary to achieve this.
sdc_nonpert <- localSuppression(
  sdc_nonpert,
  k = 3,  # desired k-anonymity level (the default would be k = 2)
  importance = c(2, 1, 4, 3)  # order of importance; the algorithm begins suppressing on the variable marked as 4
)
sdc_nonpert
The input dataset consists of 200 rows and 12 variables.
--> Categorical key variables: gender, age, education, plz
--> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:
Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
Key Variable Number of categories Mean size
<char> <char> <char> <char> <char>
gender 4 (3) 64.667 (66.667)
age 5 (52) 40.000 (3.846)
education 5 (5) 32.000 (40.000)
plz 3 (152) 99.000 (1.316)
Size of smallest (>0)
<char> <char>
4 (10)
23 (1)
5 (9)
90 (1)
Infos on 2/3-Anonymity:
Number of observations violating
- 2-anonymity: 0 (0.000%) | in original data: 198 (99.000%)
- 3-anonymity: 0 (0.000%) | in original data: 200 (100.000%)
- 5-anonymity: 29 (14.500%) | in original data: 200 (100.000%)
----------------------------------------------------------------------
Numerical key variables: income, years_in_job
Disclosure risk (~100.00% in original data):
modified data: [0.00%; 100.00%]
Current Information Loss in modified data (0.00% in original data):
IL1: 0.00
Difference of Eigenvalues: 0.000%
----------------------------------------------------------------------
Risk measures:
Number of observations with higher risk than the main part of the data:
in modified data: 29
in original data: 0
Expected number of re-identifications:
in modified data: 30.27 (15.13 %)
in original data: 199.00 (99.50 %)
With 12 more suppressions on education and 4 on gender, we achieve 3-anonymity. Only 29 entries remain that violate 5-anonymity. To me, this is enough protection in terms of k-anonymity, while still preserving most of the utility, especially on the most interesting variables.
Further Anonymization Steps on the Numerical Key Variables
Now that the key variables are in their final state, I applied top-coding to income and years in job.
My goal with income is to protect the few individuals with a very high income. I start by inspecting the distribution:
I see a few extreme outliers above 400 000 €. Choosing the 95th percentile, at about 89 186 €, cuts these off. I’ll use topBotCoding() with this value as the threshold, replacing all higher values with it.
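The call is not shown in the rendered output; a sketch using topBotCoding(), assuming the threshold is computed from the manipulated numerical key variables in the sdcObject:

```r
# Threshold (assumption): 95th percentile of the current income values
income_cap <- unname(quantile(sdc_nonpert@manipNumVars$income, probs = 0.95))

# Top coding: replace every income above the cap with the cap itself
sdc_nonpert <- topBotCoding(
  obj = sdc_nonpert,
  column = "income",
  value = income_cap,
  replacement = income_cap,
  kind = "top"
)
sdc_nonpert  # check summary after making the change
```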
The input dataset consists of 200 rows and 12 variables.
--> Categorical key variables: gender, age, education, plz
--> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:
Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
Key Variable Number of categories Mean size
<char> <char> <char> <char> <char>
gender 4 (3) 64.667 (66.667)
age 5 (52) 40.000 (3.846)
education 5 (5) 32.000 (40.000)
plz 3 (152) 99.000 (1.316)
Size of smallest (>0)
<char> <char>
4 (10)
23 (1)
5 (9)
90 (1)
Infos on 2/3-Anonymity:
Number of observations violating
- 2-anonymity: 0 (0.000%) | in original data: 198 (99.000%)
- 3-anonymity: 0 (0.000%) | in original data: 200 (100.000%)
- 5-anonymity: 29 (14.500%) | in original data: 200 (100.000%)
----------------------------------------------------------------------
Numerical key variables: income, years_in_job
Disclosure risk (~100.00% in original data):
modified data: [0.00%; 95.00%]
Current Information Loss in modified data (0.00% in original data):
IL1: 346.87
Difference of Eigenvalues: 6.600%
----------------------------------------------------------------------
Risk measures:
Number of observations with higher risk than the main part of the data:
in modified data: 29
in original data: 0
Expected number of re-identifications:
in modified data: 30.27 (15.13 %)
in original data: 199.00 (99.50 %)
For years in job, I start by familiarizing myself with the data:
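The inspection and coding steps are not shown in the output; a sketch, assuming the same approach as for income (top coding at the 95th percentile):

```r
# Threshold (assumption): 95th percentile of the current years_in_job values
years_cap <- unname(quantile(sdc_nonpert@manipNumVars$years_in_job, probs = 0.95))

# Top coding: replace every value above the cap with the cap itself
sdc_nonpert <- topBotCoding(
  obj = sdc_nonpert,
  column = "years_in_job",
  value = years_cap,
  replacement = years_cap,
  kind = "top"
)
sdc_nonpert  # check summary after making the change
```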
The input dataset consists of 200 rows and 12 variables.
--> Categorical key variables: gender, age, education, plz
--> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:
Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
Key Variable Number of categories Mean size
<char> <char> <char> <char> <char>
gender 4 (3) 64.667 (66.667)
age 5 (52) 40.000 (3.846)
education 5 (5) 32.000 (40.000)
plz 3 (152) 99.000 (1.316)
Size of smallest (>0)
<char> <char>
4 (10)
23 (1)
5 (9)
90 (1)
Infos on 2/3-Anonymity:
Number of observations violating
- 2-anonymity: 0 (0.000%) | in original data: 198 (99.000%)
- 3-anonymity: 0 (0.000%) | in original data: 200 (100.000%)
- 5-anonymity: 29 (14.500%) | in original data: 200 (100.000%)
----------------------------------------------------------------------
Numerical key variables: income, years_in_job
Disclosure risk (~100.00% in original data):
modified data: [0.00%; 92.00%]
Current Information Loss in modified data (0.00% in original data):
IL1: 495.71
Difference of Eigenvalues: 6.200%
----------------------------------------------------------------------
Risk measures:
Number of observations with higher risk than the main part of the data:
in modified data: 29
in original data: 0
Expected number of re-identifications:
in modified data: 30.27 (15.13 %)
in original data: 199.00 (99.50 %)
These techniques reduce the upper bound of the disclosure risk for the numerical key variables to 92%, compared to ~100% in the original data.