Choosing the Right Technique
- Contextual factors are important for risk analysis (callback to contextual integrity):
  - sampling strategy (Guo et al. 2025): Sampling can diminish risks (e.g., random sampling from the whole population) or heighten them (e.g., all first-year students of one subject at one university).
  - release type (Benschop and Welch, n.d.): Very risky data may warrant only partial access.
  - sensitivity level: How sensitive is the data? Health data or political opinions may require stronger protection than general demographics.
  - intended use: What analyses will users run on the anonymized data? If exact values matter (e.g., income regressions), perturbative methods with low noise may be better than heavy generalization.
  - attacker model: Who might try to re-identify individuals, and what external information do they have access to? A dataset about employees of one company faces different risks than a nationally representative survey.
- Factors limiting which techniques you can apply:
  - data type: Are your variables categorical, continuous, or a mix? This narrows down which techniques are applicable.
  - sample size: Smaller datasets are harder to anonymize because individuals are more likely to be unique. Techniques like suppression or sampling may remove too much data.
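The point about sample size and uniqueness can be checked directly on a dataset. The following is a minimal sketch, not a full risk assessment: it counts how many records share each combination of identifying variables and reports how many records are unique. The function names, the variable names, and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def group_sizes(records, identifying_vars):
    """Count how many records share each combination of identifying values."""
    return Counter(tuple(r[v] for v in identifying_vars) for r in records)


def uniqueness_summary(records, identifying_vars):
    """Summarize how many records are unique on the chosen variables."""
    if not records:
        raise ValueError("records must not be empty")
    sizes = group_sizes(records, identifying_vars)
    unique = sum(1 for n in sizes.values() if n == 1)
    return {
        "smallest_group": min(sizes.values()),
        "unique_records": unique,
        "share_unique": unique / len(records),
    }


# Toy data (hypothetical): six employees described by three variables.
records = [
    {"department": "Sales", "job_level": "junior", "years": "0-4"},
    {"department": "Sales", "job_level": "junior", "years": "0-4"},
    {"department": "Sales", "job_level": "senior", "years": "5-9"},
    {"department": "IT", "job_level": "junior", "years": "0-4"},
    {"department": "IT", "job_level": "junior", "years": "0-4"},
    {"department": "IT", "job_level": "senior", "years": "10+"},
]

summary = uniqueness_summary(records, ["department", "job_level", "years"])
print(summary)
```

In this toy example the two senior employees are each unique on the three variables, so the smallest group has size 1. In a small dataset, suppressing or generalizing until no record is unique can remove a large part of the information, which is the trade-off described above.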
Determine more relevant risk factors for making that decision
Create some kind of decision tree/checklist to choose techniques
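As a starting point for such a checklist, the factors listed above could be encoded as simple rules. This is only a sketch under stated assumptions: the function name, its parameters, and especially the `small_n` threshold of 200 are hypothetical placeholders, not recommendations from the cited literature. The rules merely restate the points from the list and do not replace a proper risk analysis.

```python
def checklist_notes(sensitive, open_release, n, exact_values_needed, small_n=200):
    """Return reminders based on the factors discussed in this section.

    small_n is an arbitrary illustrative threshold (assumption), not an
    established cutoff.
    """
    notes = []
    if sensitive and open_release:
        notes.append(
            "Very risky data: consider partial access instead of a fully open release."
        )
    if exact_values_needed:
        notes.append(
            "Exact values matter: low-noise perturbation may be better than heavy generalization."
        )
    if n < small_n:
        notes.append(
            "Small sample: individuals are more likely to be unique; "
            "suppression or sampling may remove too much data."
        )
    return notes


# Hypothetical dataset: sensitive, openly released, 80 records.
notes = checklist_notes(
    sensitive=True, open_release=True, n=80, exact_values_needed=False
)
for note in notes:
    print("-", note)
```

A rule list like this is easy to extend once further risk factors have been determined, and each rule can be traced back to one bullet in the list above.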
Exercise
Below are three fictional datasets. For each, decide which anonymization technique(s) you would apply and why.
1. A national survey (n = 10,000) on voting behavior. Variables: age group, gender, federal state, party preference, income bracket. The data will be published as a fully open dataset.
2. A study on workplace bullying at a mid-sized company (n = 120). Variables: department, job level, years at company, bullying score, mental health score. The data will be shared with other researchers under a data use agreement.
3. A clinical trial dataset (n = 500) with diagnosis codes, treatment group, age, sex, and a rare genetic marker. The data must be deposited in a public repository as required by the funder.
Add solutions
Learning Objective
- After completing this part of the tutorial, you will be able to choose a suitable technique based on your data.
Exercises
- Give short examples of datasets from various contexts and ask for the best anonymization strategy