Documentation
- After completing this part of the tutorial, you will know best practices for documenting your anonymization process.
Goals of Documentation of Anonymization
Thorough documentation of your anonymization process serves two purposes:
- It provides an internal record for auditing - if questions arise later about how the data was handled, you can retrace every step.
- It provides transparency for data users, who need to understand what was changed in order to interpret the data correctly.
Internal Documentation
Your internal documentation should describe the full anonymization workflow: which variables were modified, which techniques were applied (and with what parameters), what risk levels were measured before and after, and who made the decisions. Think of it as a lab notebook for your data processing. This document should be stored securely alongside the original data - it is not meant for publication, since it may contain information that could aid re-identification (e.g., the exact thresholds used for top-coding). In our case, this is the full, well-commented R script.
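If you want a structured record in addition to the commented script, one option is a small log that collects one row per anonymization step. The following is a minimal sketch in base R; the helper `log_step()`, the column names, and the example entries are hypothetical and only illustrate the kind of information to record.

```r
# Sketch of an internal anonymization log. All column names and entries
# are hypothetical and only illustrate the kind of record to keep.
log_step <- function(log, variable, technique, parameters, decided_by) {
  entry <- data.frame(
    step       = nrow(log) + 1L,
    variable   = variable,
    technique  = technique,
    parameters = parameters,
    decided_by = decided_by,
    logged_at  = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
    stringsAsFactors = FALSE
  )
  rbind(log, entry)
}

anon_log <- data.frame()  # start with an empty log

anon_log <- log_step(anon_log, "age", "recoding to age bands",
                     "breaks: 18, 30, 40, 50, 60", "data steward")
anon_log <- log_step(anon_log, "religion", "merging rare categories",
                     "minimum count: 10", "data steward")

print(anon_log)
```

Such a log belongs with the original data, not with the release, because it records the exact parameters of each step.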
External Documentation in Data Dictionary
The external documentation is what you share alongside the anonymized dataset. It should include a data dictionary describing every variable in the released dataset, including any changes from the original. For example, if age was recoded from exact years to age bands, the data dictionary should state this clearly. Be careful not to reveal details that could help reverse the anonymization.
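To make the age example concrete, here is a minimal sketch of such a recode using base R's `cut()`. The data frame `survey` and its values are made up for illustration, and the band labels are an assumption; your own script may use different object names and breaks.

```r
# Hypothetical example data; the real dataset is not reproduced here.
survey <- data.frame(age = c(18L, 29L, 30L, 45L, 59L, 60L, 87L))

# Recode exact age into bands. right = FALSE makes the intervals
# left-closed, so 30 falls into "30-39" rather than "18-29".
survey$age_band <- cut(
  survey$age,
  breaks = c(18, 30, 40, 50, 60, Inf),
  labels = c("18-29", "30-39", "40-49", "50-59", "60+"),
  right  = FALSE
)

# Tabulate the bands to check the recode before documenting it.
table(survey$age_band)
```

The data dictionary entry then only needs to name the old and new format (exact years, age bands) and list the band labels.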
If you publish your anonymization scripts (which is good practice for reproducibility), review them carefully. Scripts can accidentally reveal information about suppressed or recoded values - for example, a line like `filter(country == "Liechtenstein")` tells an attacker that someone from Liechtenstein was in the original data. Use generic variable references where possible.
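As a rough illustration of a generic reference, the sketch below merges rare categories by their frequency instead of naming them. The data frame `survey`, the placeholder categories, and the threshold `min_count` are assumptions made for this example only.

```r
# Hypothetical example data (category labels are placeholders).
survey <- data.frame(
  country = c(rep("A", 12), rep("B", 15), rep("C", 2), "D"),
  stringsAsFactors = FALSE
)

# Revealing: hard-codes a value that occurred in the original data.
# survey$country[survey$country == "C"] <- "Other"

# Generic: merge every category below a minimum count, without naming any.
min_count <- 10  # assumed threshold, for illustration only
counts <- table(survey$country)
rare <- names(counts)[counts < min_count]
survey$country[survey$country %in% rare] <- "Other"

table(survey$country)
```

The threshold itself can also be sensitive. If it is, one option is to read it from a configuration file that stays with the internal documentation rather than writing it into the published script.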
Exercise
Update the data dictionary from the chapter on personal data to reflect the anonymization steps you applied in this tutorial.
The following variables were removed or modified during anonymization:
| Variable Name | Status | Change |
|---|---|---|
| name | Removed | Direct identifier - dropped before release |
| email | Removed | Direct identifier - dropped before release |
| ip_address | Removed | Direct identifier - dropped before release |
| id | Unchanged | |
| age | Modified | Recoded from exact age in years (Integer; 18-100) to age band (Factor; “18-29”, “30-39”, “40-49”, “50-59”, “60+”) |
| plz | Modified | Recoded from full 5-digit postal code to regional indicator (Factor; “8xxxx” for Munich/southwest German region, “other” for all other codes); local suppression of 2 values that violate 3-anonymity |
| religion | Modified | Categories with fewer than 10 observations merged into “Other” (Factor; “Catholicism”, “Protestantism”, “Islam”, “None”, “Other”) |
| income | Modified | Values replaced with group means via microaggregation; individual values no longer correspond to original responses (Integer) |
| years_in_job | Modified | Values replaced with group means via microaggregation; individual values no longer correspond to original responses (Integer) |
| gender | Modified | Local suppression of 6 values that violate 3-anonymity |
| education | Modified | Local suppression of 40 values that violate 3-anonymity |
| pol_immigration | Unchanged | |
| pol_environment | Unchanged | |
| pol_redistribution | Unchanged | |
| pol_eu_integration | Unchanged | |
Note that the anonymization script used to produce this dataset is available alongside the data. Be aware that scripts themselves can reveal information about the original data (e.g., thresholds used for top-coding, or country names referenced in recoding steps). The released script uses generic category labels where possible to avoid this.