Welcome to the Data Anonymization Course!

This website is a work in progress. Please come back later in 2026 :)

Tutorial Overview

This self-paced tutorial on anonymization of quantitative research data is intended to take about three hours to complete.

The tutorial is split into the following sections:

FOUNDATIONS OF DATA PROTECTION covers the ethical and legal basics of data protection, data protection mechanisms in research, and key terms.
DATA ANONYMIZATION PROCESS walks you through anonymizing research data, using example data.
BALANCING DATA PROTECTION AND OPENNESS presents methods for reconciling data protection with open science goals.
ANONYMIZATION WORKFLOW closes the tutorial with a summary of the workflow you have learned.

What You’ll Learn

By the end of this tutorial, you will be able to:

Understand key privacy concepts (e.g., anonymization, k-anonymity)
Classify data into categories relevant for data protection (e.g., personal data, sensitive data)
Apply anonymization techniques using R in a coherent workflow
Make informed decisions when balancing the risks and utility of the anonymized data
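
To make the term k-anonymity concrete: a dataset is k-anonymous when every combination of quasi-identifier values is shared by at least k records. The base R sketch below illustrates this on a made-up toy table, not on the tutorial's example data; the column names and values are assumptions chosen purely for illustration.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  gender    = c("female", "female", "male", "male", "male", "female"),
  age_group = c("18-29", "18-29", "30-49", "30-49", "30-49", "50+")
)

# Count how many records share each combination of quasi-identifier values
class_sizes <- as.data.frame(table(toy$gender, toy$age_group),
                             stringsAsFactors = FALSE)
names(class_sizes) <- c("gender", "age_group", "n")
class_sizes <- class_sizes[class_sizes$n > 0, ]  # drop combinations that do not occur

# k is the size of the smallest group of identical-looking records
k <- min(class_sizes$n)
k
```

Here k is 1, because the one woman aged 50+ is unique in the toy table, so it is not even 2-anonymous.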

What You Will NOT Learn

This tutorial covers only the anonymization of quantitative tabular data. Other data types are out of scope.

Here are a few helpful links for other data types:

Anonymizing neuroimaging metadata (article)
Anonymizing qualitative textual data with the tool QualiAnon (video tutorial)
Anonymizing sensitive qualitative data (lecture recording and article)

Prerequisites

You need basic R skills (e.g., loading data and packages). Experience with data wrangling in the tidyverse is beneficial.

Required Software

To follow along with the hands-on exercises, you will need a recent version of R (version 4.1.0 or newer) and RStudio.

You will also need a handful of R packages. Install them with:

install.packages("sdcMicro")  # measure re-identification risk and apply anonymization techniques
install.packages("tidyverse") # data wrangling and plotting, used throughout the exercises
install.packages("here")      # locate your data file regardless of the working directory
install.packages("knitr")     # render tables in the tutorial's examples
install.packages("xfun")      # small helpers used in the tutorial's examples
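
If you are unsure whether your setup is complete, a quick check along the following lines may help. This is a minimal sketch in base R; the variable names are illustrative, and the package list simply repeats the install commands above.

```r
# Packages named in the install commands above
required <- c("sdcMicro", "tidyverse", "here", "knitr", "xfun")

# Does the running R session meet the minimum version stated above?
r_ok <- getRversion() >= "4.1.0"

# Which of the required packages are not installed yet?
not_installed <- setdiff(required, rownames(utils::installed.packages()))

if (!r_ok) message("Please update R to version 4.1.0 or newer.")
if (length(not_installed) > 0) {
  message("Still to install: ", paste(not_installed, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```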

The Example Dataset

Throughout the tutorial, we work with the same simulated dataset of 200 German survey respondents. The dataset is designed to examine whether certain political opinions are linked to religion. We will assess its privacy risks and then anonymize it step by step across the exercises. Download it here:

⬇️ Download SimulatedData.csv

Once you have downloaded the file, copy the following code into a code chunk of a new R Markdown file in RStudio to import the data for further analysis:

# Load the downloaded data file (adjust the path to where you saved it)
# colClasses keeps plz as text so that leading zeros in postal codes survive
data <- read.csv(here::here("SimulatedData.csv"), colClasses = c(plz = "character"))
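
One thing worth checking after the import: read.csv() guesses column types, so a postal code column that contains only digits is read as a number and loses any leading zero (German postal codes such as 01067 start with one). The sketch below demonstrates this on two made-up rows passed in through the text argument, and shows colClasses as one way to keep plz as text. Whether SimulatedData.csv is actually affected is an assumption you can verify with str(data).

```r
# Two invented rows, only to demonstrate the behaviour
csv_text <- "id,plz\n1,01067\n2,80331"

# Default import: plz is guessed to be an integer, so "01067" becomes 1067
guessed <- read.csv(text = csv_text)

# Forcing plz to character keeps the postal code exactly as written
as_text <- read.csv(text = csv_text, colClasses = c(plz = "character"))

class(guessed$plz)  # "integer"
as_text$plz         # "01067" "80331"
```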

The dataset contains the following variables:

Data Dictionary

Variable Name	Description	Item	Values
id	Number assigned to each participant in order of participation	(assigned in background)	Integer; 1-200
name	First and last name of participant	Please indicate your full name (first and last name)	String of characters
email	Email address of participant	What is your e-mail address?	String of characters
plz	German postal code	What is your postal code?	String of characters
gender	Gender of participant	What is your gender?	Factor; “male”, “female”, “non-binary”
age	Age of participant in years	What is your age in years?	Integer; 18-100
income	Personal annual income in Euros	What was your income over the last twelve months?	Integer
religion	Religion of participant	What is your religion?	Factor; “Catholicism”, “Protestantism”, “Islam”, “Eastern Orthodoxy”, “Judaism”, “Buddhism”, “Hinduism”, “Other”, “None”
education	Highest degree of education	What is your highest degree of education?	Factor; “no degree”, “trade school”, “high school”, “university”, “doctoral title”
pol_immigration	Likert item measuring opinion on immigration	The government should limit immigration more strictly than it currently does.	Integer; 1-5
pol_environment	Likert item measuring opinion on environment	Protecting the environment should be a top priority, even if it slows economic growth.	Integer; 1-5
pol_redistribution	Likert item measuring opinion on redistribution of wealth	The government should reduce income differences between rich and poor.	Integer; 1-5
pol_eu_integration	Likert item measuring opinion on membership in EU	Our country benefits from being a member of the European Union.	Integer; 1-5
ip_address	IP address (version 4) of participant’s device when answering survey	(collected in background)	String of characters
years_in_job	Number of years the participant has been in their current job	How many years have you been in your current job?	Integer; 0-n
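
The value ranges in the data dictionary can double as a quick sanity check after import. The sketch below shows one possible way to do this in base R. The helper name check_dictionary() and the three-row data frame are hypothetical and exist only to show the idea; missing values are not handled.

```r
# Hypothetical helper: test a few of the ranges stated in the data dictionary
check_dictionary <- function(df) {
  likert <- c("pol_immigration", "pol_environment",
              "pol_redistribution", "pol_eu_integration")
  c(
    age_in_range    = all(df$age >= 18 & df$age <= 100),
    likert_in_range = all(unlist(df[likert]) %in% 1:5),
    job_years_valid = all(df$years_in_job >= 0),
    gender_valid    = all(df$gender %in% c("male", "female", "non-binary"))
  )
}

# Invented rows for demonstration; the third row has an age outside 18-100
toy <- data.frame(
  age                = c(34, 58, 17),
  gender             = c("female", "male", "non-binary"),
  years_in_job       = c(3, 20, 0),
  pol_immigration    = c(2, 4, 5),
  pol_environment    = c(5, 3, 1),
  pol_redistribution = c(4, 4, 2),
  pol_eu_integration = c(5, 2, 3)
)

check_dictionary(toy)
```

On the toy rows, only age_in_range comes back FALSE. With the real file, you would pass your imported data frame instead, for example check_dictionary(data).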