Personal Data
- After completing this part of the tutorial, you will be able to distinguish between personal data and non-personal data, as well as sensitive and non-sensitive data, and be able to identify direct and indirect identifiers.
Watch this 3-minute video that explains what personal data are:
According to Article 4(1) GDPR, personal data are any information relating to an identified or identifiable natural person (the data subject). This definition is a cornerstone of privacy regulations, particularly the GDPR, and extends far beyond obvious identifiers.
An individual is considered identifiable if they can be recognized, either directly or indirectly, through various identifiers or factors. These identifiers can include:
Direct identifiers. These are pieces of information that identify an individual on their own, such as a name, address, social security number, bank account number, or email address.
Indirect or quasi identifiers. These are data that can, when combined with other pieces of data, lead to the identification of an individual. Examples include date of birth, age, gender, geographic location (like a ZIP or postal code), marital status, or details about events (e.g., hospital admission dates).
The assessment of whether a person is identifiable takes into account all means reasonably likely to be used by the data controller or another person to identify the natural person, including methods such as “singling out”. This “reasonably likely” criterion considers objective factors such as costs, time, effort, and the technological means available at the time of processing, as well as potential future technological developments. For instance, a dynamic IP address can qualify as personal data if it can be linked to a specific person, even if that linking capability resides with an Internet Service Provider and requires a court order (Breyer v Bundesrepublik Deutschland 2016).
In a research context, this means that even datasets without names or email addresses can contain personal data. A dataset with age, gender, postal code, and occupation might seem harmless - but in the case of rare occupations, an attacker might be able to single individuals out using just these variables. This is especially relevant for smaller or more specialized samples (e.g., employees of a specific company, students in a specific program) where combinations of demographic variables become more unique. It is also relevant when your data contains grouped observations - for example, couples, families, or school classes - since group members share certain attributes and may use their knowledge about other members to identify them in the dataset.
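To make the risk of singling out concrete, the following minimal sketch counts how many records share each combination of indirect identifiers. The toy data frame and all of its values are invented for illustration and are not part of the tutorial dataset; the names `toy`, `quasi_identifiers`, and `group_size` are likewise assumptions. A combination that occurs only once marks a person who could be singled out by anyone who knows those attributes.

```r
# Toy data, invented for illustration (not the tutorial dataset)
toy <- data.frame(
  age         = c(34, 34, 51, 51, 29),
  gender      = c("female", "female", "male", "male", "female"),
  postal_code = c("10115", "10115", "80331", "80331", "50667"),
  occupation  = c("teacher", "teacher", "nurse", "organ builder", "teacher"),
  stringsAsFactors = FALSE
)

quasi_identifiers <- c("age", "gender", "postal_code", "occupation")

# Build one key per row from the quasi-identifier columns
combo_key <- do.call(paste, c(toy[quasi_identifiers], sep = " | "))

# Count how often each combination occurs and attach the count to each row
combo_counts <- table(combo_key)
toy$group_size <- as.integer(combo_counts[combo_key])

# Rows whose combination occurs exactly once can be singled out
unique_rows <- toy[toy$group_size == 1, ]
print(unique_rows)
```

In this toy example, the two teachers with identical attributes cannot be told apart, while the remaining three people each form a group of size one. In a real dataset, the same check can be run on the indirect identifiers you plan to share.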
Special Types of Personal Data
The GDPR (Art. 9) describes several categories of sensitive data that receive heightened protection:
- racial or ethnic origin,
- political opinions,
- religious or philosophical beliefs,
- trade union membership,
- genetic data, biometric data (for unique identification),
- health data,
- and data concerning a natural person’s sex life or sexual orientation.
Other types of data that are commonly treated as sensitive, although they are not listed in Art. 9, include:
- information about criminal convictions and offenses
- financial data
In practice, this means you need to be extra careful with these variables when sharing data - they require stronger protection measures, and in some cases, it may be advisable to remove them from a shared dataset entirely if they are not essential to the research question.
Exercise: Personal Data
Here is a simulated dataset of 200 Germans. The data’s purpose is to answer whether certain political opinions are linked to religion. This dataset will be used throughout the exercises.
Download the dataset here:
Copy this to a new Markdown file in RStudio to import the data for further analysis:
# Load data based on downloaded file
data <- read.csv("../SimulatedData.csv") # Change based on the location of your data file

| Variable Name | Description | Item | Values |
|---|---|---|---|
| id | Number assigned to each participant in order of participation | (assigned in background) | Integer; 1-200 |
| name | First and last name of participant | Please indicate your full name (first and last name) | String of characters |
| email | Email address of participant | What is your e-mail address? | String of characters |
| plz | German postal code | What is your postal code? | String of characters |
| gender | Gender of participant | What is your gender? | Factor; “male”/“female”/“non-binary” |
| age | Age of participant in years | What is your age in years? | Integer; 18-100 |
| income | Personal annual income in Euros | What was your income over the last twelve months? | Integer |
| religion | Religion of participant | What is your religion? | Factor; “Catholicism”,“Protestantism”,“Islam”,“Eastern Orthodoxy”,“Judaism”, “Buddhism”, “Hinduism”, “Other”, “None” |
| education | Highest degree of education | What is your highest degree of education? | Factor; “no degree”,“trade school”,“high school”, “university”,“doctoral title” |
| pol_immigration | Likert item measuring opinion on immigration | The government should limit immigration more strictly than it currently does. | Integer; 1-5 |
| pol_environment | Likert item measuring opinion on environment | Protecting the environment should be a top priority, even if it slows economic growth. | Integer; 1-5 |
| pol_redistribution | Likert item measuring opinion on redistribution of wealth | The government should reduce income differences between rich and poor. | Integer; 1-5 |
| pol_eu_integration | Likert item measuring opinion on membership in EU | Our country benefits from being a member of the European Union. | Integer; 1-5 |
| ip_address | IP address (version 4) of participant’s device when answering survey | (collected in background) | String of characters |
| years_in_job | Number of years the participant has been in their current job | How many years have you been in your current job? | Integer; 0–n |
Exercise: Direct Identifiers, Indirect Identifiers, and Special Categories
Inspect the dataset data. Answer the following questions:
1. Which columns contain direct identifiers?
Direct identifiers: name, email, ip_address
Name is probably the most obvious identifier there is.
Email addresses are considered direct identifiers since they often contain names and therefore clearly specify certain individuals. Even when they do not contain a name, they can act as indirect identifiers by linking to an account or revealing a person’s organization.
IP addresses can be linked back to specific devices and, through Internet service providers, to individuals. The Court of Justice of the EU has ruled that even dynamic IP addresses can constitute personal data when the entity holding them has the legal means to obtain identifying information from the ISP (Breyer v Bundesrepublik Deutschland 2016). In a research context, they should be treated as direct identifiers.
2. Which columns contain indirect identifiers?
Indirect identifiers: gender, age, education, income, plz, years_in_job
Gender, age, education, postal code, income, and years in job are indirect identifiers: none of them identifies an individual on its own, but they can be combined with each other or with external knowledge. For example, if an attacker knows that a certain person participated in the study and knows that person’s education level, and only one or a few participants have that education level, the attacker could identify the person in the data.
3. Which columns contain special categories of personal data (= sensitive data)?
Special category data: all pol_ variables (political opinion), religion
The pol_immigration, pol_environment, pol_redistribution, and pol_eu_integration variables capture political opinions, and religion captures religious beliefs. Both categories are explicitly listed as sensitive data under Art. 9 GDPR. These variables require heightened protection because their disclosure could lead to discrimination or other harm for the individuals involved.
4. Is there any column that does not contain personal data?
Personal data: All columns contain personal data.
Any data that can be linked to an individual is considered personal data. As long as identification remains possible (e.g., via direct identifiers such as name), every variable in the dataset counts as personal data, and the GDPR applies.
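One way to keep track of this classification is a small lookup vector that records the role of each column. The sketch below follows the exercise answers above; the object names `variable_roles` and `columns_by_role` are hypothetical, and `id` is labeled "unclassified" because the answers do not assign it to one of the three groups.

```r
# Role of each column, following the exercise answers above
variable_roles <- c(
  id                 = "unclassified",
  name               = "direct identifier",
  email              = "direct identifier",
  ip_address         = "direct identifier",
  plz                = "indirect identifier",
  gender             = "indirect identifier",
  age                = "indirect identifier",
  income             = "indirect identifier",
  education          = "indirect identifier",
  years_in_job       = "indirect identifier",
  religion           = "special category",
  pol_immigration    = "special category",
  pol_environment    = "special category",
  pol_redistribution = "special category",
  pol_eu_integration = "special category"
)

# Group the column names by role for later processing steps
columns_by_role <- split(names(variable_roles), variable_roles)
print(columns_by_role)
```

Such a vector can serve as a simple codebook: `columns_by_role[["direct identifier"]]` then lists the columns to delete first.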
As a first step, you should always remove the direct identifiers as soon as you no longer need them. I recommend performing this step on every existing copy of the dataset. Remember to also check your data collection platform and any data shared with collaborators.
Exercise: Remove Direct Identifiers
Now, delete all direct identifiers from the dataset.
Any solution that deletes the name, email address, and IP address is correct. I use tidyverse syntax:
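A minimal sketch of one such solution, assuming the columns are named `name`, `email`, and `ip_address` as in the exercise above. The three-row data frame is a made-up stand-in so that the example runs on its own; in the exercise, the `select()` call would be applied to the imported `data` object instead.

```r
library(dplyr)

# Made-up stand-in for the imported dataset (all values are invented)
data <- data.frame(
  id         = 1:3,
  name       = c("A. Example", "B. Example", "C. Example"),
  email      = c("a@example.org", "b@example.org", "c@example.org"),
  age        = c(25, 40, 61),
  ip_address = c("192.0.2.1", "192.0.2.2", "192.0.2.3")
)

# Drop the three direct identifiers and keep all other columns
data_withoutdirectidentifiers <- data %>%
  select(-name, -email, -ip_address)

# Inspect the remaining column names
names(data_withoutdirectidentifiers)
```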
Save the new file:
write.csv(data_withoutdirectidentifiers, "../SimulatedData_noidentifiers.csv", row.names = FALSE)
Delete the old file.
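Saving and deleting can be combined with a quick check that no direct identifier survives in the exported file. The sketch below is an illustration under stated assumptions: it works in a temporary directory with made-up data so that it can be run safely, and the file names are placeholders for your own paths.

```r
# Work in a temporary directory so that no real files are touched
tmp_dir  <- tempdir()
old_file <- file.path(tmp_dir, "SimulatedData.csv")
new_file <- file.path(tmp_dir, "SimulatedData_noidentifiers.csv")

# Made-up stand-in for the original file (all values are invented)
original <- data.frame(
  id         = 1:2,
  name       = c("A. Example", "B. Example"),
  email      = c("a@example.org", "b@example.org"),
  ip_address = c("192.0.2.1", "192.0.2.2"),
  age        = c(25, 40)
)
write.csv(original, old_file, row.names = FALSE)

# Remove the direct identifiers and save the cleaned copy
direct_identifiers <- c("name", "email", "ip_address")
cleaned <- original[, !(names(original) %in% direct_identifiers), drop = FALSE]
write.csv(cleaned, new_file, row.names = FALSE)

# Re-read the export and confirm that no direct identifier column remains
exported <- read.csv(new_file)
stopifnot(!any(direct_identifiers %in% names(exported)))

# Only then delete the file that still contains the identifiers
file.remove(old_file)
file.exists(old_file)
```

Note that `file.remove()` only deletes the file from the file system; copies in backups, cloud sync folders, or on the data collection platform are not affected and need to be handled separately.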
Resources, Links, Examples
- In their tutorial, Van Ravenzwaaij et al. (2025) present example datasets and show how to assign variables to these categories.