Personal Data
Watch this video that explains what personal data are:
According to Article 4(1) GDPR, personal data is any information relating to an identified or identifiable natural person (also called the data subject). This definition is a cornerstone of privacy regulation, in particular the EU General Data Protection Regulation (GDPR), and extends far beyond obvious identifiers.
An individual is considered identifiable if they can be recognized, either directly or indirectly, through various identifiers or factors. These identifiers can include:
Direct identifiers. This is information that identifies an individual on its own, such as a home address, social security number, bank account number, or email address.
Indirect or quasi identifiers. These are data that can, when combined with other pieces of data, lead to the identification of an individual. Examples include date of birth, age, gender, geographic location (like a ZIP or postal code), marital status, or details about events (e.g., admission dates, procedure codes).
The assessment of whether a person is identifiable takes into account all means reasonably likely to be used by the data controller or another person to identify the natural person, including methods such as “singling out”. This “reasonably likely” criterion considers objective factors such as costs, time, effort, and the technological means available at the time of processing, as well as potential future technological developments. For instance, a dynamic IP address can qualify as personal data if it can be linked to a specific person, even if that linking capability resides with an Internet Service Provider and requires a court order (Breyer v Bundesrepublik Deutschland 2016).
In a research context, this means that even datasets without names or email addresses can contain personal data. A dataset with age, gender, postal code, and occupation might seem harmless - but if someone knows that a particular person participated in your study, they might be able to single them out using just these variables. This is especially relevant for smaller or more specialized samples (e.g., employees of a specific company, students in a specific program) where combinations of demographic variables become more unique. It is also relevant when your data contains grouped observations - for example, couples, families, or school classes - since group members share certain attributes and may use their knowledge about other members to identify them in the dataset.
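The singling-out risk described above can be checked with a few lines of R. The following is a minimal sketch on a hypothetical toy dataset (the data frame `toy` and its values are invented for illustration and are not part of the exercise data): it counts how many records share each combination of quasi-identifiers and keeps the combinations that occur only once.

```r
library(dplyr)

# Hypothetical toy data: no names or emails, only demographic variables
toy <- data.frame(
  age        = c(34, 34, 51, 51, 29),
  gender     = c("female", "female", "male", "male", "female"),
  plz        = c("80331", "80331", "10719", "10719", "80331"),
  occupation = c("nurse", "nurse", "teacher", "teacher", "pilot")
)

# Count how many records share each combination of quasi-identifiers
combo_sizes <- toy %>%
  count(age, gender, plz, occupation, name = "n_records")

# A combination that occurs only once singles out one person
unique_combos <- combo_sizes %>%
  filter(n_records == 1)

unique_combos
```

In this toy example, only the 29-year-old pilot has a unique combination. In small or specialized samples, a much larger share of records typically ends up in this group.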
Special Types of Personal Data
The GDPR (Art. 9) describes several categories of sensitive data that receive heightened protection:
- racial or ethnic origin,
- political opinions,
- religious or philosophical beliefs,
- trade union membership,
- genetic data, biometric data (for unique identification),
- health data,
- and data concerning a natural person’s sex life or sexual orientation.
Other data that are commonly treated as sensitive include:
- information about criminal convictions and offenses
- financial data
In practice, this means you need to be extra careful with these variables when sharing data - they require stronger anonymization measures, and in some cases it may be advisable to remove them from a shared dataset entirely if they are not essential to the research question.
Exercise: Personal Data
Here is a simulated dataset of 200 Germans. The data’s purpose is to answer whether certain political opinions are linked to religion. This dataset will be used throughout the exercises.
Download the dataset here:
Copy this to a new Markdown file in RStudio to import the data for further analysis:
library(tidyverse)
library(readxl)
library(writexl)
# Load data based on downloaded file
data <- read_xlsx("../SimulatedData.xlsx") # Change based on the location of your data file
data
# A tibble: 200 × 16
id name email plz gender age income years_in_job religion job_title
<dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
1 1 Julie … juli… 38533 female 90 7.56e4 11 Protest… Conserva…
2 2 Doreen… dore… 80539 female 88 1.02e5 3 Protest… Scientis…
3 3 Josef … jose… 81539 male 20 6.52e4 11 Protest… Designer…
4 4 Rosa R… rosa… 38547 female 23 7.18e4 3 Protest… Public a…
5 5 Petra … petr… 10719 female 73 4.40e4 7 Catholi… Counsell…
6 6 Mattia… matt… 86473 male 57 1.02e5 21 None Film/vid…
7 7 Hans-W… hans… 91583 male 72 6.44e4 12 None IT consu…
8 8 Sylvan… sylv… 80331 female 18 3.18e4 9 Protest… Investme…
9 9 Camill… cami… 74542 female 64 7.18e4 3 Judaism Solicitor
10 10 Herlin… herl… 29223 female 36 5.73e4 10 None Mental h…
# ℹ 190 more rows
# ℹ 6 more variables: education <chr>, pol_immigration <dbl>,
# pol_environment <dbl>, pol_redistribution <dbl>, pol_eu_integration <dbl>,
# ip_address <chr>
| Variable Name | Description | Item | Values |
|---|---|---|---|
| id | Number assigned to each participant in order of participation | (assigned in background) | Integer; 1-200 |
| name | First and last name of participant | Please indicate your full name (first and last name) | String of characters |
| email | Email address of participant | What is your e-mail address? | String of characters |
| plz | German postal code | What is your postal code? | String of characters |
| gender | Gender of participant | What is your gender? | Factor; “male”/“female”/“non-binary” |
| age | Age of participant in years | What is your age in years? | Integer; 18-100 |
| income | Personal annual income in Euros | What was your income over the last twelve months? | Integer |
| religion | Religion of participant | What is your religion? | Factor; “Catholicism”,“Protestantism”,“Islam”,“Eastern Orthodoxy”,“Judaism”, “Buddhism”, “Hinduism”, “Other”, “None” |
| job_title | Job title of participant | What is your job title? | String of characters |
| education | Highest degree of education | What is your highest degree of education? | Factor; “no degree”,“trade school”,“high school”, “university”,“doctoral title” |
| pol_immigration | Likert item measuring opinion on immigration | The government should limit immigration more strictly than it currently does. | Integer; 1-5 |
| pol_environment | Likert item measuring opinion on environment | Protecting the environment should be a top priority, even if it slows economic growth. | Integer; 1-5 |
| pol_redistribution | Likert item measuring opinion on redistribution of wealth | The government should reduce income differences between rich and poor. | Integer; 1-5 |
| pol_eu_integration | Likert item measuring opinion on membership in EU | Our country benefits from being a member of the European Union. | Integer; 1-5 |
| ip_address | IP address (version 4) of participant’s device when answering survey | (collected in background) | String of characters |
| years_in_job | Number of years the participant has been in their current job | How many years have you been in your current job? | Integer; 0–n |
Exercise: Direct Identifiers, Indirect Identifiers, and Special Categories
Inspect the dataset data. Answer the following questions:
1. Which columns contain direct identifiers?
2. Which contain indirect identifiers?
3. Which contain special categories of personal data?
4. Is there any column that does not contain personal data?
Direct identifiers: name, email, ip_address
Name is probably the most obvious identifier there is.
Email addresses are considered direct identifiers because they often contain names and therefore clearly point to specific individuals. Even when they do not contain a name, they can act as indirect identifiers by linking to an account or revealing a person’s organization.
IP addresses can be linked back to specific devices and, through Internet service providers, to individuals. The Court of Justice of the EU has ruled that even dynamic IP addresses can constitute personal data when the entity holding them has the legal means to obtain identifying information from the ISP (Breyer v Bundesrepublik Deutschland 2016). In a research context, they should be treated as direct identifiers.
Indirect identifiers: gender, age, education, job_title, income, plz, years_in_job
Gender, age, education, job title, postal code, income, and years in job are indirect identifiers: none of them identifies an individual on its own, but they can be combined with each other or with external knowledge. For example, if an attacker knows that a particular person participated in the study and knows their job title, they could identify that person in the data.
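The attack described above can be sketched as a simple filter. This is an illustration under assumptions: the data frame `released` is a hypothetical toy dataset with direct identifiers already removed, and the attacker's background knowledge (a female participant who works as a designer) is invented for the example.

```r
library(dplyr)

# Hypothetical released records: names, emails, and IP addresses are gone
released <- data.frame(
  id              = 1:4,
  gender          = c("female", "male", "female", "male"),
  age             = c(42, 42, 35, 58),
  job_title       = c("Solicitor", "Designer", "Designer", "IT consultant"),
  religion        = c("Judaism", "None", "Catholicism", "None"),
  pol_immigration = c(2, 4, 1, 5)
)

# Assumed background knowledge: the attacker knows a female participant
# who works as a designer
matches <- released %>%
  filter(job_title == "Designer", gender == "female")

# If exactly one row matches, the attacker learns that person's
# religion and political opinion
matches
```

Neither job title nor gender is unique on its own in this toy data, but the combination matches exactly one record, which exposes the special category variables in that row.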
Special category data: all pol\_ variables (political opinion), religion
The pol_immigration, pol_environment, pol_redistribution, and pol_eu_integration variables capture political opinions, and religion captures religious beliefs. Both categories are explicitly listed as sensitive data under Art. 9 GDPR. These variables require heightened protection because their disclosure could lead to discrimination or other harm for the individuals involved.
Personal data: All columns contain personal data.
Any data that can be linked to an individual is considered personal data. As long as identification (e.g., via direct identifiers such as name) is possible, it is considered personal data and GDPR applies.
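One way to document such a classification is a small lookup table kept alongside the codebook. The sketch below records the categories from the solution above. The object names `classification` and `direct_vars` are hypothetical, and labelling `id` as "unclassified" is an assumption made for illustration, since the solution does not assign it to a category.

```r
library(dplyr)

# Classification of each column, following the solution above
classification <- data.frame(
  variable = c(
    "id",
    "name", "email", "ip_address",
    "gender", "age", "education", "job_title", "income", "plz", "years_in_job",
    "religion", "pol_immigration", "pol_environment",
    "pol_redistribution", "pol_eu_integration"
  ),
  category = c(
    "unclassified",       # assumption: not assigned in the solution text
    rep("direct", 3),
    rep("indirect", 7),
    rep("special", 5)
  )
)

# Number of variables per category
classification %>%
  count(category)

# Names of all variables in one category, e.g. the direct identifiers
direct_vars <- classification %>%
  filter(category == "direct") %>%
  pull(variable)

direct_vars
```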
Exercise: Remove Direct Identifiers
Now, delete all direct identifiers from the dataset.
Any solution that deletes the name, email address, and IP address is correct. I use tidyverse syntax:
data_withoutdirectidentifiers <- data %>%
  select(-name, -email, -ip_address)
Save the new file:
write_xlsx(data_withoutdirectidentifiers, "../SimulatedData_noidentifiers.xlsx")
Learning Objective
- After completing this part of the tutorial, you will be able to distinguish between personal data and non-personal data, as well as sensitive and non-sensitive data, and be able to identify direct and indirect identifiers.
Exercises
- Identify variables that contain direct identifiers, indirect identifiers, and sensitive data
Resources, Links, Examples
- examples for how to categorize data: Van Ravenzwaaij et al. (2025)