Personal Data

Learning Objective

After completing this part of the tutorial, you will be able to distinguish between personal data and non-personal data, as well as sensitive and non-sensitive data, and be able to identify direct and indirect identifiers.

Article 4(1) GDPR defines personal data as any information relating to an identified or identifiable natural person (the “data subject”). This definition is a cornerstone of privacy regulations, particularly the GDPR, and extends far beyond obvious identifiers.

An individual is considered identifiable if they can be recognized, either directly or indirectly, through various identifiers or factors. These identifiers can include:

Direct identifiers. These are pieces of information that point directly to an individual, such as a home address, social security number, bank account number, or email address.
Indirect identifiers (quasi-identifiers). These are data that can lead to the identification of an individual when combined with other pieces of data. Examples include date of birth, age, gender, geographic location (like a ZIP or postal code), marital status, or details about events (e.g., hospital admission dates).

The assessment of whether a person is identifiable takes into account all means reasonably likely to be used by the data controller or another person to identify the natural person, including methods such as “singling out”. This “reasonably likely” criterion considers objective factors such as costs, time, effort, and the technological means available at the time of processing, as well as potential future technological developments. For instance, a dynamic IP address can qualify as personal data if it can be linked to a specific person, even if the additional information needed for that link is held by an Internet service provider and can only be obtained through legal channels (Breyer v Bundesrepublik Deutschland 2016).

In a research context, this means that even datasets without names or email addresses can contain personal data. A dataset with age, gender, postal code, and occupation might seem harmless—but in the case of rare occupations, an attacker might be able to single individuals out using just these variables. This is especially relevant for smaller or more specialized samples (e.g., employees of a specific company, students in a specific program) where combinations of demographic variables become more unique. It is also relevant when your data contains grouped observations—for example, couples, families, or school classes—since group members share certain attributes and may use their knowledge about other members to identify them in the dataset.
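To make the idea of singling out concrete, here is a minimal sketch that counts how many respondents share each combination of indirect identifiers. The toy data frame and the column name `n_same_combination` are my own illustrations, not part of the tutorial dataset; rows with a count of 1 are unique on these variables.

```r
library(dplyr)

# Toy data, invented for illustration (not the tutorial dataset)
toy <- data.frame(
  age        = c(34, 34, 34, 51, 51, 29),
  gender     = c("female", "female", "female", "male", "male", "female"),
  plz        = c("80331", "80331", "80331", "10115", "10115", "80331"),
  occupation = c("teacher", "teacher", "teacher", "nurse", "nurse", "organ builder")
)

quasi_identifiers <- c("age", "gender", "plz", "occupation")

# For every row, count how many rows share the same combination of quasi-identifiers
toy_counted <- toy %>%
  group_by(across(all_of(quasi_identifiers))) %>%
  mutate(n_same_combination = n()) %>%
  ungroup()

# Rows with a count of 1 are unique on these variables and could be singled out
at_risk <- toy_counted %>%
  filter(n_same_combination == 1)

print(at_risk)
```

In this toy example only the respondent with the rare occupation is unique. A count above 1 does not by itself make a dataset safe; it only shows that these particular variables do not isolate a single row.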

Special Types of Personal Data

The GDPR (Art. 9) defines special categories of personal data, often called sensitive data, that receive heightened protection:

racial or ethnic origin,
political opinions,
religious or philosophical beliefs,
trade union membership,
genetic data, biometric data (for unique identification),
health data,
and data concerning a natural person’s sex life or sexual orientation.

In practice, this means you need to be extra careful with these variables when sharing data—they require stronger protection measures, and in some cases, it may be advisable to remove them from a shared dataset entirely if they are not essential to the research question.

Information on criminal convictions and offences is a special case. Under Art. 10 GDPR, such data may only be processed under the control of an official authority, or where Union or Member State law authorises the processing and provides appropriate safeguards. The GDPR thus leaves room for national exemptions, but Germany (including, for example, Bavaria at the state level) has not created a clear, general basis for processing this data in research. If your work requires collecting data on criminal convictions, I recommend talking to your institution’s data protection officer.

Exercise: Personal Data

In this first exercise, we work with our dataset—the simulated dataset of 200 Germans introduced on the landing page, whose purpose is to answer whether certain political opinions are linked to religion. If you have not done so yet, download it and import it into R as described there. As a reminder, here is its data dictionary:

Data Dictionary

Variable Name	Description	Item	Values
id	Number assigned to each participant in order of participation	(assigned in background)	Integer; 1-200
name	First and last name of participant	Please indicate your full name (first and last name)	String of characters
email	Email address of participant	What is your e-mail address?	String of characters
plz	German postal code	What is your postal code?	String of characters
gender	Gender of participant	What is your gender?	Factor; “male”/“female”/“non-binary”
age	Age of participant in years	What is your age in years?	Integer; 18-100
income	Personal annual income in Euros	What was your income over the last twelve months?	Integer
religion	Religion of participant	What is your religion?	Factor; “Catholicism”,“Protestantism”,“Islam”,“Eastern Orthodoxy”,“Judaism”, “Buddhism”, “Hinduism”, “Other”, “None”
education	Highest degree of education	What is your highest degree of education?	Factor; “no degree”,“trade school”,“high school”, “university”,“doctoral title”
pol_immigration	Likert item measuring opinion on immigration	The government should limit immigration more strictly than it currently does.	Integer; 1-5
pol_environment	Likert item measuring opinion on environment	Protecting the environment should be a top priority, even if it slows economic growth.	Integer; 1-5
pol_redistribution	Likert item measuring opinion on redistribution of wealth	The government should reduce income differences between rich and poor.	Integer; 1-5
pol_eu_integration	Likert item measuring opinion on membership in EU	Our country benefits from being a member of the European Union.	Integer; 1-5
ip_address	IP address (version 4) of participant’s device when answering survey	(collected in background)	String of characters
years_in_job	Number of years the participant has been in their current job	How many years have you been in your current job?	Integer; 0-n

Exercise: Direct Identifiers, Indirect Identifiers, and Special Categories

Inspect the dataset `data` and answer the following questions:

1. Which columns contain direct identifiers?

Solution

Direct identifiers: `name`, `email`, `ip_address`

Name is probably the most obvious identifier there is.

Email addresses are considered direct identifiers since they often contain names and therefore point clearly to specific individuals. Even when they do not contain a name, they can act as indirect identifiers by linking to an account or revealing a person’s organization.

IP addresses can be linked back to specific devices and, through Internet service providers, to individuals. The Court of Justice of the EU has ruled that even dynamic IP addresses can constitute personal data when the entity holding them has the legal means to obtain identifying information from the Internet service provider (Breyer v Bundesrepublik Deutschland 2016). In a research context, they should be treated as direct identifiers.

2. Which columns contain indirect identifiers?

Solution

Indirect identifiers: `gender`, `age`, `education`, `income`, `plz`, `years_in_job`

Gender, age, education, postal code, income, and years in job are indirect identifiers: none of them identifies an individual on its own, but they can be combined with each other or with external knowledge. For instance, an attacker who knows that a particular person took part in the study and also knows that person’s age, postal code, and education level may be able to find the matching row in the data.

3. Which columns contain special categories of personal data (= sensitive data)?

Solution

Special category data: all `pol_` variables (political opinion), `religion`

The pol_immigration, pol_environment, pol_redistribution, and pol_eu_integration variables capture political opinions, and religion captures religious beliefs. Both categories are explicitly listed as sensitive data under Art. 9 GDPR. These variables require heightened protection because their disclosure could lead to discrimination or other harm for the individuals involved.

4. Is there any column that does not contain personal data?

Solution

Personal data: All columns contain personal data.

Any data that can be linked to an individual is considered personal data. As long as identification (e.g., via direct identifiers such as name) is possible, it is considered personal data, and GDPR applies.
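One way to keep this classification usable in later steps is to record it once in a lookup vector. The following is a minimal sketch in base R; the names `variable_class` and `vars_in_class()` are hypothetical, and filing `id` under "other" is my own assumption, since the solutions above only state that it is personal data.

```r
# Classification from the solutions above, stored as a named character vector.
# Assumption: `id` is filed under "other" because the solutions do not
# assign it to one of the three groups.
variable_class <- c(
  id                 = "other",
  name               = "direct",
  email              = "direct",
  ip_address         = "direct",
  plz                = "indirect",
  gender             = "indirect",
  age                = "indirect",
  income             = "indirect",
  education          = "indirect",
  years_in_job       = "indirect",
  religion           = "special",
  pol_immigration    = "special",
  pol_environment    = "special",
  pol_redistribution = "special",
  pol_eu_integration = "special"
)

# Hypothetical helper: return the names of all variables in one class
vars_in_class <- function(class) {
  names(variable_class)[variable_class == class]
}

# Which columns are direct identifiers?
print(vars_in_class("direct"))

# How many variables fall into each class?
print(table(variable_class))
```

A vector like this can then drive column selection, for example by passing `vars_in_class("direct")` to `dplyr::select()` inside `all_of()`, so the list of identifiers is written down in only one place.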

As a first step, always remove the direct identifiers as soon as you no longer need them. I recommend performing this step on every existing copy of the dataset. Remember to also check your data collection platform and any data shared with collaborators.

More generally, it helps to follow a consistent workflow for handling identifiers from the moment data comes in until you publish. The following is a good default:

An optimal workflow for handling identifiers

Download the raw data from your collection tool.
Remove direct identifiers as soon as you no longer need them—in every copy that exists (collection platform, cloud storage, collaborators’ copies, your own machine). Only keep them if there is a concrete, documented purpose, such as accounting, returning results to participants, or a genuine third-party requirement.
Remove the pseudonym key using the same logic. If either direct identifiers or a pseudonym key would serve the purpose, keep the key—it is the safer option—but store it separately from the data, with restricted access and a defined deletion date.
Store a secure master copy of the data without direct identifiers and pseudonyms—e.g., read-only on your institution’s password-protected server. Generate your working copy from it with a script, so the cleaning step is reproducible and auditable.
Analyze, write, and share with collaborators using this identifier-free working copy.
Anonymize the data if needed, following the steps explained in this tutorial.
Publish the anonymized version of the data.
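The step of keeping a pseudonym key separate from the data can be sketched as follows. This is an illustration under assumptions, not a prescribed procedure: the toy data, the pseudonym format, and the folder names are invented, and the temporary folders merely stand in for two differently protected storage locations.

```r
# Toy raw data, invented for illustration (not the tutorial dataset)
raw <- data.frame(
  name  = c("Anna Beispiel", "Ben Muster", "Cem Probe"),
  email = c("anna@example.org", "ben@example.org", "cem@example.org"),
  age   = c(34, 51, 29)
)

direct_identifiers <- c("name", "email")

# Assign each participant a pseudonym in random order, so the code itself
# carries no information about the person
raw$pseudonym <- sample(sprintf("P%04d", seq_len(nrow(raw))))

# Key: links pseudonyms to direct identifiers; store separately with restricted access
pseudonym_key <- raw[, c("pseudonym", direct_identifiers)]

# Working data: everything except the direct identifiers
working_data <- raw[, setdiff(names(raw), direct_identifiers)]

# Write the two pieces to different locations (temporary folders here;
# in practice the key would go to a separate, access-restricted place)
key_dir  <- file.path(tempdir(), "restricted_key")
data_dir <- file.path(tempdir(), "working_data")
dir.create(key_dir, showWarnings = FALSE)
dir.create(data_dir, showWarnings = FALSE)

write.csv(pseudonym_key, file.path(key_dir, "pseudonym_key.csv"), row.names = FALSE)
write.csv(working_data, file.path(data_dir, "data_pseudonymized.csv"), row.names = FALSE)
```

Once the documented purpose for the key is gone, deleting the key file is what breaks the link between the pseudonyms and the participants.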

Myth: “The funder wants the full raw data”

It is a common misconception that funding agencies require you to archive the complete raw dataset including direct identifiers. In reality, they almost always want the raw data without them. Never retain identifiers on this assumption—always verify that it is a real, written requirement before treating it as a reason to keep identifying information.

Exercise: Remove Direct Identifiers

Now, delete all direct identifiers from the dataset.

Solution

Any solution that deletes the name, email address, and IP address is correct. I use tidyverse syntax:

# Remove direct identifiers from the dataset
library(tidyverse)

data_withoutdirectidentifiers <- data %>%
  select(-name, -email, -ip_address)

Save the new file:

# Save the dataset without direct identifiers
write.csv(data_withoutdirectidentifiers, here::here("SimulatedData_noidentifiers.csv"), row.names = FALSE)

Delete the old file—unless one of the retention purposes above applies. Direct identifiers are often collected for a reason (e.g., a name to honor a withdrawal request, or an email to return results or flag an incidental finding), so the rule is not “always delete immediately” but “delete once the documented purpose is gone.” Where a purpose genuinely requires keeping identifiers or a pseudonym key, store that piece separately from the analysis data, restrict access to it, and set a date to delete it.
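Before deleting anything, it can be worth checking that the saved file really contains no direct identifiers. The sketch below reads the written file back and stops with an error if an identifier column survived. It uses a small toy stand-in for `data` and a temporary file path, both of which are assumptions made for illustration.

```r
library(dplyr)

# Toy stand-in for the tutorial dataset (values invented for illustration)
data <- data.frame(
  id         = 1:2,
  name       = c("Anna Beispiel", "Ben Muster"),
  email      = c("anna@example.org", "ben@example.org"),
  ip_address = c("192.0.2.10", "192.0.2.11"),
  age        = c(34, 51)
)

direct_identifiers <- c("name", "email", "ip_address")

# Remove the direct identifiers, as in the solution above
data_withoutdirectidentifiers <- data %>%
  select(-all_of(direct_identifiers))

# Write to a temporary file and read it back, so the check covers the saved file
out_file <- file.path(tempdir(), "SimulatedData_noidentifiers.csv")
write.csv(data_withoutdirectidentifiers, out_file, row.names = FALSE)
saved <- read.csv(out_file)

# Stop with an error if any direct identifier column is still present
leftover <- intersect(direct_identifiers, names(saved))
stopifnot(length(leftover) == 0)

# The number of rows should be unchanged by the cleaning step
stopifnot(nrow(saved) == nrow(data))
```

This only checks column names. Identifying information hidden inside other columns, such as a name typed into a free-text field, would not be caught this way.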

Resources, Links, Examples

In their tutorial, Van Ravenzwaaij et al. (2025) present example datasets and show how to sort the variables into these categories.

References

Breyer v Bundesrepublik Deutschland (Reference for a Preliminary Ruling - Processing of Personal Data - Directive 95/46/EC - Article 2(a) - Article 7(f) - Definition of “Personal Data” - Internet Protocol Addresses - Storage of Data by an Online Media Services Provider - National Legislation Not Permitting the Legitimate Interest Pursued by the Controller to Be Taken into Account). 2016. Court of Justice of the European Union, Case C-582/14.

Van Ravenzwaaij, Don, Marlon De Jong, Rink Hoekstra, et al. 2025. “De-Identification When Making Data Sets Findable, Accessible, Interoperable, and Reusable (FAIR): Two Worked Examples from the Behavioral and Social Sciences.” Advances in Methods and Practices in Psychological Science 8 (2): 1–23. https://doi.org/10.1177/25152459251336130.
