Personal Data


According to Article 4(1) GDPR, personal data is defined as any information relating to an identified or identifiable natural person (also called the data subject). This definition is a cornerstone of privacy regulations, particularly the EU General Data Protection Regulation (GDPR), and extends far beyond obvious identifiers.

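An individual is considered identifiable if they can be recognized, either directly or indirectly, through various identifiers or factors. Article 4(1) GDPR names the following examples: a name, an identification number, location data, an online identifier, or one or more factors specific to the physical, physiological, genetic, mental, economic, cultural, or social identity of that person.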

The assessment of whether a person is identifiable takes into account all means reasonably likely to be used by the data controller or another person to identify the natural person, including methods such as “singling out”. This “reasonably likely” criterion considers objective factors such as costs, time, effort, and the technological means available at the time of processing, as well as potential future technological developments. For instance, a dynamic IP address can qualify as personal data if it can be linked to a specific person, even if that linking capability resides with an Internet Service Provider and requires a court order (Breyer v Bundesrepublik Deutschland 2016).

In a research context, this means that even datasets without names or email addresses can contain personal data. A dataset with age, gender, postal code, and occupation might seem harmless - but if someone knows that a particular person participated in your study, they might be able to single them out using just these variables. This is especially relevant for smaller or more specialized samples (e.g., employees of a specific company, students in a specific program) where combinations of demographic variables become more unique. It is also relevant when your data contains grouped observations - for example, couples, families, or school classes - since group members share certain attributes and may use their knowledge about other members to identify them in the dataset.
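The singling-out risk described above can be checked by counting how many participants share each combination of indirect identifiers. The following is a minimal sketch, not a complete disclosure-risk assessment; the toy data and the names `toy`, `combo_sizes`, `n_same`, and `singled_out` are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: six participants, no names or e-mail addresses
toy <- tibble(
  age        = c(34, 34, 51, 51, 29, 62),
  gender     = c("female", "female", "male", "male", "female", "male"),
  plz        = c("80331", "80331", "10719", "10719", "80331", "38533"),
  occupation = c("Nurse", "Nurse", "Teacher", "Teacher", "Nurse", "Solicitor")
)

# Count how many participants share each combination of values
combo_sizes <- toy %>%
  count(age, gender, plz, occupation, name = "n_same")

# A combination held by exactly one participant can be singled out
singled_out <- combo_sizes %>%
  filter(n_same == 1)

singled_out
```

In this toy example, two of the six participants have a unique combination, even though the data contains no direct identifiers.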

Special Types of Personal Data

The GDPR (Art. 9) describes several categories of sensitive data that receive heightened protection:

  • racial or ethnic origin,
  • political opinions,
  • religious or philosophical beliefs,
  • trade union membership,
  • genetic data, biometric data (for unique identification),
  • health data,
  • and data concerning a natural person’s sex life or sexual orientation.

Other data that are not listed in Art. 9 but are commonly treated as sensitive include:

  • information about criminal convictions and offenses (addressed separately in Art. 10 GDPR),
  • financial data.

In practice, this means you need to be extra careful with these variables when sharing data - they require stronger anonymization measures, and in some cases it may be advisable to remove them from a shared dataset entirely if they are not essential to the research question.
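Removing non-essential sensitive variables before sharing could look like the sketch below. It assumes a hypothetical survey in which `health_status` and `union_member` are not needed for the research question; the data and all column names are illustrative and are not part of the exercise dataset.

```r
library(dplyr)

# Hypothetical toy data; column names are illustrative only
survey <- tibble(
  id            = 1:3,
  pol_opinion   = c(2, 5, 4),
  religion      = c("None", "Catholicism", "Islam"),
  health_status = c("good", "poor", "good"),
  union_member  = c(TRUE, FALSE, TRUE)
)

# Sensitive variables that the (assumed) research question does not need
not_needed <- c("health_status", "union_member")

# any_of() skips names that are not present instead of raising an error
shared <- survey %>%
  select(-any_of(not_needed))

names(shared)
```

Keeping the list of dropped variables in a single vector makes the decision easy to document and review.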

Exercise: Personal Data

Here is a simulated dataset of 200 Germans. The data’s purpose is to answer whether certain political opinions are linked to religion. This dataset will be used throughout the exercises.

Download the dataset here:

Copy this to a new Markdown file in RStudio to import the data for further analysis:

library(tidyverse)
library(readxl)
library(writexl)

# Load data based on downloaded file
data <- read_xlsx("../SimulatedData.xlsx") # Change based on the location of your data file

data
# A tibble: 200 × 16
      id name    email   plz gender   age income years_in_job religion job_title
   <dbl> <chr>   <chr> <dbl> <chr>  <dbl>  <dbl>        <dbl> <chr>    <chr>    
 1     1 Julie … juli… 38533 female    90 7.56e4           11 Protest… Conserva…
 2     2 Doreen… dore… 80539 female    88 1.02e5            3 Protest… Scientis…
 3     3 Josef … jose… 81539 male      20 6.52e4           11 Protest… Designer…
 4     4 Rosa R… rosa… 38547 female    23 7.18e4            3 Protest… Public a…
 5     5 Petra … petr… 10719 female    73 4.40e4            7 Catholi… Counsell…
 6     6 Mattia… matt… 86473 male      57 1.02e5           21 None     Film/vid…
 7     7 Hans-W… hans… 91583 male      72 6.44e4           12 None     IT consu…
 8     8 Sylvan… sylv… 80331 female    18 3.18e4            9 Protest… Investme…
 9     9 Camill… cami… 74542 female    64 7.18e4            3 Judaism  Solicitor
10    10 Herlin… herl… 29223 female    36 5.73e4           10 None     Mental h…
# ℹ 190 more rows
# ℹ 6 more variables: education <chr>, pol_immigration <dbl>,
#   pol_environment <dbl>, pol_redistribution <dbl>, pol_eu_integration <dbl>,
#   ip_address <chr>
Note: Data Dictionary

| Variable Name | Description | Item | Values |
|---|---|---|---|
| id | Number assigned to each participant in order of participation | (assigned in background) | Integer; 1-200 |
| name | First and last name of participant | Please indicate your full name (first and last name) | String of characters |
| email | Email address of participant | What is your e-mail address? | String of characters |
| plz | German postal code | What is your postal code? | String of characters |
| gender | Gender of participant | What is your gender? | Factor; “male”/“female”/“non-binary” |
| age | Age of participant in years | What is your age in years? | Integer; 18-100 |
| income | Personal annual income in Euros | What was your income over the last twelve months? | Integer |
| religion | Religion of participant | What is your religion? | Factor; “Catholicism”, “Protestantism”, “Islam”, “Eastern Orthodoxy”, “Judaism”, “Buddhism”, “Hinduism”, “Other”, “None” |
| job_title | Title of job of participant | What is your job title? | String of characters |
| education | Highest degree of education | What is your highest degree of education? | Factor; “no degree”, “trade school”, “high school”, “university”, “doctoral title” |
| pol_immigration | Likert item measuring opinion on immigration | The government should limit immigration more strictly than it currently does. | Integer; 1-5 |
| pol_environment | Likert item measuring opinion on environment | Protecting the environment should be a top priority, even if it slows economic growth. | Integer; 1-5 |
| pol_redistribution | Likert item measuring opinion on redistribution of wealth | The government should reduce income differences between rich and poor. | Integer; 1-5 |
| pol_eu_integration | Likert item measuring opinion on membership in EU | Our country benefits from being a member of the European Union. | Integer; 1-5 |
| ip_address | IP address (version 4) of participant’s device when answering survey | (collected in background) | String of characters |
| years_in_job | Number of years the participant has been in their current job | How many years have you been in your current job? | Integer; 0–n |
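The data dictionary describes plz as a string of characters, while the printed tibble shows it as `<dbl>`. German postal codes can begin with a zero, which numeric storage drops. The sketch below shows one possible way to restore five-character codes; the values in `plz_numeric` are hypothetical and are not taken from the exercise file.

```r
# Hypothetical postal codes as they might arrive after a numeric import;
# 1067 stands for "01067", whose leading zero was dropped
plz_numeric <- c(38533, 1067, 80539)

# Pad to five digits and store the result as character
plz_chr <- sprintf("%05d", as.integer(plz_numeric))

plz_chr
```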

Exercise: Direct Identifiers, Indirect Identifiers, and Special Categories

Inspect the dataset data. Answer the following questions:

1. Which columns contain direct identifiers?

2. Which columns contain indirect identifiers?

3. Which contain special categories of personal data?

4. Is there any column that does not contain personal data?

Note: Solution

Direct identifiers: name, email, ip_address

Name is probably the most obvious identifier there is.

Email addresses are considered direct identifiers since they often contain names and therefore clearly point to specific individuals. Even when they do not contain names, they can act as indirect identifiers by linking to an account or revealing a person’s organization.

IP addresses can be linked back to specific devices and, through Internet service providers, to individuals. The Court of Justice of the EU has ruled that even dynamic IP addresses can constitute personal data when the entity holding them has the legal means to obtain identifying information from the ISP (Breyer v Bundesrepublik Deutschland 2016). In a research context, they should be treated as direct identifiers.

Indirect identifiers: gender, age, education, job_title, income, plz, years_in_job

Gender, age, education, job title, income, postal code, and years in job are indirect identifiers: none of them identifies an individual on its own, but they can be combined with each other or with external knowledge. For example, an attacker who knows that a person participated in the study and knows their job title could identify that person in the data.
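Such a linkage can be illustrated with a join between the study data and what an attacker already knows. This is only a sketch with made-up values; the tibbles `study` and `known` are hypothetical, and a real attacker may hold more or fewer attributes.

```r
library(dplyr)

# Hypothetical study data without names (values are made up)
study <- tibble(
  id        = 1:4,
  gender    = c("female", "male", "female", "male"),
  age       = c(36, 57, 36, 72),
  job_title = c("Solicitor", "Designer", "Nurse", "IT consultant"),
  religion  = c("Judaism", "None", "Islam", "None")
)

# What the attacker is assumed to know about one participant
known <- tibble(gender = "female", age = 36, job_title = "Solicitor")

# Joining on the shared columns returns every matching row
matches <- study %>%
  inner_join(known, by = c("gender", "age", "job_title"))

matches
```

Because only one row matches, the attacker learns the religion recorded for that person, which is special category data.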

Special category data: all pol_ variables (political opinion), religion

The pol_immigration, pol_environment, pol_redistribution, and pol_eu_integration variables capture political opinions, and religion captures religious beliefs. Both categories are explicitly listed as sensitive data under Art. 9 GDPR. These variables require heightened protection because their disclosure could lead to discrimination or other harm for the individuals involved.

Personal data: All columns contain personal data.

Any data that can be linked to an individual is considered personal data. As long as identification remains possible (e.g., via direct identifiers such as name), every column in the dataset counts as personal data and the GDPR applies.

Exercise: Remove Direct Identifiers

Now, delete all direct identifiers from the dataset.

Note: Solution

Any solution that deletes the name, email address, and IP address is correct. I use tidyverse syntax:

data_withoutdirectidentifiers <- data %>% 
  select(-name, -email, -ip_address)

Save the new file:

write_xlsx(data_withoutdirectidentifiers, "../SimulatedData_noidentifiers.xlsx")
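Before sharing a file, it can be worth confirming that the direct identifiers are really gone. The sketch below shows one way to do this on a small stand-in dataset; `toy`, `toy_clean`, and `direct_identifiers` are hypothetical names, and the same check could be applied to the exercise data.

```r
library(dplyr)

# Hypothetical stand-in for the exercise data
toy <- tibble(
  id         = 1:2,
  name       = c("A. Example", "B. Example"),
  email      = c("a@example.org", "b@example.org"),
  ip_address = c("192.0.2.1", "192.0.2.2"),
  age        = c(30, 45)
)

direct_identifiers <- c("name", "email", "ip_address")

# all_of() raises an error if a listed column is missing (e.g., a typo)
toy_clean <- toy %>%
  select(-all_of(direct_identifiers))

# Stop with an error if any direct identifier survived
stopifnot(!any(direct_identifiers %in% names(toy_clean)))

names(toy_clean)
```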

Learning Objective

  • After completing this part of the tutorial, you will be able to distinguish between personal data and non-personal data, as well as sensitive and non-sensitive data, and be able to identify direct and indirect identifiers.

Exercises

  • Identify variables that contain direct identifiers, indirect identifiers, and sensitive data

References

Breyer v Bundesrepublik Deutschland (Reference for a Preliminary Ruling - Processing of Personal Data - Directive 95/46/EC - Article 2(a) - Article 7(f) - Definition of “Personal Data” - Internet Protocol Addresses - Storage of Data by an Online Media Services Provider - National Legislation Not Permitting the Legitimate Interest Pursued by the Controller to Be Taken into Account). 2016. C-582/14.
Van Ravenzwaaij, Don, Marlon De Jong, Rink Hoekstra, et al. 2025. “De-Identification When Making Data Sets Findable, Accessible, Interoperable, and Reusable (FAIR): Two Worked Examples from the Behavioral and Social Sciences.” Advances in Methods and Practices in Psychological Science 8 (2): 1–23. https://doi.org/10.1177/25152459251336130.