Data Anonymization and Pseudonymization

Anonymization and pseudonymization are distinct yet related approaches to data privacy. Both aim to protect individual identities within datasets, but they differ significantly in the level of identifiability that remains and in their legal implications. Anonymization aims to make data non-identifiable, which removes it from the scope of data protection regulations such as the GDPR. Pseudonymization only makes identification harder, so pseudonymized data remains personal data and stays subject to those regulations.

What is Data Anonymization?

Anonymization is the process of removing the association between identifying data and the data subject. The term also serves as an umbrella for the measures taken to protect individual identities in a dataset. Its fundamental goal is to transform personal data so thoroughly that it can no longer be traced back to a particular individual (jarmul2023practical, el2013anonymizing).

True, complete anonymization is very hard to achieve in practice (see also differential privacy).

What is Data Pseudonymization?

Pseudonymization (GDPR Recital 28, https://gdpr-info.eu/recitals/no-28/) is a process by which “original data are replaced with false data”. It is characterized by “the use of a false name” or other replacement values in place of real, direct identifiers. The critical distinction from anonymization is that pseudonymized data is still personal data and has to be treated as such. GDPR Recital 26 states explicitly that “personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person” (https://www.privacy-regulation.eu/en/recital-26-GDPR.htm).
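To make the idea concrete, here is a minimal Python sketch of pseudonymization by replacement. The function name `pseudonymize`, the field names, and the sample records are all hypothetical and chosen only for illustration; real projects usually need a more careful design (for example, secure storage of the key). The sketch shows why the result is still personal data: whoever holds the mapping can reverse the replacement.

```python
import secrets


def pseudonymize(records, id_field="name"):
    """Replace a direct identifier with a random pseudonym.

    Returns the pseudonymized records and the mapping from real
    identifier to pseudonym (the re-identification key).
    """
    key = {}  # real identifier -> pseudonym
    pseudonymized = []
    for record in records:
        real_id = record[id_field]
        if real_id not in key:
            # Random token, so the pseudonym itself reveals nothing
            key[real_id] = "P-" + secrets.token_hex(8)
        new_record = dict(record)  # copy, leave the input untouched
        new_record[id_field] = key[real_id]
        pseudonymized.append(new_record)
    return pseudonymized, key


# Made-up example data
records = [
    {"name": "Alice Example", "age": 34, "zip": "10115", "diagnosis": "asthma"},
    {"name": "Bob Example", "age": 51, "zip": "80331", "diagnosis": "diabetes"},
    {"name": "Alice Example", "age": 34, "zip": "10115", "diagnosis": "migraine"},
]

pseudonymized, key = pseudonymize(records)

# Anyone holding the key can undo the replacement
reverse_key = {pseudonym: real for real, pseudonym in key.items()}
reidentified = [reverse_key[r["name"]] for r in pseudonymized]
```

Note that the same person always receives the same pseudonym, and that indirect identifiers such as `age` and `zip` are left unchanged. Deleting the key removes only one of the routes to re-identification.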

Pseudonymized data “can be traced back to original data value”, and “indirect identifiers that remain in pseudonymized data are known to pose a potential re-identification risk”. Re-identification often occurs via linkage attacks, in which seemingly innocuous indirect identifiers (such as age, ZIP code, and gender) are combined with external information to identify individuals.
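A linkage attack can be sketched in a few lines of Python. Everything in this example is an assumption made for illustration: the function `link_records`, the field names, and both toy datasets (a pseudonymized table and a hypothetical public register). The sketch joins the two tables on the indirect identifiers and reports a match only when the combination is unique on both sides.

```python
from collections import defaultdict


def link_records(pseudonymized, external, quasi_identifiers):
    """Return {pseudonym: name} for every unambiguous match.

    A match is reported only if the combination of indirect identifiers
    occurs exactly once in each dataset.
    """
    def signature(row):
        return tuple(row[q] for q in quasi_identifiers)

    pseudo_by_sig = defaultdict(list)
    for row in pseudonymized:
        pseudo_by_sig[signature(row)].append(row)

    external_by_sig = defaultdict(list)
    for row in external:
        external_by_sig[signature(row)].append(row)

    matches = {}
    for sig, rows in pseudo_by_sig.items():
        candidates = external_by_sig.get(sig, [])
        if len(rows) == 1 and len(candidates) == 1:
            matches[rows[0]["pid"]] = candidates[0]["name"]
    return matches


# Made-up pseudonymized data: direct identifiers replaced by "pid"
pseudonymized = [
    {"pid": "P-01", "age": 34, "zip": "10115", "gender": "F", "diagnosis": "asthma"},
    {"pid": "P-02", "age": 51, "zip": "80331", "gender": "M", "diagnosis": "diabetes"},
    {"pid": "P-03", "age": 51, "zip": "80331", "gender": "M", "diagnosis": "flu"},
]

# Made-up external information, e.g. a public register
public_register = [
    {"name": "Alice Example", "age": 34, "zip": "10115", "gender": "F"},
    {"name": "Bob Example", "age": 51, "zip": "80331", "gender": "M"},
]

matches = link_records(pseudonymized, public_register, ("age", "zip", "gender"))
print(matches)
```

In this toy data, only `P-01` is linked to a name, because its combination of age, ZIP code, and gender is unique. `P-02` and `P-03` share the same combination, so neither can be linked unambiguously with this simple join.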

The GDPR treats pseudonymization as a first step towards reducing the risk of re-identification.

What is Privacy?

Important

While pseudonymized data is still subject to the GDPR, anonymized data is not.
However, until data is fully anonymized, it must be treated in accordance with the GDPR.

Exercises

Exercise: Anonymised, Pseudonymised, or Not Clear?

Scenario 1: Medical Records for Research

Pseudonymised — because the re-identification key still exists.


Scenario 2: City Council Survey

Anonymised — the data is aggregated, no individuals can be singled out.


Scenario 3: Fitness Tracker Data

Not clear — technically pseudonymised, but GPS traces might allow re-identification.


Scenario 4: University Exam Scores

Pseudonymised — student number acts as a pseudonym with a key in the system.


Scenario 5: Online Store Reviews

Not clear — identifiers are removed, but free text may reveal identity.


Scenario 6: Traffic Accident Database

Not clear / borderline anonymised — no key exists, but rare event details may still re-identify.


Scenario 7: Genetic Study

Pseudonymised — the lab can re-link using the mapping file.
