Mechanisms of Data Protection in Research

Overview

Researchers’ ethical and legal obligations bring them into contact with data protection at several points throughout the research process.

Show graph with research process and all mechanisms

List of mechanisms and stages:

Stage      Mechanisms
Plan       Data Management Plan, IRB Approval, Privacy by Design
Collect    Informed Consent, Data Minimization
Process    Technical and Organizational Measures, Anonymization, Data Subject Rights
Analyze    Technical and Organizational Measures, Anonymization, Data Subject Rights
Preserve   Technical and Organizational Measures, Anonymization, Data Subject Rights
Publish    Anonymization
Re-Use     Data Minimization

Add references and links to each section

Data Management Plan

  • great measure, helps with data minimization and purpose orientation

Institutional Review Board (IRB) Approval

  • usually purely ethical

  • may also consider data protection, but board members are usually not experts in this area

Privacy by Design

Privacy by Design (PbD) is a foundational concept and a set of principles that require privacy protection measures to be proactively integrated into the design and architecture of systems, business practices, and technologies from the very beginning (Cavoukian 2011).

  • principle in GDPR

Move the seven PbD principles to a call-out box

PbD is guided by seven principles, three of which are outlined below:

  1. Proactive and Preventive Approach: PbD requires researchers and data architects to anticipate and prevent privacy-invasive events before they occur. In health care research, for example, this means securing the integrity of patient data from the moment the data are collected.

  2. Privacy as the Default: Systems must be designed so that the maximum degree of privacy is delivered automatically. If a researcher or data subject takes no action, their privacy must remain intact at the maximum level.

  3. Balancing Utility and Privacy: PbD aims for full functionality in a positive-sum rather than a zero-sum manner. Privacy therefore cannot be “sacrificed” for a greater benefit, and utility should not be given up unnecessarily either. This is critical in research, where data must be of sufficient quality for the analyses to be useful and meaningful. The goal is to find a sweet spot between information and privacy: enough data utility for the purpose for which the data were collected, while preserving privacy for individuals.

Explain other principles

Data Minimization

  • principle in GDPR

  • no purposeless collection and storage of data (e.g., demographics)

  • call-out box on anonymous data collection

  • whenever possible: re-use data (no collection = best kind of data protection) and make own data re-usable (FAIR) (link to LMU OS tutorials)

Data minimization (in German, Datensparsamkeit) follows the principle of “No Data, No Risk” (or less data, less risk): the risk of privacy breaches decreases as the amount of data collected decreases. It pushes the data collector to collect only the minimum amount of data necessary to accomplish the task. Reducing the amount of stored data can also reduce the risk that the data store becomes a target of a privacy attack.
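As a minimal sketch of this idea, the following Python snippet reduces a record to the fields an analysis needs. The field names, the example values, and the assumed research question (relating test scores to age) are all hypothetical. Ideally the unneeded fields would never be collected in the first place; the sketch only illustrates the reduction step.

```python
# Hypothetical raw survey records; all field names and values are made up.
raw_records = [
    {"name": "Alice Example", "email": "alice@example.org",
     "birth_date": "1990-04-12", "zip": "80331", "score": 17},
    {"name": "Bob Example", "email": "bob@example.org",
     "birth_date": "1985-11-03", "zip": "80469", "score": 23},
]


def minimize(record):
    """Keep only what the assumed research question needs.

    Assumption: the analysis relates test scores to age, so the birth
    year is sufficient and name, email, and ZIP code are not needed.
    """
    return {
        "birth_year": int(record["birth_date"][:4]),  # coarsen date to year
        "score": record["score"],
    }


minimized = [minimize(r) for r in raw_records]
print(minimized)
```

Which fields count as necessary depends on the research question and should be decided at the planning stage, for example in the data management plan.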

Discuss issues of data minimization and privacy by design (i.e., norms, contextualization, exploration, reviewers)

Include call-out box on randomized response technique?

Technical and Organizational Measures (TOMs)

  • list of measures in GDPR that build additional layers of protection around data

  • example: secure storage location

  • example: storage limitation

  • example: pseudonymization

Pseudonymization

Explain a bit better

Pseudonymization (GDPR Recital 28, https://gdpr-info.eu/recitals/no-28/) is a process by which “original data are replaced with false data”. It is characterized by “the use of a false name” or other replacement values in place of real, direct identifiers. The most critical point is that pseudonymized data are still personal data. The GDPR explicitly states that “personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person” (Recital 26, https://www.privacy-regulation.eu/en/recital-26-GDPR.htm). Pseudonymized data must therefore be treated as personal data.

  • data are still personal data (GDPR applies to pseudonymized data), but adds an additional layer of protection
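The replacement of direct identifiers can be sketched in a few lines of Python. This is an illustrative sketch, not a prescribed procedure: the function name `pseudonymize`, the `P-` prefix, and the example records are hypothetical. The point it shows is that a key table linking pseudonyms to real names still exists, which is why the result remains personal data.

```python
import secrets


def pseudonymize(records, id_field="name"):
    """Replace a direct identifier with a random pseudonym.

    Returns the pseudonymized records and a key table
    (pseudonym -> original value). The key table allows re-linking,
    so it has to be stored separately and access-restricted.
    """
    key_table = {}   # pseudonym -> original identifier
    assigned = {}    # original identifier -> pseudonym
    result = []
    for record in records:
        original = record[id_field]
        if original not in assigned:
            pseudonym = "P-" + secrets.token_hex(4)
            while pseudonym in key_table:  # avoid rare collisions
                pseudonym = "P-" + secrets.token_hex(4)
            assigned[original] = pseudonym
            key_table[pseudonym] = original
        # Copy every field except the direct identifier.
        new_record = {k: v for k, v in record.items() if k != id_field}
        new_record["pseudonym"] = assigned[original]
        result.append(new_record)
    return result, key_table


# Hypothetical example data.
records = [
    {"name": "Alice Example", "age": 34, "score": 17},
    {"name": "Bob Example", "age": 51, "score": 23},
    {"name": "Alice Example", "age": 34, "score": 19},
]
pseudonymized, key_table = pseudonymize(records)
```

The same person receives the same pseudonym across records, so the data stay analyzable at the individual level. Indirect identifiers such as `age` are left untouched.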

Pseudonymized data “can be traced back to original data value” because “indirect identifiers that remain in pseudonymized data are known to pose a potential re-identification risk”. This re-identification often occurs via linkage attacks, where seemingly innocuous indirect identifiers (like age, ZIP code, and gender) are combined with external information to identify individuals.
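A linkage attack of this kind can be illustrated with a small Python sketch. All records, names, and the choice of age, ZIP code, and gender as linking fields are made up for illustration. The attacker is assumed to hold an external list that contains names together with the same indirect identifiers.

```python
# Hypothetical pseudonymized research data (no names, but indirect identifiers).
pseudonymized = [
    {"pseudonym": "P-01", "age": 34, "zip": "80331", "gender": "f", "diagnosis": "asthma"},
    {"pseudonym": "P-02", "age": 34, "zip": "80331", "gender": "m", "diagnosis": "diabetes"},
    {"pseudonym": "P-03", "age": 51, "zip": "80469", "gender": "f", "diagnosis": "migraine"},
]

# Hypothetical external information available to an attacker.
external = [
    {"name": "Alice Example", "age": 34, "zip": "80331", "gender": "f"},
    {"name": "Bob Example", "age": 34, "zip": "80331", "gender": "m"},
    {"name": "Dana Example", "age": 51, "zip": "80469", "gender": "f"},
    {"name": "Erin Example", "age": 51, "zip": "80469", "gender": "f"},
]

INDIRECT_IDENTIFIERS = ("age", "zip", "gender")


def link(pseudonymized, external, keys=INDIRECT_IDENTIFIERS):
    """Return pseudonyms that match exactly one person in the external data."""
    index = {}
    for person in external:
        combo = tuple(person[k] for k in keys)
        index.setdefault(combo, []).append(person["name"])
    reidentified = {}
    for record in pseudonymized:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match singles out one person
            reidentified[record["pseudonym"]] = candidates[0]
    return reidentified


matches = link(pseudonymized, external)
print(matches)
```

In this made-up example, two of the three pseudonyms are matched to a unique name. The third matches two people, so it is not uniquely re-identified, although the attacker has still narrowed it down to two candidates.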

The GDPR considers pseudonymisation a first step towards reducing the risk of re-identification.

Anonymization

Anonymization and pseudonymization are distinct but related approaches to data privacy. Both aim to protect individual identities within datasets, but they differ significantly in the level of identifiability that remains and in their legal implications. Anonymization aims to make data non-identifiable, which removes the data from the scope of data protection regulations. Pseudonymization only makes identification harder, so the data remain personal data and stay subject to those regulations, including the GDPR.

Anonymization is defined as the process that removes the association between the identifying data and the data subject. It is also used as an overarching term for everything done to protect individual identities in a dataset. Its fundamental goal is to transform personal data so thoroughly that they can no longer be traced back to a particular individual (jarmul2023practical, el2013anonymizing).

Complete anonymization is difficult to achieve in practice (see also differential privacy).
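One reason for this difficulty is that records can remain unique on their indirect identifiers even after names are removed. The following Python sketch counts such unique combinations before and after coarsening age into decades. The records, the field names, and the coarsening rule are hypothetical, and a count of zero is only a basic sanity check, not proof that a dataset is anonymous.

```python
from collections import Counter

# Hypothetical records without direct identifiers.
records = [
    {"age": 34, "zip": "80331", "gender": "f"},
    {"age": 36, "zip": "80331", "gender": "f"},
    {"age": 51, "zip": "80469", "gender": "m"},
    {"age": 58, "zip": "80469", "gender": "m"},
]

KEYS = ("age", "zip", "gender")


def count_unique(records, keys=KEYS):
    """Count records whose combination of indirect identifiers is unique."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return sum(1 for n in counts.values() if n == 1)


def coarsen_age(record):
    """Replace the exact age with the start of its decade (34 -> 30)."""
    coarse = dict(record)
    coarse["age"] = record["age"] // 10 * 10
    return coarse


unique_before = count_unique(records)
unique_after = count_unique([coarsen_age(r) for r in records])
print(unique_before, unique_after)
```

In this toy example every record is unique on exact age, ZIP code, and gender, while coarsening age leaves no unique combinations. Real datasets usually need more than one such step, and each step costs some analytic detail.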

  • process of deleting any relation to an identifiable person

  • when successful: GDPR does not apply anymore, no personal data anymore

  • depending on the data: complex process (focus of this tutorial)

  • strongest data protection mechanism?

  • does not necessarily solve all ethical mandates for data protection?

Data Subject Rights

  • access rights

  • data deletion/withdrawal of consent

Exercises

Exercise: Anonymised, Pseudonymised, or Not Clear?

Scenario 1: Medical Records for Research

Pseudonymised — because the re-identification key still exists.


Scenario 2: City Council Survey

Anonymised — the data is aggregated, no individuals can be singled out.


Scenario 3: Fitness Tracker Data

Not clear — technically pseudonymised, but GPS traces might allow re-identification.


Scenario 4: University Exam Scores

Pseudonymised — student number acts as a pseudonym with a key in the system.


Scenario 5: Online Store Reviews

Not clear — identifiers are removed, but free text may reveal identity.


Scenario 6: Traffic Accident Database

Not clear / borderline anonymised — no key exists, but rare event details may still re-identify.


Scenario 7: Genetic Study

Pseudonymised — the lab can re-link using the mapping file.

Learning Objective

  • After completing this part of the tutorial, you will know how data protection is integrated into the research process.
  • After completing this part of the tutorial, you will know the difference between anonymization and pseudonymization.

Exercises

  • Anonymization or Pseudonymization?

To Do List

  • create graphs

References

Cavoukian, Ann. 2011. “Privacy by Design in Law, Policy and Practice. A White Paper for Regulators, Decision-Makers and Policy-Makers.” Ontario, Canada: Canadian Information and Privacy Commissioner.
Hallinan, Dara, Franziska Boehm, Annika Külpmann, and Malte Elson. 2023. “Information Provision for Informed Consent Procedures in Psychological Research Under the General Data Protection Regulation: A Practical Guide.” Advances in Methods and Practices in Psychological Science 6 (1). https://doi.org/10.1177/25152459231151944.