Mechanisms of Data Protection in Research
Overview
How can researchers protect participants’ (personal) data throughout the research process? Data protection does not happen at a single point - it is relevant at every stage. The table below maps common data protection mechanisms to the stages of the research lifecycle:
| Stage | Mechanism |
|---|---|
| 1 Plan | Data Management Plan, IRB Approval, Privacy by Design |
| 2 Collect | Informed Consent, Data Minimization |
| 3 Process | Technical and Organizational Measures, Anonymization, Data Subject Rights |
| 4 Analyze | Technical and Organizational Measures, Anonymization, Data Subject Rights |
| 5 Preserve | Technical and Organizational Measures, Anonymization, Data Subject Rights |
| 6 Publish | Anonymization |
| 7 Re-Use | |
Show graph with research process and all mechanisms
As you can see, some mechanisms (like anonymization) are relevant across many stages, while others (like informed consent) are specific to a particular point. Let’s walk through each stage.
Stage 1: Plan
Data Management Plan
A data management plan (DMP) is a document that describes how you will handle your data during and after your research project. It forces you to think about data protection early - what data will you collect, how will you store it, who will have access, and how will you share it? Many funders now require a DMP, and writing one is a good exercise in data minimization and purpose limitation.
A useful template for DMPs is available at Zenodo.
Institutional Review Board (IRB) Approval
Ethics committees or IRBs review your study before data collection begins. Their focus is primarily ethical - does the study design respect participants’ rights and well-being? They may also consider data protection aspects, but they are typically not technical experts in anonymization or data security. Think of IRB approval as a necessary but not sufficient step: it gets you started on the right track, but you still need to handle the technical side of data protection yourself.
Privacy by Design
Privacy by Design (PbD) is a set of principles that require privacy protection to be built into the design of systems and processes from the very beginning, rather than added as an afterthought (Cavoukian 2011). It is also enshrined as a principle in the GDPR (Art. 25).
For research, PbD means thinking about privacy before you start collecting data: What is the minimum data you need? How will you store it securely? When and how will you anonymize it? A detailed data management plan can be a great tool for putting PbD into practice.
Cavoukian’s seven foundational principles, applied to research, are:
- Proactive, not reactive: Anticipate and prevent privacy risks before they occur. Don’t wait for a breach to think about data protection.
- Privacy as the default: The maximum degree of privacy should be delivered automatically. If a participant or researcher takes no action, privacy should remain intact.
- Privacy embedded into design: Privacy should be an integral part of your research design, not an add-on. Plan your data collection, storage, and sharing with privacy in mind from the start.
- Full functionality - positive-sum, not zero-sum: Privacy and data utility are not opposites. The goal is to find approaches that serve both - enabling meaningful research while protecting individuals.
- End-to-end security: Data should be protected throughout its entire lifecycle - from collection to deletion. This includes secure storage, encrypted transfers, and proper disposal.
- Visibility and transparency: Keep your data protection practices open and documented. Participants and oversight bodies should be able to verify that privacy is being maintained.
- Respect for user privacy: The interests of the individual should be kept central. In research, this means respecting participants’ autonomy, expectations, and rights regarding their data.
Stage 2: Collect
Informed Consent
Informed consent is both an ethical and a legal requirement. Under the GDPR, consent is one possible legal basis for processing personal data (Art. 6(1)(a)), and participants have the right to be informed about how their data will be used at the point of collection.
In practice, informed consent in research means telling participants what data you collect, why, how it will be stored, who will have access, and whether it will be shared or published. If you plan to anonymize and share the data, this should be stated in the consent form. A useful tutorial on writing GDPR-compliant consent forms for research is provided by Hallinan et al. (2023).
Consent forms are often not read or not understood - which is why consent alone is not a sufficient data protection mechanism.
Data Minimization
Data minimization (or in German, Datensparsamkeit) follows a simple principle: no data, no risk - or more realistically, less data, less risk. The GDPR requires that you only collect data that is “adequate, relevant and limited to what is necessary” for your purpose (Art. 5(1)(c)).
In research, this means resisting the temptation to collect demographics or other variables “just in case” or “because we always do.” Every additional variable you collect increases the risk of re-identification when the data is shared. Ask yourself: Do I actually need this variable to answer my research question? If not, don’t collect it.
There is also a flip side to data minimization: whenever possible, re-use existing data instead of collecting new data. Not collecting at all is the best form of data protection. This also aligns with the FAIR principles - making your own data findable, accessible, interoperable, and reusable reduces the need for redundant data collection and makes everybody’s life easier. (link to LMU OS tutorials)
Discuss issues of data minimization and privacy by design (i.e., norms, contextualization, exploration, reviewers)
Include call-out box on randomized response technique?
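One classic technique that combines data minimization with plausible deniability at collection time is randomized response. Below is a minimal sketch of the forced-response variant in Python (function names, the coin-flip probability of 1/2, and the simulated data are illustrative choices, not a fixed standard): each participant answers truthfully only half the time, so no individual answer reveals anything, yet the true proportion can still be estimated from the aggregate.

```python
import random

def randomized_response(true_answer: bool, rng: random.Random) -> bool:
    """Forced-response variant: with probability 1/2 answer truthfully,
    otherwise give a uniformly random yes/no regardless of the truth."""
    if rng.random() < 0.5:
        return true_answer
    return rng.random() < 0.5

def estimate_true_proportion(responses: list) -> float:
    """Under this scheme P(yes) = 0.5 * p_true + 0.25,
    so p_true = 2 * (observed_yes_rate - 0.25)."""
    observed = sum(responses) / len(responses)
    return 2 * (observed - 0.25)

rng = random.Random(42)
# Simulate 100,000 participants, 30% of whom hold the sensitive attribute.
truths = [rng.random() < 0.30 for _ in range(100_000)]
responses = [randomized_response(t, rng) for t in truths]
estimate = estimate_true_proportion(responses)  # close to 0.30
```

The price of this protection is statistical noise: the estimator is unbiased, but its variance is larger than for direct questioning, so randomized response needs larger samples.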
Stages 3-6: Process, Analyze, Preserve, and Publish
Technical and Organizational Measures (TOMs)
The GDPR requires controllers to implement “appropriate technical and organisational measures” to protect personal data (Art. 32). In a research context, this includes:
- Secure storage: Keep data on encrypted drives or institutional servers with access controls, not on personal laptops or unencrypted USB sticks. At LMU, for example, the LRZ provides cloud storage and sync services that keep data within Germany. However, standard LRZ services are not approved for special-category personal data (e.g., health data) without additional agreements - check with the LMU data protection officer if you work with sensitive data.
- Access control: Only people who need the data for the research should have access. Use role-based permissions where possible.
- Storage limitation: Don’t keep personal data longer than necessary. Define retention periods in your data management plan and stick to them. I like to put a reminder in my calendar for dates I need to delete data.
- Transfer security: When sharing data with collaborators, use encrypted channels - not email attachments.
- Pseudonymization: Replace direct identifiers (like names) with artificial identifiers (like participant codes) (see below for more information).
These measures add layers of protection around your data while it is still in personal form. They do not replace anonymization, but they are essential complements.
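The pseudonymization step mentioned above can be sketched in a few lines of Python (field names, the code format, and the example records are invented for illustration). The crucial organizational point is visible in the code: the key file linking codes to names is a separate object that must be stored apart from the research data, under stricter access control.

```python
import secrets

def pseudonymize(records):
    """Replace the 'name' field with a random participant code.
    The returned key file (code -> name) must be stored separately,
    with stricter access controls than the research data itself."""
    key_file = {}
    pseudonymized = []
    for record in records:
        code = f"P{secrets.token_hex(4)}"  # e.g. 'P3f9a1c2e'
        key_file[code] = record["name"]
        clean = {k: v for k, v in record.items() if k != "name"}
        clean["participant_code"] = code
        pseudonymized.append(clean)
    return pseudonymized, key_file

records = [
    {"name": "Anna Meier", "age": 34, "score": 87},
    {"name": "Ben Okafor", "age": 29, "score": 91},
]
data, key_file = pseudonymize(records)
# 'data' no longer contains names; 'key_file' alone allows re-linking.
```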
Pseudonymization
Pseudonymization is the process of replacing direct identifiers (like names) with artificial identifiers (like participant codes). It is an important first step, but it is crucial to understand its limits.
The key point: pseudonymized data is still personal data. The GDPR explicitly states that “personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person” (Recital 26). As long as a key file exists that links participant codes back to real identities - or as long as indirect identifiers in the data could allow re-identification - the data remains personal, and the GDPR fully applies.
Pseudonymized data is vulnerable to re-identification, especially through linkage attacks. This is when someone combines indirect identifiers that remain in the dataset (like age, postal code, and gender) with external information to identify individuals. We discussed this in the chapter on personal data, and we will look at it more formally in the chapter on privacy concepts.
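A linkage attack can be demonstrated in a few lines. The sketch below uses invented example data: a survey table with direct identifiers removed, and a hypothetical public list (e.g. a club membership roster) that shares the same quasi-identifiers.

```python
# "Anonymous"-looking survey data: direct identifiers removed, but
# quasi-identifiers (birth year, postal code, gender) remain.
survey = [
    {"birth_year": 1987, "postal_code": "80333", "gender": "f", "diagnosis": "depression"},
    {"birth_year": 1990, "postal_code": "80333", "gender": "m", "diagnosis": "none"},
]

# External, public dataset (e.g. a membership list) that includes names.
public_list = [
    {"name": "Clara Huber", "birth_year": 1987, "postal_code": "80333", "gender": "f"},
    {"name": "David Braun", "birth_year": 1990, "postal_code": "80333", "gender": "m"},
]

def link(survey_row, public):
    """Return every public record matching on all quasi-identifiers."""
    keys = ("birth_year", "postal_code", "gender")
    return [p for p in public if all(p[k] == survey_row[k] for k in keys)]

matches = link(survey[0], public_list)
# If exactly one person matches, the "anonymous" record is re-identified -
# including its sensitive 'diagnosis' value.
```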
The GDPR considers pseudonymization a useful safeguard and a step in the right direction - but it is not enough if your goal is to share data openly.
Anonymization
Anonymization goes further than pseudonymization. The goal is to transform personal data so that individuals can no longer be identified - directly or indirectly - by any means reasonably likely to be used. When data is truly anonymized, it falls outside the scope of the GDPR (Recital 26), which means it can be shared, published, and re-used without the legal restrictions that apply to personal data.
This sounds straightforward, but in practice, true anonymization is hard to achieve. Simply removing names and email addresses is usually not enough - indirect identifiers (demographics, rare combinations of attributes, free-text responses) can still enable re-identification. This is why anonymization is not just a single step but a process that requires careful analysis of risks and the application of specific techniques.
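One simple way to gauge this risk is to count how many records share each combination of indirect identifiers - an idea formalized as k-anonymity, which later chapters treat more rigorously. A minimal sketch in Python (column names and example rows are invented):

```python
from collections import Counter

def equivalence_class_sizes(rows, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)

def k_anonymity(rows, quasi_identifiers):
    """A dataset is k-anonymous if every quasi-identifier combination occurs
    at least k times; k is the size of the smallest equivalence class."""
    return min(equivalence_class_sizes(rows, quasi_identifiers).values())

rows = [
    {"age_group": "30-39", "postal_code": "803**", "gender": "f"},
    {"age_group": "30-39", "postal_code": "803**", "gender": "f"},
    {"age_group": "20-29", "postal_code": "803**", "gender": "m"},
]
k = k_anonymity(rows, ["age_group", "postal_code", "gender"])
# k = 1 here: the single record in the "20-29"/male class is unique,
# and therefore a re-identification risk.
```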
The bulk of this tutorial is dedicated to teaching you these techniques and helping you assess whether your data is sufficiently anonymized for sharing. The key distinction to remember:
| | Pseudonymization | Anonymization |
|---|---|---|
| What it does | Replaces direct identifiers with codes | Removes or transforms data so that individuals cannot be identified |
| Key file exists? | Yes (or re-identification is otherwise possible) | No (re-identification is not reasonably possible) |
| Still personal data? | Yes | No |
| GDPR applies? | Yes | No |
Insert discussion that this does not necessarily solve all ethical mandates for data protection
Data Subject Rights
Even during processing and analysis, participants retain their rights under the GDPR. The most relevant ones for this stage are:
- Right to access: Participants can ask to see what data you hold about them. This is easier to fulfill if your data is well-organized and pseudonymized (so you can look up individual records).
- Right to erasure: Participants can request that their data be deleted, for example if they withdraw consent. Again, this requires that you can identify their records - another reason for careful pseudonymization during the active research phase.
Once data is fully anonymized, these rights no longer apply (since you can no longer identify whose data is whose). This is why it is important to handle withdrawal requests before anonymization.
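Handling a withdrawal request during the pseudonymized phase is mechanically simple as long as the key file exists. A minimal sketch (record layout, field names, and example data are invented for illustration):

```python
def erase_participant(data, key_file, name):
    """Fulfil an erasure request while the key file still exists:
    look up the participant's code(s), then drop both the research
    record(s) and the key-file entry. Returns False if the name is
    unknown - which is also what happens after full anonymization,
    when no key file remains to resolve identities."""
    codes = [c for c, n in key_file.items() if n == name]
    for code in codes:
        data[:] = [r for r in data if r["participant_code"] != code]
        del key_file[code]
    return len(codes) > 0

data = [{"participant_code": "P01", "score": 87},
        {"participant_code": "P02", "score": 91}]
key_file = {"P01": "Anna Meier", "P02": "Ben Okafor"}
erased = erase_participant(data, key_file, "Anna Meier")
# Anna's record and her key-file entry are now gone.
```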
Exercises
Exercise: Anonymized, Pseudonymized, or Not Clear?
Scenario 1: Medical Records for Research
Pseudonymized - because the re-identification key still exists.
Scenario 2: City Council Survey
Anonymized - the data is aggregated, no individuals can be singled out.
Scenario 3: Fitness Tracker Data
Not clear - technically pseudonymized (usernames are replaced), but GPS traces of daily routines are highly unique and might allow re-identification even without the key. This is a good example of how removing direct identifiers is not always enough.
Scenario 4: University Exam Scores
Pseudonymized - the student number acts as a pseudonym, and the university holds the key.
Scenario 5: Online Store Reviews
Not clear - identifiers have been removed, but free-text reviews may contain personal information (mentions of location, profession, or experiences) that could allow re-identification. Free text is notoriously difficult to fully anonymize.
Scenario 6: Traffic Accident Database
Not clear / borderline anonymized - no key exists, but the combination of rare event details (exact time, location, car model) may still allow re-identification, especially for unusual accidents that were covered by the media.
Scenario 7: Genetic Study
Pseudonymized - the lab can re-link the data using the mapping file. Additionally, genetic data is inherently identifying and receives special protection under Art. 9 GDPR.
Learning Objectives
- After completing this part of the tutorial, you will know how data protection is integrated into the research process.
- After completing this part of the tutorial, you will know the difference between anonymization and pseudonymization.
Resources, Links, Examples
Reference to an instruction/checklist for data management plans (preferably by LMU)
Reference/call-out box to randomized response technique