Introduction to Data Privacy

What is Data Privacy?

This question has troubled (and engaged) scholars in privacy research for decades - there is no one agreed-upon definition. One quite influential way to formalize privacy is Nissenbaum’s framework of Contextual Integrity (Nissenbaum 2010).

Consider a common scenario in psychological research: A PhD student recruits participants for a study on workplace stress. Participants complete a survey listing their employer, job title, and burnout symptoms - trusting that this information stays within the research context. The researcher publishes a dataset that is technically anonymized: no names, no email addresses. But the dataset still includes employer, job title, age, and department. A participant’s manager downloads the open dataset and recognizes their own employee from this combination of attributes.
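To see how such a dataset can single people out, the check below counts how many records share each combination of the remaining attributes. It is a minimal sketch in Python: the records, the employer names, and the helper names group_sizes and unique_rows are invented for illustration and do not come from any real study or library.

```python
from collections import Counter

# Hypothetical toy records: no names or emails, only the indirect
# identifiers named in the scenario (all values are invented).
records = [
    {"employer": "Acme", "job_title": "Analyst", "age": 34, "department": "Finance"},
    {"employer": "Acme", "job_title": "Analyst", "age": 34, "department": "Finance"},
    {"employer": "Acme", "job_title": "Team Lead", "age": 41, "department": "Finance"},
    {"employer": "Globex", "job_title": "Engineer", "age": 29, "department": "R&D"},
]

QUASI_IDENTIFIERS = ("employer", "job_title", "age", "department")


def group_sizes(rows, keys):
    """Count how many rows share each combination of the given keys."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def unique_rows(rows, keys):
    """Return the rows whose key combination occurs exactly once."""
    sizes = group_sizes(rows, keys)
    return [row for row in rows if sizes[tuple(row[k] for k in keys)] == 1]


at_risk = unique_rows(records, QUASI_IDENTIFIERS)
print(f"{len(at_risk)} of {len(records)} records are unique on these attributes")
```

A record that is the only one with its combination can be recognized by anyone who already knows those attributes about a person, such as a manager. The smallest group size in a dataset is what the anonymization literature calls k in k-anonymity.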

Nothing was “leaked” in a traditional sense. The information flowed - but it violated the norms of the context in which it was shared. This is exactly what Nissenbaum’s framework captures.

At its core, it states: Privacy means that contextual information norms are respected in all information flows.

A context carries certain information, such as norms in your culture, laws, the people involved, and the situation you are in.

An information flow is any act of transferring information. Information norms depend on the sender, the recipient, the information type, and the conditions of the flow (e.g., voluntary, with permission, confidential), which Nissenbaum calls the transmission principle. Such norms may be normative or descriptive.

Note: Applying Contextual Integrity

In the example above, nothing was “leaked” in a traditional sense — yet most of us would call this a privacy violation. Nissenbaum’s framework explains why:

  • Context: Academic research, governed by ethics board approval and norms of confidentiality
  • Sender: Survey participant (employee)
  • Recipient: Researcher → later, the broader public via open data (including, unexpectedly, the participant’s manager)
  • Information type: Sensitive personal disclosures (burnout, workplace stress)
  • Transmission principle: Voluntary, under expectation of anonymity and confidentiality

The information type and sender haven’t changed — but the recipient and transmission principle have. The contextual norm is violated. This is a privacy violation not because a name was exposed, but because information flowed in a way that didn’t match the norms of the context in which it was shared.
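The comparison above can also be written down as a small data structure. The following Python sketch is a deliberately simplified illustration, not part of Nissenbaum’s framework: the class Flow, the dictionary NORM, the function violated_parameters, and all string labels are assumptions made for this example, and real contextual norms are far richer than a lookup table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One information flow, described by the parameters listed above."""
    sender: str
    recipient: str
    information_type: str
    transmission_principle: str


# Hypothetical encoding of the norm under which the participant shared her data.
# The labels are illustrative strings chosen for this example.
NORM = {
    "recipients": {"researcher"},
    "transmission_principles": {"confidential"},
}


def violated_parameters(flow, norm):
    """Return the names of the flow parameters that fall outside the norm."""
    violations = []
    if flow.recipient not in norm["recipients"]:
        violations.append("recipient")
    if flow.transmission_principle not in norm["transmission_principles"]:
        violations.append("transmission_principle")
    return violations


survey = Flow("participant", "researcher", "burnout symptoms", "confidential")
open_data = Flow("participant", "manager", "burnout symptoms", "public release")

print(violated_parameters(survey, NORM))     # []
print(violated_parameters(open_data, NORM))  # ['recipient', 'transmission_principle']
```

The sender and information type are identical in both flows. Only the recipient and the transmission principle differ, and those are exactly the parameters the check reports.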

In this tutorial, I will show

  • why we, as researchers, need to carefully consider these contextual norms, and
  • how we can enable information flows of data (i.e., when sharing data).

Throughout this tutorial, I will use both the terms data protection and privacy. Privacy is the broader term: it encompasses values, attitudes, and behaviors, but it is also used in the context of technical mechanisms. Data protection is more closely linked to concrete techniques and is the term used in laws. Both translate (give or take) to “Datenschutz” in German.

Why is Data Protection Important?

Tip: Reflection

Take a moment to think about the following scenarios. For each one, consider: Would you share your data? What conditions would need to be met? Think about the role of the researcher, the institution, and the way your data is handled. You might want to jot down a few notes before reading on.

  1. You participate in an interview on research integrity as part of a study. The interviewer asks you about any ethical transgressions you may have committed during your work and about your mental health.

  2. You fill out an online survey about your political opinions and voting behavior. The survey also asks for your age, gender, postal code, and occupation.

  3. A research team asks you to share your browsing history for a study on information consumption. They promise that all data will be “anonymized” before publication.

Data protection matters because research depends on trust. Participants share personal, sometimes sensitive, information with us because they trust that we handle it responsibly. If that trust erodes - say, because a dataset gets re-identified or a data breach makes headlines - people become less willing to participate in research. And without participants, there is no empirical research.

As researchers, our obligation to data protection arises from two directions that overlap quite a bit: ethics and law.

Data Protection in Research Ethics

The ethical obligation to protect participants’ data is deeply rooted in research ethics. The Declaration of Helsinki (1964, most recently revised in 2024), one of the most widely recognized ethical guidelines for research involving human subjects, explicitly requires that “every precaution must be taken to protect the privacy of research subjects and the confidentiality of their personal information.” While originally developed for medical research, its principles have been adopted across the social and behavioral sciences.

Professional codes of conduct reinforce these obligations. Psychologists, for example, are bound by the Ethical Principles of Psychologists and Code of Conduct (APA) or corresponding national guidelines (e.g., the Code of Ethics of the German Psychological Society) that require the protection of confidential information. Similar codes exist for sociologists, medical professionals, and other disciplines that regularly work with personal data.

In practice, institutional review boards (IRBs) or ethics committees enforce these obligations before a study begins. However, their review typically focuses on the study design and informed consent. They rarely evaluate the technical details of how data will be stored, shared, or anonymized after collection. This gap is where the practical skills covered in this tutorial become relevant.

One more thing worth mentioning: We tend to assume that informed consent solves the privacy problem. But research consistently shows that participants rarely read consent forms carefully, and even when they do, they may not fully understand the implications of data sharing (Hallinan et al. 2023). Consent is necessary, but it is not sufficient. More on this in the chapter on mechanisms of data protection.

Data Protection in Law

Beyond ethics, data protection is also a legal requirement. Many countries and regions have enacted data protection laws, and as a researcher, you need to be aware of the ones that apply to you. The most relevant regulation for researchers working in or with data from the European Union is the General Data Protection Regulation (GDPR), which has been in effect since 2018. It applies whenever you process personal data of individuals in the EU, regardless of where you or your institution are located. We cover the GDPR in more detail in the next chapter.

This is especially important to understand: if your institution is based in the EU, or if you process data of people residing in the EU, the GDPR applies to you. There is no way around it.

Outside the EU, other frameworks exist. The US has sector-specific laws like HIPAA (for health data) and FERPA (for educational records), but no single comprehensive federal data protection law. Countries like Brazil (LGPD), Japan (APPI), and South Korea (PIPA) have their own comprehensive frameworks, many of which share core principles with the GDPR.

For this tutorial, we focus on the GDPR. It is one of the most stringent and widely applicable frameworks. The anonymization techniques we cover are relevant regardless of which specific law applies.

Other laws can also restrict how research data may be shared, for example export control rules for knowledge relating to national security, or intellectual property (IP) law.

Conclusion

So where does this leave us? Both ethics and law tell us the same thing: we need to protect the personal data of our participants. At the same time, sharing data is a cornerstone of good scientific practice. Open data makes research more transparent, reproducible, and efficient. It enables replication, secondary analyses, and meta-research. Funders and journals increasingly require it, and for good reason.

These two goals - protecting privacy and sharing data - are not contradictory, but they do create a tension that needs to be managed thoughtfully. The guiding principle is: “as open as possible, as closed as necessary.” What exactly this means will depend on your data, your participants, and the context of your research. There is no one-size-fits-all answer.

The good news is that anonymization gives us the tools to navigate this tension. By learning how to properly anonymize data, you can share your research data openly while still protecting the people who made your research possible. That is what this tutorial is about.

Learning Objective

  • After completing this part of the tutorial, you will have a fundamental understanding of privacy and data protection.

Exercises

  • Reflection exercise regarding privacy risks

References

Hallinan, Dara, Franziska Boehm, Annika Külpmann, and Malte Elson. 2023. “(Un)informed Consent in Psychological Research: An Empirical Study on Consent in Psychological Research and the GDPR.” Journal of Open Access to Law 11 (2). https://doi.org/10.63567/574eqr35.
Nissenbaum, Helen Fay. 2010. Privacy in Context. Stanford University Press.