Add Data and Data Dictionary

Add Data File

You can now download the data set we have prepared for you and put it into your project folder:

palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data

The data set is from the package palmerpenguins (v0.1.1) and contains the recorded bill lengths and sex of penguins living on three islands in the Palmer Archipelago, Antarctica. It was made available by Allison Horst, Alison Hill, and Kristen Gorman under the license CC0 1.0.

Important 1: Consider Legal Restrictions Before Sharing

Everything you put into the project folder will be shared publicly. For reasons of reproducibility, this should include the data you analyze. Of course, you should only share them to the extent you are allowed to. Besides copyright and similar rights, you need to take into account applicable privacy laws (e.g., the GDPR for European citizens) and contractual obligations (e.g., with your data provider).

Privacy laws and contractual obligations may require you to create an anonymized or synthetic data set¹, in which case you should provide a reference to a repository where the originally measured data can be obtained from. For further information, you can watch the talk “Data anonymity” by Felix Schönbrodt recorded during the LMU Open Science Center Summer School 2023 and have a look at the accompanying slides. Another resource to look at is the presentation “Creating reproducible packages when data are confidential” by Vilhuber (2024).

Add a Data Dictionary

Whether or not distributing the data set, it is important to document the meaning (e.g., units) and values of its variables. This is typically done with a data dictionary (also called a codebook).

The recommendation for data dictionaries vary between fields - both in terms of the recommended content (i.e., what exactly should be documented) and the technical implementation (i.e., which file formats should be used).

For the purpose of this exercise, we keep it easy and propose that you manually create a file with the data dictionary (e.g., as a table in .xlsx, .docx, .ods, or as a Markdown table), documenting only the bare minimum.

Tip 1: Manual is too much work? Automatic generation of (machine-readable) data dictionaries

Our in-depth supplementary material “Automatic Generation of Data Dictionaries” explains how you can automatically create a data dictionary with an R package. The package reads the data set and extracts relevant information from it. This approach is in particular useful if you need to document data sets with many variables.

This advanced chapter also contains a section on how to create machine-readable data dictionaries.

A bare minimum data dictionary

Most standards for data dictionaries require at least this information for each variable in your data set:

name: The (machine‑readable) name of the variable
label: A short, human‑readable title or label for the variable
type: The data type of the variable (e.g., integer, float, string, date)
description: A brief description of what the variable measures or represents
values (for categorical variables): A mapping of codes to their meanings, for example: 1 = Male, 2 = Female, 9 = Missing)
units (for numeric measures): The units of measurement (e.g., kg, USD, years, or the scale of a survey response),
missing_codes: Any special codes used to denote missing or non‑applicable values (e.g., -99 = Not answered)

Here’s an example from a different data set, with variables in rows, and the dictionary in columns:

name	label	type	description	values	units	missing_codes
gender	Gender	integer	self‑identified gender	1 = Male; 2 = Female; 3 = Other; 9 = Missing		9 = Missing
age	Age	integer	age in years		years	-99 = Not answered
blood_pressure	Blood Pressure (systolic)	integer	systolic blood pressure		mmHg	-99 = Not measured
life_satisfaction	Life Satisfaction	integer	“How satisfied are you with your life?”	1 = Very dissatisfied; 2 = Dissatisfied; 3 = Neutral; 4 = Satisfied; 5 = Very satisfied	scale (1–5)	-99 = Not answered

✍️ Practical Exercise: Add your own data dictionary

Now go ahead and create a data dictionary for the penguins data set, in a software (text or spreadsheet) of your choice.

Save the data dictionary in the same folder as the actual data set file.

Tip 2: Data dictionary for penguins data set (Solution)

name	label	type	description	values	units	missing_codes
species	Species	string	Penguin species	Adelie; Gentoo; Chinstrap		NA = Missing
island	Island	string	Island where individual was observed	Torgersen; Biscoe; Dream		NA = Missing
bill_length_mm	Bill length	float	Length of the bill (beak)		mm	NA = Missing
bill_depth_mm	Bill depth	float	Depth of the bill (beak)		mm	NA = Missing
flipper_length_mm	Flipper length	integer	Length of the flipper		mm	NA = Missing
body_mass_g	Body mass	integer	Body mass		g	NA = Missing
sex	Sex	string	Sex of the penguin	male; female		NA = Missing
year	Year	integer	Year of data collection	2007; 2008; 2009

✍️ Practical Exercise: Add Data Citation and Attribution

All data relied upon should be cited in the manuscript to allow for precise identification and access. Now, it’s your turn to add an appropriate citation for the data set to the manuscript.

Hints:

You can find an appropriate BibTeX entry on the package website or with the function citation()².
Add the citation in the manuscript where it says “cite data here”.

Tip 3: Citing the Data Set (Solution)

Show the correct reference of the data set:

citation("palmerpenguins", auto = TRUE) |>
  transform(key = "horst2020") |>
  toBibtex()

Bibliography.bib

@Manual{horst2020,
  title = {palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data},
  author = {Allison Horst and Alison Hill and Kristen Gorman},
  year = {2022},
  note = {R package version 0.1.1},
  url = {https://CRAN.R-project.org/package=palmerpenguins},
  doi = {10.32614/CRAN.package.palmerpenguins},
}

Copy the BibTeX entry to the file Bibliography.bib. Then, find the line in the manuscript that says “cite data here” and replace it with a sentence such as the following:

Manuscript.qmd

The analyzed data are by @horst2020.

Render the document to check that the citation is displayed properly.

Terminal

quarto render Manuscript.qmd

Wrap up

Congrats! You documented your data set and cited it correctly.

To finalize this step, you can go through the commit routine:

Terminal

git status
git add .
git commit -m "Add data"

References

Vilhuber, L. (2024, October 14). Creating reproducible packages when data are confidential. https://labordynamicsinstitute.github.io/reproducibility-confidential/; Zenodo. https://doi.org/10.5281/zenodo.13927702

Footnotes

For example, using Amnesia, ARX, sdcTools, Synthpop, or OpenDP.↩︎
Note that this function requires to have the respective package installed.↩︎