palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data
The data set is from the package palmerpenguins (v0.1.1) and contains the recorded bill lengths and sex of penguins living on three islands in the Palmer Archipelago, Antarctica. It was made available by Allison Horst, Alison Hill, and Kristen Gorman under the license CC0 1.0.
Important 1: Consider Legal Restrictions Before Sharing
Everything you put into the project folder will be shared publicly. For reasons of reproducibility, this should include the data you analyze. Of course, you should only share them to the extent you are allowed to. Besides copyright and similar rights, you need to take into account applicable privacy laws (e.g., the GDPR for European citizens) and contractual obligations (e.g., with your data provider).
Privacy laws and contractual obligations may require you to create an anonymized or synthetic data set1, in which case you should provide a reference to a repository where the originally measured data can be obtained from. For further information, you can watch the talk “Data anonymity” by Felix Schönbrodt recorded during the LMU Open Science Center Summer School 2023 and have a look at the accompanying slides. Another resource to look at is the presentation “Creating reproducible packages when data are confidential” by Vilhuber (2024).
Add a Data Dictionary
Whether or not distributing the data set, it is important to document the meaning (e.g., units) and values of its variables. This is typically done with a data dictionary (also called a codebook).
The recommendation for data dictionaries vary between fields - both in terms of the recommended content (i.e., what exactly should be documented) and the technical implementation (i.e., which file formats should be used).
For the purpose of this exercise, we keep it easy and propose that you manually create a file with the data dictionary (e.g., as a table in .xlsx, .docx, .ods, or as a Markdown table), documenting only the bare minimum.
Tip 1: Manual is too much work? Automatic generation of (machine-readable) data dictionaries
Our in-depth supplementary material “Automatic Generation of Data Dictionaries” explains how you can automatically create a data dictionary with an R package. The package reads the data set and extracts relevant information from it. This approach is in particular useful if you need to document data sets with many variables.
This advanced chapter also contains a section on how to create machine-readable data dictionaries.
A bare minimum data dictionary
Most standards for data dictionaries require at least this information for each variable in your data set:
name: The (machine‑readable) name of the variable
label: A short, human‑readable title or label for the variable
type: The data type of the variable (e.g., integer, float, string, date)
description: A brief description of what the variable measures or represents
values (for categorical variables): A mapping of codes to their meanings, for example: 1 = Male, 2 = Female, 9 = Missing)
units (for numeric measures): The units of measurement (e.g., kg, USD, years, or the scale of a survey response),
missing_codes: Any special codes used to denote missing or non‑applicable values (e.g., -99 = Not answered)
Here’s an example from a different data set, with variables in rows, and the dictionary in columns:
name
label
type
description
values
units
missing_codes
gender
Gender
integer
self‑identified gender
1 = Male; 2 = Female; 3 = Other; 9 = Missing
9 = Missing
age
Age
integer
age in years
years
-99 = Not answered
blood_pressure
Blood Pressure (systolic)
integer
systolic blood pressure
mmHg
-99 = Not measured
life_satisfaction
Life Satisfaction
integer
“How satisfied are you with your life?”
1 = Very dissatisfied; 2 = Dissatisfied; 3 = Neutral; 4 = Satisfied; 5 = Very satisfied
scale (1–5)
-99 = Not answered
✍️ Practical Exercise: Add your own data dictionary
Now go ahead and create a data dictionary for the penguins data set, in a software (text or spreadsheet) of your choice.
Save the data dictionary in the same folder as the actual data set file.
Tip 2: Data dictionary for penguins data set (Solution)
name
label
type
description
values
units
missing_codes
species
Species
string
Penguin species
Adelie; Gentoo; Chinstrap
NA = Missing
island
Island
string
Island where individual was observed
Torgersen; Biscoe; Dream
NA = Missing
bill_length_mm
Bill length
float
Length of the bill (beak)
mm
NA = Missing
bill_depth_mm
Bill depth
float
Depth of the bill (beak)
mm
NA = Missing
flipper_length_mm
Flipper length
integer
Length of the flipper
mm
NA = Missing
body_mass_g
Body mass
integer
Body mass
g
NA = Missing
sex
Sex
string
Sex of the penguin
male; female
NA = Missing
year
Year
integer
Year of data collection
2007; 2008; 2009
✍️ Practical Exercise: Add Data Citation and Attribution
All data relied upon should be cited in the manuscript to allow for precise identification and access. Now, it’s your turn to add an appropriate citation for the data set to the manuscript.
Hints:
You can find an appropriate BibTeX entry on the package website or with the function citation()2.
Add the citation in the manuscript where it says “cite data here”.
Tip 3: Citing the Data Set (Solution)
Show the correct reference of the data set:
citation("palmerpenguins", auto =TRUE) |>transform(key ="horst2020") |>toBibtex()
Bibliography.bib
@Manual{horst2020,title = {palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data},author = {Allison Horst and Alison Hill and Kristen Gorman},year = {2022},note = {R package version 0.1.1},url = {https://CRAN.R-project.org/package=palmerpenguins},doi = {10.32614/CRAN.package.palmerpenguins},}
Copy the BibTeX entry to the file Bibliography.bib. Then, find the line in the manuscript that says “cite data here” and replace it with a sentence such as the following:
Manuscript.qmd
The analyzed data are by @horst2020.
Render the document to check that the citation is displayed properly.
Terminal
quarto render Manuscript.qmd
Wrap up
Congrats! You documented your data set and cited it correctly.
To finalize this step, you can go through the commit routine: