Add Data and Data Dictionary

Add Data File

You can now download the data set we have prepared for you and put it into your project folder:

data.csv

palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data

The data set is from the package palmerpenguins (v0.1.1) and contains body measurements (e.g., bill length) as well as the species and sex of penguins observed on three islands in the Palmer Archipelago, Antarctica. It was made available by Allison Horst, Alison Hill, and Kristen Gorman under the CC0 1.0 license.
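Before documenting the data, it can help to take a quick look at it. The following is a minimal sketch in base R, assuming that data.csv sits in your project folder and that missing values are stored as the string NA; the helper name read_penguins is made up for this illustration.

```r
# Minimal sketch: read the downloaded file and take a first look.
# Assumptions: data.csv is in the working directory (the project folder)
# and missing values are written as "NA".
read_penguins <- function(path = "data.csv") {
  stopifnot(file.exists(path))
  read.csv(path, na.strings = "NA", stringsAsFactors = FALSE)
}

if (file.exists("data.csv")) {
  penguins <- read_penguins()
  str(penguins)                    # variable names and R types
  print(colSums(is.na(penguins)))  # number of missing values per variable
}
```

The variable names, types, and missing-value counts shown here are the raw material for the data dictionary described in the next section.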

Add a Data Dictionary

Whether or not you distribute the data set, it is important to document the meaning (e.g., units) and values of its variables. This is typically done with a data dictionary (also called a codebook).

Recommendations for data dictionaries vary between fields, both in terms of the recommended content (i.e., what exactly should be documented) and the technical implementation (i.e., which file formats should be used).

For the purpose of this exercise, we keep it simple and propose that you manually create a file with the data dictionary (e.g., as a table in .xlsx, .docx, .ods, or as a Markdown table), documenting only the bare minimum.

Tip 1: Manual is too much work? Automatic generation of (machine-readable) data dictionaries

Our in-depth supplementary material “Automatic Generation of Data Dictionaries” explains how you can automatically create a data dictionary with an R package. The package reads the data set and extracts relevant information from it. This approach is particularly useful if you need to document data sets with many variables.

This advanced chapter also contains a section on how to create machine-readable data dictionaries.

A bare minimum data dictionary

Most standards for data dictionaries require at least this information for each variable in your data set:

  • name: The (machine‑readable) name of the variable
  • label: A short, human‑readable title or label for the variable
  • type: The data type of the variable (e.g., integer, float, string, date)
  • description: A brief description of what the variable measures or represents
  • values (for categorical variables): A mapping of codes to their meanings (e.g., 1 = Male, 2 = Female, 9 = Missing)
  • units (for numeric measures): The units of measurement (e.g., kg, USD, years, or the scale of a survey response)
  • missing_codes: Any special codes used to denote missing or non‑applicable values (e.g., -99 = Not answered)
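The fields listed above can be laid out as an empty table that you then fill in by hand. The sketch below uses base R only and is not the package from the supplementary material; the function name dictionary_skeleton and the toy data are made up for illustration. It pre-fills name, the R class as a starting point for type, and the observed categories for text variables, and leaves everything else blank.

```r
# Sketch: one dictionary row per variable, with the fields listed above.
# Only `name`, `type` (R class), and `values` (for text variables) are
# pre-filled; label, description, units, and missing_codes need human input.
dictionary_skeleton <- function(data) {
  observed_values <- function(x) {
    if (is.character(x) || is.factor(x)) {
      paste(sort(unique(as.character(x[!is.na(x)]))), collapse = "; ")
    } else {
      ""
    }
  }
  data.frame(
    name          = names(data),
    label         = "",
    type          = unname(vapply(data, function(x) class(x)[1], character(1))),
    description   = "",
    values        = unname(vapply(data, observed_values, character(1))),
    units         = "",
    missing_codes = "",
    stringsAsFactors = FALSE
  )
}

# Toy data for illustration only
toy <- data.frame(
  species     = c("Gentoo", "Adelie", NA),
  body_mass_g = c(5000L, 3750L, NA),
  stringsAsFactors = FALSE
)
print(dictionary_skeleton(toy))
```

Note that R reports classes such as character and numeric; translate these into the type vocabulary you use in your dictionary (e.g., string, float).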

Here’s an example from a different data set, with variables in rows and the dictionary fields in columns:

| name | label | type | description | values | units | missing_codes |
|---|---|---|---|---|---|---|
| gender | Gender | integer | self‑identified gender | 1 = Male; 2 = Female; 3 = Other; 9 = Missing | | 9 = Missing |
| age | Age | integer | age in years | | years | -99 = Not answered |
| blood_pressure | Blood Pressure (systolic) | integer | systolic blood pressure | | mmHg | -99 = Not measured |
| life_satisfaction | Life Satisfaction | integer | “How satisfied are you with your life?” | 1 = Very dissatisfied; 2 = Dissatisfied; 3 = Neutral; 4 = Satisfied; 5 = Very satisfied | scale (1–5) | -99 = Not answered |
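To see why the values and missing_codes fields matter, consider what happens when someone analyzes such data: a code like -99 left in place would silently distort a mean. The following sketch applies the codes for gender and age from the example dictionary to a small, entirely hypothetical data frame; the helper recode_missing is likewise made up for illustration.

```r
# Hypothetical survey data that uses the codes documented in the example.
survey <- data.frame(
  gender = c(1, 2, 9, 3),
  age    = c(34, -99, 51, 28)
)

# Replace documented missing codes with NA.
recode_missing <- function(x, codes) {
  x[x %in% codes] <- NA
  x
}

survey$age    <- recode_missing(survey$age, codes = -99)
survey$gender <- recode_missing(survey$gender, codes = 9)

# Turn the remaining codes into labelled categories, as listed under `values`.
survey$gender <- factor(survey$gender,
                        levels = c(1, 2, 3),
                        labels = c("Male", "Female", "Other"))

print(survey)
print(mean(survey$age, na.rm = TRUE))  # the mean now ignores the former -99
```

Without the dictionary, a reader of the raw file could not know that 9 and -99 are placeholders rather than real observations.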

✍️ Practical Exercise: Add your own data dictionary

Now go ahead and create a data dictionary for the penguins data set, using software of your choice (e.g., a text editor or a spreadsheet program).

Save the data dictionary in the same folder as the actual data set file.

| name | label | type | description | values | units | missing_codes |
|---|---|---|---|---|---|---|
| species | Species | string | Penguin species | Adelie; Gentoo; Chinstrap | | NA = Missing |
| island | Island | string | Island where individual was observed | Torgersen; Biscoe; Dream | | NA = Missing |
| bill_length_mm | Bill length | float | Length of the bill (beak) | | mm | NA = Missing |
| bill_depth_mm | Bill depth | float | Depth of the bill (beak) | | mm | NA = Missing |
| flipper_length_mm | Flipper length | integer | Length of the flipper | | mm | NA = Missing |
| body_mass_g | Body mass | integer | Body mass | | g | NA = Missing |
| sex | Sex | string | Sex of the penguin | male; female | | NA = Missing |
| year | Year | integer | Year of data collection | 2007; 2008; 2009 | | |
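Once the dictionary exists, a quick consistency check can catch variables that were forgotten or misspelled. The sketch below compares variable names only; the function name check_dictionary, the toy objects, and the file name data_dictionary.csv in the comment are assumptions made for this illustration.

```r
# Sketch: compare the variables in the data with those in the dictionary.
check_dictionary <- function(data, dictionary) {
  list(
    undocumented = setdiff(names(data), dictionary$name),  # in data, not documented
    not_in_data  = setdiff(dictionary$name, names(data))   # documented, not in data
  )
}

# Toy stand-ins for the real files. With your own files you might use, e.g.,
#   toy_data       <- read.csv("data.csv")
#   toy_dictionary <- read.csv("data_dictionary.csv")  # hypothetical file name
toy_data <- data.frame(species = "Adelie", island = "Dream", year = 2007L)
toy_dictionary <- data.frame(name = c("species", "island", "sex"))

print(check_dictionary(toy_data, toy_dictionary))
```

Both list elements being empty means every variable is documented and the dictionary contains no stale entries.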

✍️ Practical Exercise: Add Data Citation and Attribution

All data relied upon should be cited in the manuscript to allow for precise identification and access. Now, it’s your turn to add an appropriate citation for the data set to the manuscript.

Hints:

  • You can find an appropriate BibTeX entry on the package website or with the function citation() (see footnote 2).
  • Add the citation in the manuscript where it says “cite data here”.

Show the correct reference of the data set:

citation("palmerpenguins", auto = TRUE) |>
  transform(key = "horst2020") |>
  toBibtex()

Bibliography.bib
@Manual{horst2020,
  title = {palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data},
  author = {Allison Horst and Alison Hill and Kristen Gorman},
  year = {2022},
  note = {R package version 0.1.1},
  url = {https://CRAN.R-project.org/package=palmerpenguins},
  doi = {10.32614/CRAN.package.palmerpenguins},
}

Copy the BibTeX entry to the file Bibliography.bib. Then, find the line in the manuscript that says “cite data here” and replace it with a sentence such as the following:

Manuscript.qmd
The analyzed data are by @horst2020.
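If you would rather script the copy step than paste the entry by hand, the following sketch is one option. It assumes you build the entry yourself with utils::bibentry() from the field values shown above, so the palmerpenguins package does not need to be installed; the helper append_bibtex is made up for this illustration.

```r
# Sketch: build the BibTeX entry by hand and append it to a .bib file.
# Assumption: field values are copied from the entry shown above.
entry <- utils::bibentry(
  bibtype = "Manual",
  key     = "horst2020",
  title   = "palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data",
  author  = c(utils::person("Allison", "Horst"),
              utils::person("Alison", "Hill"),
              utils::person("Kristen", "Gorman")),
  year    = "2022",
  note    = "R package version 0.1.1",
  url     = "https://CRAN.R-project.org/package=palmerpenguins",
  doi     = "10.32614/CRAN.package.palmerpenguins"
)

# Append the formatted entry, followed by an empty line, to the given file.
append_bibtex <- function(entry, bib_file) {
  cat(as.character(utils::toBibtex(entry)), "",
      file = bib_file, sep = "\n", append = TRUE)
}

# Uncomment to write to your bibliography (run once to avoid duplicates):
# append_bibtex(entry, "Bibliography.bib")
```

Because the function appends, running it twice would add the entry twice, which is why the call is left commented out here.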

Render the document to check that the citation is displayed properly.

Terminal
quarto render Manuscript.qmd

Wrap up

Congrats! You documented your data set and cited it correctly.

To finalize this step, you can go through the commit routine:

Terminal
git status
git add .
git commit -m "Add data"

Footnotes

  1. For example, using Amnesia, ARX, sdcTools, Synthpop, or OpenDP.

  2. Note that this function requires the respective package to be installed.