Add Data

Add Data File

You can now download the data set we have prepared for you and put it into your project folder:

data.csv

palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data

The data set is from the package palmerpenguins (v0.1.1) and contains, among other variables, the recorded bill lengths and sex of penguins living on three islands in the Palmer Archipelago, Antarctica. It was made available by Allison Horst, Alison Hill, and Kristen Gorman under the license CC0 1.0.
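After downloading, it can be worth confirming that the file contains the columns you expect. The following is a minimal base-R sketch, not part of the original tutorial: the helper `missing_columns()` is hypothetical, and the demonstration uses a small stand-in file instead of the real data.csv.

```r
# Column names as they appear in the data dictionary of this tutorial
expected_cols <- c(
  "species", "island", "bill_length_mm", "bill_depth_mm",
  "flipper_length_mm", "body_mass_g", "sex", "year"
)

# Hypothetical helper: returns the expected columns that a CSV file lacks
missing_columns <- function(path, expected) {
  stopifnot(file.exists(path))
  header <- names(read.csv(path, nrows = 1))
  setdiff(expected, header)
}

# Demonstration with a small stand-in file; in the project, pass "data.csv"
demo_file <- tempfile(fileext = ".csv")
writeLines(c(
  paste(expected_cols, collapse = ","),
  "Adelie,Torgersen,39.1,18.7,181,3750,male,2007"
), demo_file)

missing_columns(demo_file, expected_cols)
```

An empty result means that all expected columns are present.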

Add Data Dictionary

Whether or not you distribute the data set, it is important to document the meaning (e.g., units) and values of its variables. This is typically done with a data dictionary (also called a codebook). In the following, we demonstrate how to create a simple data dictionary using the R package datawizard. You can install it now using:

Console
renv::install("datawizard")

You can put the code that follows in a separate document. Create it by clicking on File > New File > Quarto Document…. Choose a title such as Data Dictionary, select HTML as format, uncheck the use of the visual markdown editor, and click on Create. Remove everything except the YAML header (between the ---). To make the HTML file self-contained, also set embed-resources: true such that the YAML header looks as follows:

data_dictionary.qmd
---
title: "Data Dictionary"
format:
  html:
    embed-resources: true
---

Then, save it as data_dictionary.qmd by clicking on File > Save.

To create the actual data dictionary, first write a description for every column so that others can understand what the variable names mean. Where necessary, also document the variables’ values; this is especially important if their meaning is not obvious. In the following, we demonstrate this by storing each penguin species’ binomial name alongside its English name.

data_dictionary.qmd
```{r}
#| echo: false

# Store the description of variables
vars <- c(
  species = "a character string denoting penguin species",
  island = "a character string denoting island in Palmer Archipelago, Antarctica",
  bill_length_mm = "a number denoting bill length (millimeters)",
  bill_depth_mm = "a number denoting bill depth (millimeters)",
  flipper_length_mm = "an integer denoting flipper length (millimeters)",
  body_mass_g = "an integer denoting body mass (grams)",
  sex = "a character string denoting penguin sex",
  year = "an integer denoting the study year"
)

# Store the description of variable values
vals <- list(
  species = c(
    Adelie = "Pygoscelis adeliae",
    Gentoo = "Pygoscelis papua",
    Chinstrap = "Pygoscelis antarcticus"
  )
)
```

Then, load the data and use datawizard to add the descriptions to the data.frame (see footnote 2):

data_dictionary.qmd
```{r}
#| echo: false

dat <- read.csv("data.csv")

for (x in names(vars)) {
  if (x %in% names(vals)) {
    dat <- datawizard::assign_labels(
      dat,
      select = I(x),
      variable = vars[[x]],
      values = vals[[x]]
    )
  } else {
    dat <- datawizard::assign_labels(
      dat,
      select = I(x),
      variable = vars[[x]]
    )
  }
}
```
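To illustrate why the data file itself stays unchanged, here is a base-R sketch of the general idea of attaching descriptions to columns as attributes. It is not datawizard's implementation; the attribute name `label` and the tiny stand-in data are assumptions made for illustration.

```r
# Small stand-in for the data set (values for illustration only)
dat_demo <- data.frame(
  species = c("Adelie", "Gentoo"),
  year = c(2007L, 2008L)
)

vars_demo <- c(
  species = "a character string denoting penguin species",
  year = "an integer denoting the study year"
)

# Attach each description as an attribute of the corresponding column
for (x in names(vars_demo)) {
  attr(dat_demo[[x]], "label") <- vars_demo[[x]]
}

# The description travels with the in-memory column ...
attr(dat_demo$species, "label")

# ... but is lost once the data are written to CSV and read back
tmp <- tempfile(fileext = ".csv")
write.csv(dat_demo, tmp, row.names = FALSE)
attr(read.csv(tmp)$species, "label")
```

Because a CSV file has no place to store such attributes, the descriptions need to live in a separate document such as the data dictionary.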

Then, you can create and print the data dictionary, which contains the descriptions along with other information about each variable (e.g., the number of missing values).

data_dictionary.qmd
```{r}
#| echo: false
#| column: "body-outset"
#| classes: plain

datawizard::data_codebook(dat) |>
  datawizard::data_select(exclude = ID) |>
  datawizard::data_filter(N != "") |>
  datawizard::print_md()
```
dat (344 rows and 8 variables, 8 shown)

| Name | Label | Type | Missings | Values | Value Labels | N |
|---|---|---|---|---|---|---|
| species | a character string denoting penguin species | character | 0 (0.0%) | Adelie | Pygoscelis adeliae | 152 (44.2%) |
| | | | | Chinstrap | Pygoscelis antarcticus | 68 (19.8%) |
| | | | | Gentoo | Pygoscelis papua | 124 (36.0%) |
| island | a character string denoting island in Palmer Archipelago, Antarctica | character | 0 (0.0%) | Biscoe | | 168 (48.8%) |
| | | | | Dream | | 124 (36.0%) |
| | | | | Torgersen | | 52 (15.1%) |
| bill_length_mm | a number denoting bill length (millimeters) | numeric | 2 (0.6%) | [32.1, 59.6] | | 342 |
| bill_depth_mm | a number denoting bill depth (millimeters) | numeric | 2 (0.6%) | [13.1, 21.5] | | 342 |
| flipper_length_mm | an integer denoting flipper length (millimeters) | integer | 2 (0.6%) | [172, 231] | | 342 |
| body_mass_g | an integer denoting body mass (grams) | integer | 2 (0.6%) | [2700, 6300] | | 342 |
| sex | a character string denoting penguin sex | character | 11 (3.2%) | female | | 165 (49.5%) |
| | | | | male | | 168 (50.5%) |
| year | an integer denoting the study year | integer | 0 (0.0%) | 2007 | | 110 (32.0%) |
| | | | | 2008 | | 114 (33.1%) |
| | | | | 2009 | | 120 (34.9%) |
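The "Missings" column of such a codebook can be cross-checked by hand. The following base-R sketch counts missing values per column; the small data frame is a made-up stand-in, so its counts do not correspond to the table above.

```r
# Stand-in data with a few deliberately missing entries
dat_demo <- data.frame(
  bill_length_mm = c(39.1, NA, 40.3, 36.7),
  sex = c("male", "female", NA, NA)
)

# Number and percentage of missing values per column
n_missing <- colSums(is.na(dat_demo))
pct_missing <- round(100 * n_missing / nrow(dat_demo), 1)

data.frame(
  variable = names(n_missing),
  missing = n_missing,
  percent = pct_missing,
  row.names = NULL
)
```

Running the same two lines on the real data (`dat`) should reproduce the counts reported by the codebook.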

Depending on the type of data, it may also be necessary to describe sampling procedures (e.g., selection criteria), measurement instruments (e.g., questionnaires), appropriate weighting, already applied preprocessing steps, or contact information. In our case, as the data have already been published, we only store a reference to their source.

The data set is from the R package palmerpenguins. If you have it installed, you can use the function citation() to create such a reference:

citation("palmerpenguins", auto = TRUE) |>
  format(bibtex = FALSE, style = "text")

Without the package palmerpenguins installed, you can find a suggested citation on its website and add that to your data dictionary:

data_dictionary.qmd
Horst A, Hill A, Gorman K (2022). _palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data_. R package version 0.1.1, https://github.com/allisonhorst/palmerpenguins, <https://allisonhorst.github.io/palmerpenguins/>.

Finally, you can render the data dictionary by running the following:

Terminal
quarto render data_dictionary.qmd

This should create the file data_dictionary.html, which you can open and view in your web browser.

One could go even further by making the information machine-readable in a standardized way. We provide an optional example of that in Note 1. If you want to learn more about the sharing of research data, have a look at the tutorial “FAIR research data management”.

This example demonstrates how to store the title and description of the data set, the descriptions of the variables, and their valid values in a machine-readable way. We’ll reuse the descriptions we already created (see footnote 3) and add a few others.

First, store the title and description of the data set as a whole:

Console
table_info <- c(
  title = "penguins data set",
  description = "Size measurements for adult foraging penguins near Palmer Station, Antarctica"
)

As before, also provide a reference to the source.

Console
dat_source <- "https://allisonhorst.github.io/palmerpenguins/"

Next, create a list of the categorical variables’ valid values:

Console
valid_vals <- list(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  island = c("Torgersen", "Biscoe", "Dream"),
  sex = c("male", "female"),
  year = c(2007, 2008, 2009)
)
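A list of valid values is also useful for checking the data themselves. The sketch below, in base R, reports observed values that are not in the list; it uses a made-up stand-in data frame that deliberately contains one invalid species ("Emperor"), and it ignores missing values.

```r
valid_demo <- list(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  sex = c("male", "female")
)

# Stand-in data; "Emperor" is intentionally not a valid value
dat_demo <- data.frame(
  species = c("Adelie", "Gentoo", "Emperor"),
  sex = c("male", NA, "female")
)

# For each constrained column, collect observed values that are not valid
invalid <- lapply(names(valid_demo), \(x) {
  observed <- unique(dat_demo[[x]])
  setdiff(observed[!is.na(observed)], valid_demo[[x]])
})
names(invalid) <- names(valid_demo)

invalid
```

Columns with an empty result contain only valid (or missing) values.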

Finally, store the descriptions of the variables we already created earlier:

Console
# Store the description of variables
vars <- c(
  species = "a character string denoting penguin species",
  island = "a character string denoting island in Palmer Archipelago, Antarctica",
  bill_length_mm = "a number denoting bill length (millimeters)",
  bill_depth_mm = "a number denoting bill depth (millimeters)",
  flipper_length_mm = "an integer denoting flipper length (millimeters)",
  body_mass_g = "an integer denoting body mass (grams)",
  sex = "a character string denoting penguin sex",
  year = "an integer denoting the study year"
)

Generally, metadata are either stored embedded into the data or externally, for example, in a separate file. We will use the “frictionless data” standard, where metadata are stored separately. Another alternative would be RO-Crate.

Specifically, one can use the R package frictionless to create a schema which describes the structure of the data (see footnote 4). For the purpose of the following code, it is just a nested list that we edit to include our own information. We also explicitly record in the schema that missing values are stored in the data file as NA and that the data are licensed under CC0 1.0. Finally, the package is used to create a metadata file that contains the schema.

Console
# Install {frictionless} and the required dependency {stringi}
renv::install(c(
  "frictionless",
  "stringi"
))

# Read data and create schema
dat_filename <- "data.csv"
dat <- read.csv(dat_filename)
dat_schema <- frictionless::create_schema(dat)

# Add descriptions to the fields
dat_schema$fields <- lapply(dat_schema$fields, \(x) {
  c(x, description = vars[[x$name]])
})

# Record valid values
dat_schema$fields <- lapply(dat_schema$fields, \(x) {
  if (x[["name"]] %in% names(valid_vals)) {
    modifyList(x, list(constraints = list(enum = valid_vals[[x$name]])))
  } else {
    x
  }
})

# Define missing values
dat_schema$missingValues <- c("", "NA")

# Create package with license info and write it
dat_package <- frictionless::create_package() |>
  frictionless::add_resource(
    resource_name = "penguins",
    data = dat_filename,
    schema = dat_schema,
    title = table_info[["title"]],
    description = table_info[["description"]],
    licenses = list(list(
      name = "CC0-1.0",
      path = "https://creativecommons.org/publicdomain/zero/1.0/",
      title = "CC0 1.0 Universal"
    )),
    sources = list(list(
      title = "CRAN",
      path = dat_source
    ))
  )
frictionless::write_package(dat_package, directory = ".")

This creates the metadata file datapackage.json in the current directory. Make sure it is located in the same folder as data.csv, as together they comprise a data package.
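To see how editing the schema as a nested list works, the following sketch applies the same `lapply()` pattern to a hand-built list. The list only mimics the shape of a schema (a `fields` element holding one list per column); it is an assumption for illustration and was not generated by frictionless.

```r
# Hand-built stand-in for a schema (not generated by frictionless)
schema_demo <- list(
  fields = list(
    list(name = "species", type = "string"),
    list(name = "year", type = "integer")
  )
)

vars_demo <- c(
  species = "a character string denoting penguin species",
  year = "an integer denoting the study year"
)

# Append a description to every field, looked up by the field's name
schema_demo$fields <- lapply(schema_demo$fields, \(x) {
  c(x, description = vars_demo[[x$name]])
})

# Inspect the first field after editing
str(schema_demo$fields[[1]])
```

Each field keeps its existing entries and gains a `description` entry, which is what ends up in the metadata file.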

Having added the data and its documentation, one can view and record the utilized packages with renv, thus bringing the project into a consistent state:

Console
renv::status()
renv::snapshot()

Add Data Citation and Attribution

All data relied upon should be cited in the manuscript to allow for precise identification and access. From the “eight core principles of data citation” by Starr et al. (2015), licensed under CC0 1.0:

Principle 1 – Importance: “Data should be considered legitimate, citable products of research. Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.”

Principle 3 – Evidence: “In scholarly literature, whenever and wherever a claim relies upon data, the corresponding data should be cited.”

Principle 5 – Access: “Data citations should facilitate access to the data themselves and to such associated metadata, documentation, code, and other materials, as are necessary for both humans and machines to make informed use of the referenced data.”

Principle 7 – Specificity and Verifiability: “Data citations should facilitate identification of, access to, and verification of the specific data that support a claim. Citations or citation metadata should include information about provenance and fixity sufficient to facilitate verifying that the specific time slice, version and/or granular portion of data retrieved subsequently is the same as was originally cited.”

Now, it’s your turn to add an appropriate citation for the data set to the manuscript. Does your citation adhere to the principles above?

You can find an appropriate BibTeX entry on the package website or with the function citation() (see footnote 5):

citation("palmerpenguins", auto = TRUE) |>
  transform(key = "horst2020") |>
  toBibtex()

Bibliography.bib
@Manual{horst2020,
  title = {palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data},
  author = {Allison Horst and Alison Hill and Kristen Gorman},
  year = {2022},
  note = {R package version 0.1.1, 
https://github.com/allisonhorst/palmerpenguins},
  url = {https://allisonhorst.github.io/palmerpenguins/},
}
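If the package is not installed, a comparable entry can also be assembled by hand with `bibentry()` from the utils package, which ships with R. This is a sketch that copies the field values from the entry shown above; it is not how the tutorial generated its entry.

```r
# Build the reference manually from the values shown in the BibTeX entry
entry <- utils::bibentry(
  bibtype = "Manual",
  key = "horst2020",
  title = "palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data",
  author = c(
    utils::person("Allison", "Horst"),
    utils::person("Alison", "Hill"),
    utils::person("Kristen", "Gorman")
  ),
  year = "2022",
  note = "R package version 0.1.1, https://github.com/allisonhorst/palmerpenguins",
  url = "https://allisonhorst.github.io/palmerpenguins/"
)

# Convert to BibTeX lines that can be pasted into a .bib file
bib_lines <- utils::toBibtex(entry)
bib_lines
```

The exact line breaks of the result may differ slightly from the entry shown above.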

Copy the BibTeX entry to the file Bibliography.bib. Then, find the line in the manuscript that says “cite data here” and replace it with a sentence such as the following:

Manuscript.qmd
The analyzed data are by @horst2020.

Render the document to check that the citation is displayed properly.

Terminal
quarto render Manuscript.qmd

While citation happens in the manuscript for reasons of academic integrity and reproducibility, you may also need to provide attribution within your project folder to comply with any licenses. Even though the data file we use here does not require attribution, we recommend adding a short paragraph to LICENSE.txt:

LICENSE.txt
The penguins data stored in "data.csv" by Allison Horst, Alison Hill, and Kristen Gorman available from <https://allisonhorst.github.io/palmerpenguins/> are licensed under CC0 1.0: <https://creativecommons.org/publicdomain/zero/1.0/>

As before, if the license required adding the full license text, you would also need to copy it to the project folder (if not already in there).

Finally, you can go through the commit routine:

Terminal
git status
git add .
git commit -m "Add data"

References

Starr, J., Castro, E., Crosas, M., Dumontier, M., Downs, R. R., Duerr, R., Haak, L. L., Haendel, M., Herman, I., Hodson, S., Hourclé, J., Kratz, J. E., Lin, J., Nielsen, L. H., Nurnberger, A., Proell, S., Rauber, A., Sacchi, S., Smith, A., … Clark, T. (2015). Achieving human and machine accessibility of cited data in scholarly publications. PeerJ Computer Science, 1, e1. https://doi.org/10.7717/peerj-cs.1
Vilhuber, L. (2024). Creating reproducible packages when data are confidential (Version v20240913) [Computer software]. https://labordynamicsinstitute.github.io/reproducibility-confidential/; Zenodo. https://doi.org/10.5281/ZENODO.13927702

Footnotes

  1. For example, using Amnesia, ARX, sdcTools, or Synthpop.

  2. Note that the code provided does not alter the data file; no description will be added to data.csv. The descriptions are only added to a (temporary) copy of the data set within R to create the data dictionary.

  3. Unfortunately, the descriptions of values are not reused in this example, as they are not supported by the specification we are using.

  4. In June 2024, version 2 of the frictionless data standard was released. As of November 2024, the R package frictionless only supports the first version, though support for v2 is planned.

  5. Note that this function requires the respective package to be installed.