Note: This is an add-on to the Chapter “Add Data and Data Dictionary”. It describes how you can (a) automatically generate data dictionaries with an R package, and (b) create machine-readable documentation of your data.
Automatic Generation of Data Dictionaries
First, we will demonstrate how to create a simple data dictionary using the R package datawizard. We will use the penguin data set which is introduced in the Chapter “Add Data and Data Dictionary”. You can download it and put it into your project folder:
You can install the datawizard package into your renv environment using:
Console
renv::install("datawizard")
We create a separate Quarto file for the data dictionary. Create it by clicking on File > New File > Quarto Document…. Choose a title such as Data Dictionary, select HTML as format, uncheck the use of the visual markdown editor, and click on Create. Remove everything except the YAML header (between the ---). To make the HTML file self-contained, also set embed-resources: true such that the YAML header looks as follows:
Then, save it as data_dictionary.qmd by clicking on File > Save.
To create the actual data dictionary, first write a description for every column so that others can understand what the variable names mean. Where necessary, also document the values a variable can take. This is especially important if their meaning is not obvious. In the following, we demonstrate this by storing the penguins’ binomial names along with their English names.
data_dictionary.qmd
```{r}
#| echo: false

# Store the description of variables
vars <- c(
  species = "a character string denoting penguin species",
  island = "a character string denoting island in Palmer Archipelago, Antarctica",
  bill_length_mm = "a number denoting bill length (millimeters)",
  bill_depth_mm = "a number denoting bill depth (millimeters)",
  flipper_length_mm = "an integer denoting flipper length (millimeters)",
  body_mass_g = "an integer denoting body mass (grams)",
  sex = "a character string denoting penguin sex",
  year = "an integer denoting the study year"
)

# Store the description of variable values
vals <- list(
  species = c(
    Adelie = "Pygoscelis adeliae",
    Gentoo = "Pygoscelis papua",
    Chinstrap = "Pygoscelis antarcticus"
  )
)
```
Then, load the data and use datawizard to add the descriptions to the data frame [1]:
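The code for this step is not shown here. As a minimal sketch, assuming that the descriptions are attached with `datawizard::assign_labels()`, the following shows the idea on a small hypothetical data frame that stands in for data.csv. The toy rows are made up for illustration; the label texts are the ones stored above.

```r
# Hypothetical mini data set standing in for data.csv (made-up rows)
dat <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap", "Adelie"),
  body_mass_g = c(3750L, 5000L, 3800L, NA)
)

# Attach a variable description and value descriptions to a categorical column
dat <- datawizard::assign_labels(
  dat,
  select = "species",
  variable = "a character string denoting penguin species",
  values = c(
    Adelie = "Pygoscelis adeliae",
    Gentoo = "Pygoscelis papua",
    Chinstrap = "Pygoscelis antarcticus"
  )
)

# Attach only a variable description to a numeric column
dat <- datawizard::assign_labels(
  dat,
  select = "body_mass_g",
  variable = "an integer denoting body mass (grams)"
)

# The description is stored as an attribute of the column
attr(dat$species, "label")
```

Because the labels live in attributes of the in-memory data frame, the data file on disk stays unchanged.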
Then, you can create the data dictionary, which contains the descriptions as well as some other information about each variable (e.g., the number of missing values), and print it.
| Name | Label | Type | Missings | Values | N |
|---|---|---|---|---|---|
| island | a character string denoting island in Palmer Archipelago, Antarctica | character | 0 (0.0%) | Biscoe | 168 (48.8%) |
| | | | | Dream | 124 (36.0%) |
| | | | | Torgersen | 52 (15.1%) |
| bill_length_mm | a number denoting bill length (millimeters) | numeric | 2 (0.6%) | [32.1, 59.6] | 342 |
| bill_depth_mm | a number denoting bill depth (millimeters) | numeric | 2 (0.6%) | [13.1, 21.5] | 342 |
| flipper_length_mm | an integer denoting flipper length (millimeters) | integer | 2 (0.6%) | [172, 231] | 342 |
| body_mass_g | an integer denoting body mass (grams) | integer | 2 (0.6%) | [2700, 6300] | 342 |
| sex | a character string denoting penguin sex | character | 11 (3.2%) | female | 165 (49.5%) |
| | | | | male | 168 (50.5%) |
| year | an integer denoting the study year | integer | 0 (0.0%) | 2007 | 110 (32.0%) |
| | | | | 2008 | 114 (33.1%) |
| | | | | 2009 | 120 (34.9%) |
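A data dictionary of this kind can be sketched with `datawizard::data_codebook()`. The example below is only an illustration under stated assumptions: the four-row data frame is hypothetical, and the description is attached through the base R `label` attribute, which is assumed here to be the attribute the codebook reads.

```r
# Hypothetical mini data set (made-up rows) with one missing value
dat <- data.frame(
  island = c("Biscoe", "Dream", "Torgersen", "Dream"),
  bill_length_mm = c(39.1, NA, 46.5, 50.0)
)

# Store the variable descriptions as "label" attributes
attr(dat$island, "label") <-
  "a character string denoting island in Palmer Archipelago, Antarctica"
attr(dat$bill_length_mm, "label") <-
  "a number denoting bill length (millimeters)"

# Build the codebook: one block per variable with type, missings and values
codebook <- datawizard::data_codebook(dat)
print(codebook)
```

Inside data_dictionary.qmd, such a call would go into an R chunk so that the table appears in the rendered HTML file.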
Depending on the type of data, it may also be necessary to describe sampling procedures (e.g., selection criteria), measurement instruments (e.g., questionnaires), appropriate weighting, already applied preprocessing steps, or contact information. In our case, as the data has already been published, we only store a reference to its source.
The data set is from the R package palmerpenguins. If you have it installed, you can use the function citation() to create such a reference:
citation("palmerpenguins", auto = TRUE) |> format(bibtex = FALSE, style = "text")
Without the package palmerpenguins installed, you can find a suggested citation on its website and add that to your data dictionary:
data_dictionary.qmd
Horst A, Hill A, Gorman K (2022). _palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data_. doi:10.32614/CRAN.package.palmerpenguins <https://doi.org/10.32614/CRAN.package.palmerpenguins>, R package version 0.1.1, <https://CRAN.R-project.org/package=palmerpenguins>.
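The two cases can also be combined in a single chunk. The following is a minimal sketch, not part of the original tutorial: it uses the installed package if available and otherwise falls back to the reference text shown above. The object names `fallback_ref` and `ref` are chosen for illustration only.

```r
# Reference text copied from the package website, used as a fallback
fallback_ref <- paste(
  "Horst A, Hill A, Gorman K (2022).",
  "_palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data_.",
  "doi:10.32614/CRAN.package.palmerpenguins",
  "<https://doi.org/10.32614/CRAN.package.palmerpenguins>,",
  "R package version 0.1.1,",
  "<https://CRAN.R-project.org/package=palmerpenguins>."
)

# Prefer the citation generated from the installed package, if there is one
ref <- if (requireNamespace("palmerpenguins", quietly = TRUE)) {
  citation("palmerpenguins", auto = TRUE) |>
    format(bibtex = FALSE, style = "text")
} else {
  fallback_ref
}

cat(ref, sep = "\n")
```

This keeps the data dictionary renderable on machines where palmerpenguins is not installed.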
Finally, you can render the data dictionary by running the following:
Terminal
quarto render data_dictionary.qmd
This should create the file data_dictionary.html, which you can open and view in your web browser.
If you want to learn more about the sharing of research data, have a look at the tutorial “FAIR research data management”.
Create Machine-Readable Variable Documentation
One could go even further by making the information machine-readable in a standardized way.
This section demonstrates how the title and description of the data set, the descriptions of the variables, and their valid values can be stored in a machine-readable way. We’ll reuse the descriptions we already created [2] and add a few others.
First, store the title and description of the data set as a whole:
Console
table_info <- c(
  title = "penguins data set",
  description = "Size measurements for adult foraging penguins near Palmer Station, Antarctica"
)
As before, also provide a reference to the source.
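The code for this step is not shown here. A minimal sketch, assuming that the reference is stored as a URL in an object named `dat_source` (the name used in the schema code further below) and that the CRAN page from the citation above serves as the source:

```r
# Assumption: the source is recorded as the URL of the package's CRAN page
dat_source <- "https://CRAN.R-project.org/package=palmerpenguins"
```

The DOI link from the citation would be an equally plausible choice.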
Finally, store the descriptions of the variables we already created earlier:
Console
# Store the description of variables
vars <- c(
  species = "a character string denoting penguin species",
  island = "a character string denoting island in Palmer Archipelago, Antarctica",
  bill_length_mm = "a number denoting bill length (millimeters)",
  bill_depth_mm = "a number denoting bill depth (millimeters)",
  flipper_length_mm = "an integer denoting flipper length (millimeters)",
  body_mass_g = "an integer denoting body mass (grams)",
  sex = "a character string denoting penguin sex",
  year = "an integer denoting the study year"
)
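The schema code further below also refers to an object `valid_vals`, which is not defined in this section. A minimal sketch, assuming it is a named list that holds the allowed values for each categorical variable; the values are taken from the data dictionary shown earlier:

```r
# Assumption: one entry per variable whose values are restricted to a fixed set
valid_vals <- list(
  species = c("Adelie", "Chinstrap", "Gentoo"),
  island = c("Biscoe", "Dream", "Torgersen"),
  sex = c("female", "male"),
  year = 2007:2009
)
```

Variables that are not listed, such as the measurements, simply get no such constraint.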
Generally, metadata are either stored embedded into the data or externally, for example, in a separate file. We will use the “frictionless data” standard, where metadata are stored separately. Another alternative would be RO-Crate.
Specifically, one can use the R package frictionless to create a schema which describes the structure of the data. [3] For the purpose of the following code, it is just a nested list that we edit to include our own information. We also explicitly record in the schema that missing values are stored in the data file as NA and that the data are licensed under CC0 1.0. Finally, the package is used to create a metadata file that contains the schema.
Console
# Install {frictionless} and the required dependency {stringi}
renv::install(c("frictionless", "stringi"))

# Note: `valid_vals` (a named list of the valid values per variable) and
# `dat_source` (the reference to the data source) must already be defined

# Read data and create schema
dat_filename <- "data.csv"
dat <- read.csv(dat_filename)
dat_schema <- frictionless::create_schema(dat)

# Add descriptions to the fields
dat_schema$fields <- lapply(dat_schema$fields, \(x) {
  c(x, description = vars[[x$name]])
})

# Record valid values
dat_schema$fields <- lapply(dat_schema$fields, \(x) {
  if (x[["name"]] %in% names(valid_vals)) {
    modifyList(x, list(constraints = list(enum = valid_vals[[x$name]])))
  } else {
    x
  }
})

# Define missing values
dat_schema$missingValues <- c("", "NA")

# Create package with license info and write it
dat_package <- frictionless::create_package() |>
  frictionless::add_resource(
    resource_name = "penguins",
    data = dat_filename,
    schema = dat_schema,
    title = table_info[["title"]],
    description = table_info[["description"]],
    licenses = list(list(
      name = "CC0-1.0",
      path = "https://creativecommons.org/publicdomain/zero/1.0/",
      title = "CC0 1.0 Universal"
    )),
    sources = list(list(
      title = "CRAN",
      path = dat_source
    ))
  )
frictionless::write_package(dat_package, directory = ".")
This creates the metadata file datapackage.json in the current directory. Make sure it is located in the same folder as data.csv, as together they comprise a data package.
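To check that such a data package can be read back, the round trip can be sketched as follows. This is an illustration under stated assumptions, not part of the original workflow: it works in temporary folders with a hypothetical three-row stand-in for data.csv so that no project files are touched, and it reduces the steps above to the essentials.

```r
library(frictionless)

# Temporary folders so that no real project files are touched
src_dir <- file.path(tempdir(), "penguins_src")
pkg_dir <- file.path(tempdir(), "penguins_pkg")
dir.create(src_dir, showWarnings = FALSE)
dir.create(pkg_dir, showWarnings = FALSE)

# Hypothetical stand-in for data.csv (made-up rows);
# write.csv() stores the missing value as NA
csv_path <- file.path(src_dir, "data.csv")
toy <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  body_mass_g = c(3750L, NA, 3800L)
)
write.csv(toy, csv_path, row.names = FALSE)

# Create the schema, declare the missing value codes, and write the package
toy_schema <- create_schema(read.csv(csv_path))
toy_schema$missingValues <- c("", "NA")
create_package() |>
  add_resource(resource_name = "penguins", data = csv_path, schema = toy_schema) |>
  write_package(directory = pkg_dir)

# Read the package back from its metadata file and load the resource
pkg <- read_package(file.path(pkg_dir, "datapackage.json"))
dat_check <- read_resource(pkg, "penguins")
str(dat_check)
```

With the real files, only the last three lines would be needed, pointing `read_package()` at the datapackage.json in the project folder.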
Footnotes
1. Note that the code provided does not alter the data file: no description will be added to data.csv. The descriptions are only added to a (temporary) copy of the data set within R to create the data dictionary.
2. Unfortunately, the descriptions of values are not reused in this example, as they are not supported by the specification we are using.
3. In June 2024, version 2 of the frictionless data standard was released. As of November 2024, the R package frictionless only supports the first version, though support for v2 is planned.