2) Documentation

Overview

Effective documentation is a crucial aspect of FAIR data management, ensuring that research data are not only well-organized but also easily findable, accessible, and reusable by others. In this section, we’ll delve into the importance of documentation by exploring how to create best practice documentation that supports the entire research data life cycle. We’ll cover the essential components like including metadata that provide context and description of the dataset, as well as README files that offer a concise introduction to the dataset. Additionally, we’ll discuss the role of codebooks for making all components of the dataset self-explanatory.

Metadata

Metadata are “data that provide information about other data”¹, describing the context, content, and structure of a dataset. They contain essential information about the dataset, such as authorship, creation context, contents, provenance and accessibility. Metadata enable to understand and work with the data effectively. By providing a clear understanding of the data’s origin, and meaning, metadata play a critical role in ensuring data quality, reproducibility, and FAIRness.

Essential questions that should be answered by metadata are for example:

Who created the dataset?
What do the data files contain?
When was the dataset generated?
Where was the dataset generated?
Why was the dataset generated?
How was the dataset generated?
What research is this dataset connected to?
Can this dataset be reused?

Tips on where to find metadata

Metadata can be found at several places:

Related publication: Most published datasets are accompanied by scientific papers that describe in detail what was done to create the dataset, what it consists of, and how it was analysed. While this can provide very detailed information, it makes it difficult to evaluate the dataset on its own. Ideally, you should provide enough metadata separate from your paper to allow others to assess whether they could reuse it.
Repository description: You will often be asked to provide metadata when you upload your dataset to a repository. The quality and amount of metadata you can provide varies from repository to repository. We covered repositories in Chapter 1), but remember that in most cases it is your responsibility to fill in the metadata fields as accurately and concisely as possible.
README file: Even before you publish your dataset, you should keep track of your metadata. This will prevent you from losing track of it, and will help inform others. A good way to do this is to create a README file. We look at README files in the next section.

Task 2.1: (~10 min)

Have a look at the repository of your (example) dataset.
Discuss: What metadata fields are provided here? And are all of them filled out (properly)?

Concentrate on the three following questions:

When was the dataset generated?
Where was the dataset generated?
Why was the dataset generated?

Could you find this information on the level of the repository?
Was everything readily understandable?

Solution: Example 1

All metadata questions have been answered here and different sources have been brought together.

Who created the dataset?
- Göckede, Mathias; Reum, Friedemann; Heimann, Martin
What do the data files contain?
- Atmospheric mixing ratios of CO2 and CH4 at Ambarchik, NE Siberia,
- Tabular data about measurements taken at the meteorological station for each month.
When was the dataset generated?
- The data were collected between August 2014 and December 2017.
- The dataset was published in 2023
Where was the dataset generated?
- At the meteorological station in Ambarchik. Ambarchik is located at the mouth of the Kolyma River, which opens to the East Siberian Sea (69.62N, 162.30E).
Why was the dataset generated?
- Unclear from the dataset alone
- From the related publication: Sparse data coverage in the Arctic hampers our understanding of its carbon cycle dynamics and our predictions of the fate of its vast carbon reservoirs in a changing climate. In this paper, we present accurate measurements of atmospheric carbon dioxide (CO2) and methane (CH4) dry air mole fractions at the new atmospheric carbon observation station Ambarchik, which closes a large gap in the atmospheric trace gas monitoring network in northeastern Siberia.
How was the dataset generated?
- The atmospheric carbon observation station Ambarchik consists of a 27 m tall tower with two air inlets and meteorological measurements, while the majority of the instrumentation is hosted in a rack inside a building. Atmospheric mole fractions of CH4 , CO2 , and H2O are measured by an analyzer based on the cavity ring-down spectroscopy (CRDS) technique (G2301, Picarro Inc.), which is calibrated against WMO-traceable reference gases at regular intervals.
What research is this dataset connected to?
- Reum, F., Göckede, M., Lavric, J. V., Kolle, O., Zimov, S., Zimov, N., Pallandt, M., and Heimann, M.: Accurate measurements of atmospheric carbon dioxide and methane mole fractions at the Siberian coastal site Ambarchik, Atmos. Meas. Tech., 12, 5717-5740, 2019.
- https://doi.org/10.5194/amt-12-5717-2019 (Not provided on dataset level)
Can this dataset be reused?
- Yes, under the conditions of the CC BY 4.0 License

Solution: Example 2

All metadata questions have been answered here and different sources have been brought together.

Who created the dataset?
- Tekles, Alexander; Auspurg, Katrin; Bornmann, Lutz
What do the data files contain?
- This repository contains the data and code that was used for generating the results presented in the paper “Same-gender citations do not indicate a substantial gender homophily bias”.
- Further information is unclear.
When was the dataset generated?
- The dataset was uploaded on February 16th 2022
- No information about the time frame of the data creation on the dataset level.
Where was the dataset generated?
- Not indicated on dataset level
Why was the dataset generated?
- Not indicated on dataset level
- From the related publication: When controlling for research topics at a high level of granularity, there is only little evidence for a gender homophily bias in citation decisions. Our study points out the importance of controlling structural aspects such as gendered specialization in research topics when investigating gender bias in science.
How was the dataset generated?
- Not indicated on dataset level
- From the related publication: The bibliometric data used in this paper are from an in-house database developed and maintained in cooperation with the Max Planck Digital Library (MPDL, Munich) and derived from the Science Citation Index Expanded (SCI-E), Social Sciences Citation Index (SSCI), Arts and Humanities Citation Index (AHCI) prepared by Clarivate Analytics (Philadelphia, Pennsylvania, USA).
What research is this dataset connected to?
- Article: “Same-gender citations do not indicate a substantial gender homophily bias”
- https://doi.org/10.1371/journal.pone.0274810 (Not provided on dataset level)
Can this dataset be reused?
- The dataset is under a CC BY-NC-SA 4.0 License, which makes its reuse highly restricted.

Solution: Example 3

All metadata questions have been answered here and different sources have been brought together.

Who created the dataset?
- Sarah Polk; Maike Kleemeyer (0000-0002-9388-5535)
What do the data files contain?
- This repository contains data and code for the manuscript entitled “Reliability of quantitative multiparameter maps is high for MT and PD but attenuated for R1 and R2* in healthy young adults”
- The data folder contains the extracted data from the 11 different regions of interest (gray matter, white matter, pars triangularis of the inferior frontal gyrus, pars opercularis of the inferior frontal gyrus, orbitofrontal cortex, anterior cingulate cortex, precuneus, poster middle temporal gyrus, caudate, putamen, and pallidum) for each parameter map (MTsat, PD, R1, R2*) for each day (one and two) and each session (1 and 2). Regions of interest were taken from the Harvard-Oxford atlas as included in FSL 5.0.
When was the dataset generated?
- The data repository was created on November 3, 2021 and last modified on March 12, 2022
- Further information on the time frame of the data creation is unclear on dataset level.
Where was the dataset generated?
- Not indicated on dataset level.
Why was the dataset generated?
- Not indicated on dataset level.
- From the related publication: In sum, in this sample of healthy young adults, the four MPM parameters showed very little variability over four measurements over two days but differed in how well they could assess between-person differences. We conclude that R1 and R2* might carry only limited person-specific information in samples of healthy young adults, and, by implication, might be of restricted utility for studying associations to between-person differences in behavior.
How was the dataset generated?
- Not indicated on dataset level.
- From the related publication: We acquired data from 15 volunteers, who each were assessed four times. On Day 1, participants were measured twice back-to-back, without repositioning between MPM datasets. On Day 2, participants were also measured twice, but this time with a break in between measurements, which afforded repositioning of the participants’ head.
What research is this dataset connected to?
- Article: “Reliability of quantitative multiparameter maps is high for MT and PD but attenuated for R1 and R2* in healthy young adults”
- https://doi.org/10.1101/2021.11.10.467254 (Not provided on dataset level)
Can this dataset be reused?
- The provided dataset does not have a license. Re-use is therefore unclear.

Solution: Example 4

All metadata questions have been answered here and different sources have been brought together.

Who created the dataset?
- Dominik Deffner; Benjamin (Kahl)
What do the data files contain?
- Not indicated on dataset level.
- From file structure: Data from 40 Group sessions and 40 solo sessions with relating scripts.
- From related Github repository: This repository contains the full data and scripts to reproduce all analyses and figures as well as the experimental game itself used in Deffner, D., Mezey, D., Kahl, B., Schakowski, A., Romanczuk, P., Wu, C. & Kurvers, R. (accepted, in principle, at Nature Communications): “Collective incentives reduce over-exploitation of social information in unconstrained human groups”
When was the dataset generated?
- From file structure: Sessions were between July 5th, 2022 and September 21, 2022.
- The dataset was published on February 12, 2024.
Where was the dataset generated?
- Not indicated on dataset level.
- From the related publication: At the Max Planck Institute for Human Development
Why was the dataset generated?
- Not indicated on dataset level.
- From the related publication: Collective dynamics emerge from countless individual decisions. Yet, we poorly understand the processes governing dynamically-interacting individuals in human collectives under realistic conditions. We present a naturalistic immersive-reality experiment where groups of participants searched for rewards in different environments, studying how individuals weigh personal and social information and how this shapes individual and collective outcomes.
How was the dataset generated?
- Not indicated on dataset level.
- From related publication: 160 participants were recruited from the Max Planck Institute for Human Development (MPIB) recruitment pool and invited in anonymous groups of four to the behavioral laboratory at the MPIB in Berlin, Germany (63 identified as male, 97 as female; Mage = 28.5, SDage = 6.4 years; all were proficient in German and most came from Western, educated, industrialized, rich, and democratic societies70,71). Fourty additional participants were recruited for an individual control condition (14 identified as male, 26 as female; Mage = 29.8, SDage = 5.7 years). See further section: “Procedure”
What research is this dataset connected to?
- Not indicated on dataset level.
- From related GitHub Repository: Deffner, D., Mezey, D., Kahl, B., Schakowski, A., Romanczuk, P., Wu, C. & Kurvers, R. (accepted, in principle, at Nature Communications): “Collective incentives reduce over-exploitation of social information in unconstrained human groups”
- https://doi.org/10.1038/s41467-024-47010-3 (Not provided on dataset level)
Can this dataset be reused?
- The dataset is under a CC BY 4.0, which would allow re-use under the given conditions.
- However, the GitHub repository is connected to an Apache License 2.0, which makes it difficult to know which license applies.

Excursion: Community Standards

Community standards for metadata allow for improved comparison between datasets. Some disciplines have agreed-upon formal standards that define how metadata should be documented. If you publish your data via a discipline specific repository, it is likely that their entry form already guides you through the community metadata standard. You can look for metadata standards that are specific to your discipline e.g. via FAIRsharing.org. If a standard for your research field exists, you should follow it as good as possible.

Even if your discipline does not adhere to strict standards, there is an advantage in using generic standards. By using controlled vocabularies or ontologies for your metadata you can increase the machine-readability and -actionability of your data.

Machine-readability refers to the ability of machines (computers, algorithms, etc.) to automatically understand and interpret the meaning of metadata, such as keywords, descriptions, and concepts, without human intervention. Machine-actionability takes it a step further, enabling machines to not only understand the metadata but also to perform actions or take decisions based on that metadata, such as data integration, filtering, or visualization, without human intervention.

README file

We have already noted that metadata and README files are closely related. They are a great way to introduce and document a dataset and its research context. A README file is a plain text file, often written in markdown and then named ‘README.md’. This naming convention makes it easier for humans and machines to find the README file; for example, in GitHub repositories, the README.md file is automatically identified and its contents displayed on the repository’s main page. The README file should be one of the first files you create for a new project. You can then fill it in with the relevant information.

What should a README file contain? - Ideally, all the answers to the questions in the Metadata section should be included in the README file. To help you decide what to include in your README file, it may be helpful to review the questions in a (Research) Data Management Plan. While your README should be very descriptive, it is also important to keep it concise and clear. For example, avoid using jargon to improve interdisciplinary use.

The README file is most important after someone has downloaded your dataset for possible reuse. Imagine that their data management is sloppy and they lose or forget the link between the downloaded files and your dataset in the repository or the paper you published. The README file included with your dataset ensures that your work can still be understood and that you can still be credited.

README Template

This readme file was generated on [YYYY-MM-DD] by [NAME]

# GENERAL INFORMATION

Title of Dataset:
Description: <provide a short description of the study/project>

Author/Principal Investigator Information
Name:
ORCID:
Institution:
Email:

Date of data collection: <provide single date, range, or approximate date; suggested format YYYY-MM-DD>

Geographic location of data collection: <provide latitude, longitude, or city/region, State, Country>

Information about funding sources that supported the collection of the data:

# SHARING/ACCESS Data

Licenses/restrictions placed on the data:

Links to publications that cite or use the data:

Links to other publicly accessible locations of the data:

Links/relationships to ancillary datasets:

Was data derived from another source? If yes, list source(s):

# DATA & FILE OVERVIEW

File List: <list all files (or folders, as appropriate for dataset organization) contained in the dataset, with a brief description>

Relationship between files, if important:

Additional related data collected that was not included in the current data package:

# METHODOLOGICAL INFORMATION

Description of methods used for collection/generation of data:

Methods for processing the data:

Instrument- or software-specific information needed to interpret the data: <include full name and version of software, and any necessary packages or libraries needed to run scripts>

Standards and calibration information, if appropriate:

Environmental/experimental conditions:

Describe any quality-assurance procedures performed on the data:

People involved with sample collection, processing, analysis and/or submission:

Task 2.2: (~10 min)

We provided you with a template README file. You can copy and paste it into a text editor of your choice to use for your own future projects.

Have a look at the README file of your example dataset:

Where does the provided information of the example README differ from the template below? Count the sections of the template that you can answer solely from the provided README. (The sections will be marked as **bold** in the example solution files.)
How many of the 19 open fields of the template could you answer based on your previous solutions?

Discuss: What information can be found in the README example that is not included in the template but helps to understand the dataset?

For brevity reasons we exclude the last section “METHODOLOGICAL INFORMATION” from the provided example solutions.

Tips for your own README

For your own research project you can, of course, adapt this template to include the information that is relevant to your own dataset. Also look out for other projects in your discipline and what they include in their README files.

You might also suggest to your department/research group/institute/research community to set up a discipline specific README template that can improve the interoperability, reusability and comparability of your research projects.

Solution: Example 1

1 section can be filled based on the Example dataset README alone.
Excluding the “METHODOLOGICAL INFORMATION” section, ~12 of the 19 open fields can be filled based on the information collected in task 2.1.
Find the filled out README template for example 1 here

Solution: Example 2

2 sections can be filled based on the Example dataset README alone.
Excluding the “METHODOLOGICAL INFORMATION” section, ~10 of the 19 open fields can be filled based on the information collected in task 2.1.
Find the filled out README template for example 2 here

Solution: Example 3

5 sections can be filled based on the Example dataset README alone.
Excluding the “METHODOLOGICAL INFORMATION” section, ~10 of the 19 open fields can be filled based on the information collected in task 2.1.
Find the filled out README template for example 3 here
Note: Only the README file for the OSF repository was taken into consideration.

Solution: Example 4

4 sections can be filled based on the Example dataset README alone.
Excluding the “METHODOLOGICAL INFORMATION” section, ~11 of the 19 open fields can be filled based on the information collected in task 2.1.
Find the filled out README template for example 4 here

Codebook

While general metadata and README information typically focus on the dataset or project as a whole, codebooks delve deeper, providing detailed explanations and clarifications of the specific data contained in individual files. A codebook should make all components of a dataset file self-explanatory. For example, it helps to identify what certain variables and value labels mean, or in what format the data points were collected. This increases the FAIRness of the dataset by allowing it to be reused regardless of the availability of an associated research paper. It also reduces the pressure to explain all of these details in a paper to ensure reproducibility. You can keep your explanation of the data short by referring to the codebook instead.

Some people put this information in the README file. However, this can significantly increase the length of the README. We recommend that you keep README files short to maintain readability.

If all of your data files are structured and labelled according to the same system, you only need to provide one codebook that describes and explains this system. If you have differently structured files, you should provide multiple codebooks.

Here are some of the elements that can be included in a codebook:

variables
value labels
measuring units
data types
data formats
data dictionary
data quality notes
technical info

Basic level:

The basic level is to provide this information about your data in a .csv table, .txt file or .md (markdown) file. These are easy to create using a text editor of your choice. They can be opened by anyone, independent of proprietary systems, and are easily readable by humans.

Example of a simple .csv codebook

“variable”,“data_type”,“format”
“species”,“string”,“Adelie,Chinstrap,Gentoo”
“population_size”,“integer”,“0-100000”
“location”,“string”,“Latitude, Longitude”
“date”,“string”,“YYYY-MM-DD”
“latitude”,“number”,“-90 to 90”
“longitude”,“number”,“-180 to 180”

Advanced level:

Depending on the complexity of your data, you can improve your codebook by providing it as a JSON file. This has the advantage to be machine-readable and -actionable. It makes it easier to extract this information and work with it in coding scripts.

Example of a detailed JSON code book

{
  "title": "Penguin Populations Code book",
  "description": "Code book for the Penguin Populations dataset",
  "variables": [
    {
      "name": "species",
      "description": "The species of penguin",
      "data_type": "string",
      "format": ["Adelie", "Chinstrap", "Gentoo"]
    },
    {
      "name": "population_size",
      "description": "The number of penguins in the population",
      "data_type": "integer",
      "format": "0-100000"
    },
    {
      "name": "location",
      "description": "The region or location where the population is found",
      "data_type": "string",
      "format": "Latitude, Longitude"
    },
    {
      "name": "date",
      "description": "The date when the population size was recorded",
      "data_type": "string",
      "format": "YYYY-MM-DD"
    },
    {
      "name": "latitude",
      "description": "The latitude of the location",
      "data_type": "number",
      "format": "-90 to 90"
    },
    {
      "name": "longitude",
      "description": "The longitude of the location",
      "data_type": "number",
      "format": "-180 to 180"
    }
  ],
  "data_dictionary": {
    "species": {
      "Adelie": "A small to medium-sized penguin species found in Antarctica",
      "Chinstrap": "A medium-sized penguin species found in the Antarctic and sub-Antarctic",
      "Gentoo": "A large penguin species found in the Antarctic and sub-Antarctic"
    }
  },
  "data_quality_notes": [
    "Population sizes are estimates and may not reflect the actual number of penguins in the population.",
    "Locations are approximate and may not reflect the exact location of the population.",
    "Dates are in the format YYYY-MM-DD and are in the UTC timezone."
  ],
  "technical_info": {
    "format": "JSON",
    "compression": "gzip",
    "required_software": "JSON parser"
  }
}

Task 2.3: (~10 min)

Open one of the data files of your example dataset.

Can you make sense of the data provided?
Note down difficulties you identify.

Discuss: Can you find supportive information to understand the provided data in other documents? Make a note of them.

Solution: Example 1

The column names within the data files are not very descriptive. However, all data files work with the same columns.
Explanations of the column names, variables, as well as data types an formats are given in the example’s README file.

Solution: Example 2

It is a very large dataset and files are compressed. Explanation on how to extract the data is provided.
Explanations for the column names of the different files are provided in the example’s README file.

Solution: Example 3

The data files can be open directly in the OSF repository. The column names are not very descriptive.
However each data file has a corresponding codebook in .json format that explains each of the columns.
The scripts, how to deploy them and how to generate result from the data is explained in the README file.

Solution: Example 4

The data files are all .log files that can be opened with a standard editor as ;-separated tables.
Columns are named relatively clear.
The scripts, how to deploy them and how to generate result from the data is explained in the README-file.

Further Resources:

“Documentation and Metadata”, The Turing Way, accessed 7. July 2024. https://book.the-turing-way.org/reproducible-research/rdm/rdm-metadata

Footnotes

„metadata“. Merriam-Webster Dictionary, accessed 7. July 2024, https://www.merriam-webster.com/dictionary/metadata.↩︎