Introduction
This section briefly introduces the concept of research compendia.
The Importance of Sharing
Suppose you are reading an article about a new imaging method that turns seismological data into subsurface images. The article describes the ideas that went into developing this method and presents a few examples to illustrate its superiority over previous approaches. You become interested and would like to apply the method to your own data. However, with only the article available, it could take months to come up with a working solution, if it is possible at all. This situation has been put aptly by Buckheit & Donoho (1995, p. 59), distilling an idea by the geophysicist Jon Claerbout:
“An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.”
Even when researchers merely apply existing methods (rather than report on a new method), sharing the source code and being transparent about the computational environment is imperative to making their results reproducible (Ince et al., 2012). By reproducibility, we mean “obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis” (National Academies of Sciences, Engineering, and Medicine, 2019, p. 46).
Linking Results and Computations
This tutorial covers not only sharing the source code, but also connecting it to the results through the creation of dynamic documents. Instead of being copied manually, numerical results, figures, and tables are inserted automatically when the article is rendered. A dynamic document bundled together with any necessary data and auxiliary software is called a research compendium (Gentleman & Temple Lang, 2007).
The practice of interleaving narrative text with code has its roots in the paradigm of literate programming, where documentation and source code are treated as equals and are arranged to maximize understanding (Knuth, 1984). Alternating text and code can also be found in notebook interfaces for exploratory programming (e.g., Wolfram Mathematica or Jupyter Notebooks, see Kluyver et al., 2016), which can additionally execute code and embed its output. With Sweave (Leisch, 2002), ideas from both worlds, literate programming and embedding program output, were combined into one tool for rendering dynamic documents using the R programming language. It is the predecessor of the R package knitr (Xie, 2015), which is used under the hood in this tutorial.¹
Linking results with their computations has benefits for both authors and readers. For authors, articles always contain the most recent version of figures, as they are updated automatically when the computation changes. Readers, in turn, can trace exactly how a particular result was obtained, provided they have access to the underlying research compendium.
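The principle behind a dynamic document can be illustrated with a minimal, language-agnostic sketch. The Python snippet below is not how knitr or Quarto work internally; it only assumes a hypothetical article text with placeholders (`ARTICLE`) and a hypothetical `render` function that computes the results and inserts them into the text.

```python
from statistics import mean
from string import Template

# Hypothetical article text; the $placeholders stand for computed results.
ARTICLE = Template(
    "The data set contains $n measurements. The mean value was $mean_value."
)


def render(measurements):
    """Compute the results and insert them into the article text."""
    results = {
        "n": len(measurements),
        "mean_value": f"{mean(measurements):.2f}",
    }
    return ARTICLE.substitute(results)


# Rendering again after the data changed updates the text automatically.
print(render([2.0, 4.0, 6.0]))
print(render([2.0, 4.0, 6.0, 8.0]))
```

Because the numbers in the text are derived from the data at render time, no reported value can fall out of date with the computation that produced it.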
Best Practices
When creating a research compendium, there are a few things to consider (by Arguillas et al., 2022, licensed under CC BY 4.0):
Does the research compendium contain everything needed to reproduce a predefined outcome in an organized and parsimonious way?
- Completeness: The research compendium contains all of the objects needed to reproduce a predefined outcome.
- Organization: It is easy to understand and keep track of the various objects in the research compendium and their relationship over time.
- Economy: Fewer extraneous objects in the compendium mean fewer things that can break and require less maintenance over time.
Is descriptive information about the research compendium and its components available and easy to understand?
- Transparency: The research compendium provides full disclosure of the research process that produced the scientific claim.
- Documentation: Information describing compendium objects is provided in enough detail to enable independent understanding and use of the compendium.
Is information about how the research compendium and its components can be used available and easy to understand?
- Access: It is clear who can use what, how, and under what conditions, with open access preferred.
- Provenance: The origin of the components of the research compendium and how each has changed over time is evident.
Is information about the research compendium and its components embedded in code?
- Metadata: Information about the research compendium and its components is embedded in a standardized, machine-readable code.
- Automation: As much as possible, the computational workflow is script- or workflow-based so that the workflow can be re-executed using minimal actions.
Is there a plan for reviewing the research compendium for FAIR and computational reproducibility standards over time?
- Review: A series of managed activities are needed to ensure continued access to and functionality of the research compendium and its components for as long as necessary.
Although this tutorial guides you through the creation of a research compendium, you are invited to revisit these questions after completion and check whether and how each point was addressed (or not). Further, you can consult them as a checklist for future projects.
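As an illustration of the Automation point in the list above, the following sketch shows one possible way to make a workflow re-executable with a single action. All function names and the data are hypothetical stand-ins, and the sketch is written in Python for brevity; it is not the workflow used in this tutorial.

```python
def load_data():
    # Stand-in for reading the raw data shipped with the compendium.
    return [3.1, 2.9, 3.4, 3.0]


def analyze(data):
    # Stand-in for the actual analysis.
    return {"n": len(data), "maximum": max(data)}


def report(results):
    # Stand-in for rendering the article from the computed results.
    return f"n = {results['n']}, maximum = {results['maximum']}"


def run_all():
    """Single entry point that re-executes every step in a fixed order."""
    data = load_data()
    results = analyze(data)
    return report(results)


if __name__ == "__main__":
    print(run_all())
```

With a single entry point such as `run_all`, reproducing the outcome does not depend on remembering which steps to run by hand or in which order.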
References
Footnotes
¹ Specifically, Quarto employs knitr to execute chunks of R code.