Introduction

In the following, we will provide a brief introduction to the concept of research compendia.

The Importance of Sharing

Suppose you are reading an article about a new imaging method for turning seismological data into subsurface images. The article describes the ideas that went into developing this method and presents a few examples to illustrate its superiority over previous approaches. You get interested and would like to apply the method to your own data. However, with only the article available, it could take months to come up with a working solution, if it is possible at all. Buckheit & Donoho (1995, p. 59) put this situation aptly, distilling an idea by the geophysicist Jon Claerbout:

“An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.”

Even when researchers merely apply existing methods (rather than report on a new method), sharing the source code and being transparent about the computational environment is imperative to making their results reproducible (Ince et al., 2012). By reproducibility, we mean “obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis” (National Academies of Sciences, Engineering, and Medicine, 2019, p. 46).

Linking Results and Computations

This tutorial covers not only sharing the source code, but also connecting it to the results through the creation of dynamic documents. Rather than being copied manually, numerical results, figures, and tables are inserted automatically when the article is rendered. A dynamic document bundled together with any necessary data and auxiliary software is called a research compendium (Gentleman & Temple Lang, 2007).

The practice of interleaving narrative text with code has its roots in the paradigm of literate programming, in which documentation and source code are treated as equals and arranged to maximize understanding (Knuth, 1984). Alternating text and code can also be found in notebook interfaces for exploratory programming (e.g., Wolfram Mathematica or Jupyter Notebooks, see Kluyver et al., 2016), which additionally have the capability to execute code and embed its output. With Sweave (Leisch, 2002), ideas from both worlds – literate programming and embedding program output – were combined into one tool for rendering dynamic documents using the R programming language. It is the predecessor of the R package knitr (Xie, 2015), which is used under the hood in this tutorial.1
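As a loose sketch of what such a dynamic document looks like (not taken from this tutorial; the file contents, variable names, and numbers are hypothetical), a minimal Quarto source file could read:

````markdown
---
title: "A minimal dynamic document"
format: html
---

```{r}
# Hypothetical computation; its result is embedded in the text below
mean_depth <- mean(c(2.1, 3.4, 2.8))
```

The mean depth of the analyzed layers is `r round(mean_depth, 2)` km.
This number is recomputed and re-inserted every time the document is
rendered, so the text cannot drift out of sync with the analysis.
````

During rendering, knitr evaluates the R chunk and the inline `r …` expression, so changing the input data automatically updates the reported value and any figures or tables produced by the chunks.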

Linking results with their computations has benefits for authors and readers alike. For authors, articles always contain the most recent version of every figure, as figures are updated automatically whenever the computation changes. For readers, it becomes possible to understand exactly how a particular result was obtained, provided they have access to the underlying research compendium.

Best Practices

When creating a research compendium, there are a few things to consider (adapted from Arguillas et al., 2022, licensed under CC BY 4.0):

Does the research compendium contain everything needed to reproduce a predefined outcome in an organized and parsimonious way?

  1. Completeness: The research compendium contains all of the objects needed to reproduce a predefined outcome.
  2. Organization: It is easy to understand and keep track of the various objects in the research compendium and their relationship over time.
  3. Economy: Fewer extraneous objects in the compendium mean fewer things that can break and require less maintenance over time.

Is descriptive information about the research compendium and its components available and easy to understand?

  1. Transparency: The research compendium provides full disclosure of the research process that produced the scientific claim.
  2. Documentation: Information describing compendium objects is provided in enough detail to enable independent understanding and use of the compendium.

Is information about how the research compendium and its components can be used available and easy to understand?

  1. Access: It is clear who can use what, how, and under what conditions, with open access preferred.
  2. Provenance: The origin of the components of the research compendium and how each has changed over time is evident.

Is information about the research compendium and its components embedded in code?

  1. Metadata: Information about the research compendium and its components is embedded in a standardized, machine-readable code.
  2. Automation: As much as possible, the computational workflow is script- or workflow-based so that the workflow can be re-executed using minimal actions.

Is there a plan for reviewing the research compendium for FAIR and computational reproducibility standards over time?

  1. Review: A series of managed activities are needed to ensure continued access to and functionality of the research compendium and its components for as long as necessary.
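The Automation point above is commonly met by giving the compendium a single, documented entry point. As a hedged sketch (assuming the Quarto command-line tool and a hypothetical file layout), re-executing the whole workflow could then be a one-liner:

```shell
# Re-execute the complete workflow of the compendium in one step.
# Assumes the Quarto CLI is installed and paper.qmd is the main document;
# any data files and auxiliary scripts it reads live alongside it.
quarto render paper.qmd
```

A Makefile or similar task runner serves the same purpose; the key property is that a reader needs only one command, stated in the documentation, to reproduce the results.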

Although this tutorial guides you through the creation of a research compendium, you are invited to revisit these questions after completion and check whether and how each point was addressed. You can also consult them as a checklist for future projects.


References

Arguillas, F., Christian, T.-M., Gooch, M., Honeyman, T., Peer, L., & CURE-FAIR WG. (2022). 10 things for curating reproducible and FAIR research. Research Data Alliance; https://curating4reproducibility.org/10things/. https://doi.org/10.15497/RDA00074
Buckheit, J. B., & Donoho, D. L. (1995). WaveLab and reproducible research. In A. Antoniadis & G. Oppenheim (Eds.), Wavelets and statistics (Vol. 103, pp. 55–81). Springer New York. https://doi.org/10.1007/978-1-4612-2544-7_5
Gentleman, R., & Temple Lang, D. (2007). Statistical analyses and reproducible research. Journal of Computational and Graphical Statistics, 16(1), 1–23. https://doi.org/10.1198/106186007X178663
Ince, D. C., Hatton, L., & Graham-Cumming, J. (2012). The case for open computer programs. Nature, 482(7386), 485–488. https://doi.org/10.1038/nature10836
Kluyver, T., Ragan-Kelley, B., Pérez, F., Granger, B., Bussonnier, M., Frederic, J., Kelley, K., Hamrick, J., Grout, J., Corlay, S., Ivanov, P., Avila, D., Abdalla, S., Willing, C., & Jupyter development team. (2016). Jupyter notebooks – a publishing format for reproducible computational workflows. In F. Loizides & B. Schmidt (Eds.), Positioning and power in academic publishing: Players, agents and agendas (pp. 87–90). IOS Press. https://doi.org/10.3233/978-1-61499-649-1-87
Knuth, D. E. (1984). Literate programming. The Computer Journal, 27(2), 97–111. https://doi.org/10.1093/comjnl/27.2.97
Leisch, F. (2002). Sweave: Dynamic generation of statistical reports using literate data analysis. In W. Härdle & B. Rönz (Eds.), Compstat (pp. 575–580). Physica-Verlag HD. https://doi.org/10.1007/978-3-642-57489-4_89
National Academies of Sciences, Engineering, and Medicine. (2019). Reproducibility and replicability in science. National Academies Press. https://doi.org/10.17226/25303
Xie, Y. (2015). Dynamic documents with R and knitr (Second edition). CRC Press.

Footnotes

  1. Specifically, Quarto employs knitr to execute chunks of R code.