Introduction
In the following, we will set the stage by highlighting the importance of sharing all materials, documenting their usage, and linking results with their computations, leading to the creation research compendia.
The Importance of Sharing
Suppose you are reading an article about a new imaging method to turn seismological data into subsurface images. The article describes the ideas that went into developing this method and presents a few examples to illustrate its superiority over previous approaches. You got interested and would like to apply this method to your own data. However, with only the article available, it could take you months to come up with a working solution, if possible at all. This situation has been put aptly by Buckheit & Donoho (1995, p. 59, emphasis in original), distilling an idea by the geophysicist Jon Claerbout:
“An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.”
Even when researchers merely apply existing methods (rather than report on a new method), sharing the source code and being transparent about the computational environment is imperative to making their results reproducible (Ince et al., 2012). By reproducibility, we mean “obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis” (National Academies of Sciences, Engineering, and Medicine, 2019, p. 46).
Document Materials
However, there’s more to code publishing than sharing. Researchers should document (i.e., track) which foreign data, software, images, texts, and other materials contributed to their work. In more elaborate terms, researchers should preserve the provenance of shared materials, for at least three tangible reasons:
- Academic integrity: Providing citations for works that are not one’s own avoids plagiarism.
- Complying with the law: Providing attribution and license texts (among other things) may be a legal obligation when redistributing materials.
- Facilitating reproductions: Stating utilized software and data with their exact versions helps reproducers.
Of course, when and how to reserve provenance varies for these reasons. Rephrasing someone’s text avoids issues of copyright, but may be plagiarism. And bibliographies of various styles may comply with scientific citation standards, but if they omit the first names of authors or the version number of computer programs, they might not help with matters of copyright or reproducibility.
In this tutorial, we’ll consider all of these reasons important. And there are even more: Documenting the provenance puts research into context and allows others to understand how it came about. Also, cited authors benefit from the citation as they gain potential readers.
Linking Results and Computations
For the purpose of this tutorial, tracking the provenance of results deserves particular attention. This means connecting them to the source code through the creation of dynamic documents. Rather than manually copying numerical results, figures, or tables, they are inserted automatically upon rendering of the article. Dynamic documents bundled together with any necessary data and auxiliary software are called research compendia (Gentleman & Temple Lang, 2007).
The practice to interleave narrative text with code has its roots in the paradigm of literate programming, where documentation and source code are treated as equals and are arranged in a way to maximize understanding (Knuth, 1984). Alternating text and code can be also found in notebook interfaces for exploratory programming (e.g., Wolfram Mathematica or Jupyter Notebooks, see Kluyver et al., 2016) that also have the capability to execute code and embed its output. With Sweave (Leisch, 2002), ideas from both worlds – literate programming and embedding program output – were combined into one tool for rendering dynamic documents using the R programming language. It is the predecessor of the R package knitr
(Xie, 2015) which is being used under the hood in this tutorial.1
Linking results with their computations has benefits for authors and readers. For the author, articles always contain the most recent version of figures, as they are updated automatically when the computation changes. For the readers, it enables understanding exactly how a particular result was obtained if they get access to the underlying research compendium.
Best Practices
To recap, in this tutorial you will share your materials, document their usage, and connect results to the underlying source code through the creation of a research compendium. Make sure to consider the following things along the way (by Arguillas et al., 2022, licensed under CC BY 4.0):
Does the research compendium contain everything needed to reproduce a predefined outcome in an organized and parsimonious way?
- Completeness: The research compendium contains all of the objects needed to reproduce a predefined outcome.
- Organization: It is easy to understand and keep track of the various objects in the research compendium and their relationship over time.
- Economy: Fewer extraneous objects in the compendium mean fewer things that can break and require less maintenance over time.
Is descriptive information about the research compendium and its components available and easy to understand?
- Transparency: The research compendium provides full disclosure of the research process that produced the scientific claim.
- Documentation: Information describing compendium objects is provided in enough detail to enable independent understanding and use of the compendium.
Is information about how the research compendium and its components can be used available and easy to understand?
- Access: It is clear who can use what, how, and under what conditions, with open access preferred.
- Provenance: The origin of the components of the research compendium and how each has changed over time is evident.
Is information about the research compendium and its components embedded in code?
- Metadata: Information about the research compendium and its components is embedded in a standardized, machine-readable code.
- Automation: As much as possible, the computational workflow is script- or workflow-based so that the workflow can be re-executed using minimal actions.
Is there a plan for reviewing the research compendium for FAIR and computational reproducibility standards over time?
- Review: A series of managed activities are needed to ensure continued access to and functionality of the research compendium and its components for as long as necessary.
Although this tutorial guides you through the creation of a research compendium, you are invited to revisit these questions after completion and check whether and how each point was addressed (or not). Further, you can consult them as a checklist for future projects.
References
Footnotes
Specifically, Quarto employs
knitr
to execute chunks of R code.↩︎