Introduction

In the following, we will set the stage by highlighting the importance of sharing all materials, documenting their usage, and linking results with their computations.

Document Materials

However, there’s more to code publishing than sharing. Researchers should document (i.e., track) which data, software, images, texts, and other materials contributed to their work. In more elaborate terms, researchers should preserve the provenance of shared materials, for at least three tangible reasons:

Academic integrity: Providing citations for works that are not one’s own avoids plagiarism.
Complying with the law: Providing attribution and license texts (among other things) may be a legal obligation when redistributing materials.
Facilitating reproductions: Stating utilized software and data with their exact versions helps reproducers.

Of course, when and how to preserve provenance varies for these reasons. Rephrasing someone’s text avoids issues of copyright, but may still be plagiarism. And bibliographies of various styles may comply with scientific citation standards, but if they omit the first names of authors or the version numbers of computer programs, they might not help with matters of copyright or reproducibility.
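For the reproducibility reason in particular, exact versions can be recorded by a short script rather than typed by hand. The following is a minimal sketch in Python, not part of this tutorial's R-based workflow (in R, `sessionInfo()` serves a similar purpose). The function name `describe_environment` and the package names passed to it are hypothetical placeholders.

```python
import platform
from importlib.metadata import PackageNotFoundError, version


def describe_environment(package_names):
    """Return interpreter, platform, and package versions as a plain dictionary."""
    packages = {}
    for name in package_names:
        try:
            packages[name] = version(name)
        except PackageNotFoundError:
            # Record the gap explicitly instead of silently skipping the package.
            packages[name] = "not installed"
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": packages,
    }


if __name__ == "__main__":
    # "numpy" and "pandas" are placeholders; list the packages your analysis imports.
    for key, value in describe_environment(["numpy", "pandas"]).items():
        print(f"{key}: {value}")
```

Saving this output alongside the shared materials gives reproducers the version information that a conventional bibliography entry often omits.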

In this tutorial, we’ll consider all of these reasons important. And there are even more: Documenting the provenance puts research into context and allows others to understand how it came about. Also, cited authors benefit from the citation as they gain potential readers.

Linking Results and Computations

For the purpose of this tutorial, tracking the provenance of results deserves particular attention. This means connecting them to the source code through the creation of dynamic documents. Rather than being copied manually, numerical results, figures, and tables are inserted automatically when the article is rendered. Dynamic documents bundled together with any necessary data and auxiliary software are called research compendia (Gentleman & Temple Lang, 2007).

The practice of interleaving narrative text with code has its roots in the paradigm of literate programming, where documentation and source code are treated as equals and are arranged so as to maximize understanding (Knuth, 1984). Alternating text and code can also be found in notebook interfaces for exploratory programming (e.g., Wolfram Mathematica or Jupyter Notebooks, see Kluyver et al., 2016), which can additionally execute code and embed its output. With Sweave (Leisch, 2002), ideas from both worlds (literate programming and embedding program output) were combined into one tool for rendering dynamic documents using the R programming language. Sweave is the predecessor of the R package knitr (Xie, 2015), which is used under the hood in this tutorial.¹

Linking results with their computations has benefits for authors and readers. For authors, articles always contain the most recent version of figures, as they are updated automatically when the computation changes. For readers, it makes it possible to understand exactly how a particular result was obtained, provided they have access to the underlying research compendium.
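The idea behind such linking can be illustrated with a toy sketch. It is written in Python with the standard library's `string.Template` as a stand-in and does not show how knitr or Quarto work internally. The names `TEMPLATE` and `render_report` and the example measurements are hypothetical.

```python
from statistics import mean
from string import Template

# Narrative text with placeholders where computed results belong.
TEMPLATE = "We collected $n measurements with a mean of $mean."


def render_report(template_text, measurements):
    """Compute the results from the data and insert them into the narrative."""
    results = {
        "n": len(measurements),
        "mean": f"{mean(measurements):.2f}",
    }
    return Template(template_text).substitute(results)


if __name__ == "__main__":
    # Changing the data changes the rendered sentence; nothing is copied by hand.
    print(render_report(TEMPLATE, [2.0, 3.5, 4.0, 4.5]))
```

Because the numbers in the sentence are derived from the data at render time, the text cannot drift out of sync with the computation, which is the property that dynamic documents provide at the scale of a whole article.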

Best Practices

To recap, in this tutorial you will share your materials, document their usage, and connect results to the underlying source code through the creation of a research compendium. Make sure to consider the following things along the way (by Arguillas et al., 2022, licensed under CC BY 4.0):

Does the research compendium contain everything needed to reproduce a predefined outcome in an organized and parsimonious way?

Completeness: The research compendium contains all of the objects needed to reproduce a predefined outcome.

Organization: It is easy to understand and keep track of the various objects in the research compendium and their relationship over time.

Economy: Fewer extraneous objects in the compendium mean fewer things that can break and require less maintenance over time.

Is descriptive information about the research compendium and its components available and easy to understand?

Transparency: The research compendium provides full disclosure of the research process that produced the scientific claim.

Documentation: Information describing compendium objects is provided in enough detail to enable independent understanding and use of the compendium.

Is information about how the research compendium and its components can be used available and easy to understand?

Access: It is clear who can use what, how, and under what conditions, with open access preferred.

Provenance: The origin of the components of the research compendium and how each has changed over time is evident.

Is information about the research compendium and its components embedded in code?

Metadata: Information about the research compendium and its components is embedded in a standardized, machine-readable code.

Automation: As much as possible, the computational workflow is script- or workflow-based so that the workflow can be re-executed using minimal actions.

Is there a plan for reviewing the research compendium for FAIR and computational reproducibility standards over time?

Review: A series of managed activities are needed to ensure continued access to and functionality of the research compendium and its components for as long as necessary.

Although this tutorial guides you through the creation of a research compendium, you are invited to revisit these questions after completion and check whether and how each point was addressed (or not). Further, you can consult them as a checklist for future projects.

References

Arguillas, F., Christian, T.-M., Gooch, M., Honeyman, T., Peer, L., & CURE-FAIR WG. (2022). 10 things for curating reproducible and FAIR research. Research Data Alliance; https://curating4reproducibility.org/10things/. https://doi.org/10.15497/RDA00074

Buckheit, J. B., & Donoho, D. L. (1995). WaveLab and reproducible research. In A. Antoniadis & G. Oppenheim (Eds.), Wavelets and statistics (Vol. 103, pp. 55–81). Springer New York. https://doi.org/10.1007/978-1-4612-2544-7_5

Gentleman, R., & Temple Lang, D. (2007). Statistical analyses and reproducible research. Journal of Computational and Graphical Statistics, 16(1), 1–23. https://doi.org/10.1198/106186007X178663

Ince, D. C., Hatton, L., & Graham-Cumming, J. (2012). The case for open computer programs. Nature, 482(7386), 485–488. https://doi.org/10.1038/nature10836

Kluyver, T., Ragan-Kelley, B., Pérez, F., Granger, B., Bussonnier, M., Frederic, J., Kelley, K., Hamrick, J., Grout, J., Corlay, S., Ivanov, P., Avila, D., Abdalla, S., Willing, C., & Jupyter development team. (2016). Jupyter notebooks – a publishing format for reproducible computational workflows. In F. Loizides & B. Schmidt (Eds.), Positioning and power in academic publishing: Players, agents and agendas (pp. 87–90). IOS Press. https://doi.org/10.3233/978-1-61499-649-1-87

Knuth, D. E. (1984). Literate programming. The Computer Journal, 27(2), 97–111. https://doi.org/10.1093/comjnl/27.2.97

Leisch, F. (2002). Sweave: Dynamic generation of statistical reports using literate data analysis. In W. Härdle & B. Rönz (Eds.), Compstat (pp. 575–580). Physica-Verlag HD. https://doi.org/10.1007/978-3-642-57489-4_89

National Academies of Sciences, Engineering, and Medicine. (2019). Reproducibility and replicability in science. National Academies Press. https://doi.org/10.17226/25303

Xie, Y. (2015). Dynamic documents with R and knitr (Second edition). CRC Press.

Footnotes

¹ Specifically, Quarto employs knitr to execute chunks of R code.
