We will start by setting up a simple example of a reproducible report.
Create Quarto Project
First, we will need to create a new Quarto project.
If you haven’t already, open RStudio – see Note 1 for how to use the terminal instead. Then, click on File > New Project… to open the New Project Wizard.
Here, select New Directory
And choose the project type Quarto Project.
Finally, enter the name of the directory in which our report will be created, for example code-publishing-exercise.
Also, we will utilize the package renv to track the R packages our project depends on. Using it makes it easier for others to view and obtain them at a later point in time. Therefore make sure that the box Use renv with this project is checked. Again, if this is the first time you are hearing about renv, have a look at the tutorial “Introduction to {renv}”.
If you are already familiar with Markdown and Quarto, you can uncheck the box Use visual markdown editor.
Click on Create Project. Your RStudio window should now look similar to this:
If, like in the image, a Quarto file with some demo content was opened automatically, you can close and delete it, for example, using RStudio’s file manager.
Make sure that your project is in a consistent state according to renv by running:
Console
renv::status()
If it reports packages that are not used, synchronize the lock file using:
Console
renv::snapshot()
Note 1: Without RStudio
Without RStudio, one can create a Quarto project with version control and renv enabled by typing the following into a terminal:
Terminal
quarto create project default code-publishing-exercise
cd code-publishing-exercise/
rm code-publishing-exercise.qmd
git init
git checkout -b main
Then, one can open an R session by simply typing R into the terminal. Next, make sure that getwd() indicates that the working directory is code-publishing-exercise. Then, initialize renv:
Console
renv::init()
You are now ready to stage and commit your files. You can either stage files separately or the whole project folder at once. If you do the latter, we recommend inspecting the untracked changes before staging all of them:
Terminal
git status
Since no commits have been made so far, this should include every file that is not covered by the .gitignore file. If everything can be staged for committing – as is the case in this tutorial – you can follow up with:
Terminal
git add .
git commit -m "Initial commit"
In file paths, a period (.) means “the current directory”, while two periods (..) mean “the parent directory”. Therefore git add . means “stage the current directory for committing”.
If you see a file you’d rather not commit, delete it or add its name to the .gitignore file. If you don’t check your changes before committing, you might accidentally commit something you’d rather not.
Tip 1
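If git commit fails with the message Author identity unknown, you need to tell Git who you are. To set your name and email address, run git config --global user.name "Your Name" followed by git config --global user.email "you@example.com", substituting your own details.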
Before adding your project files, it is helpful to decide on a directory structure, that is, what to call each file and where to put it. In general, the directory structure should facilitate understanding a project by breaking it into logical chunks. There is no single best solution, as a good structure depends on where a project’s complexity lies. However, it is usually helpful if the different files and folders reflect the execution order. For example, if there are multiple data processing stages, one can differentiate input (raw data), intermediate (processed data), and output files (e.g., figures) and put them into separate folders. Similarly, the corresponding code files (e.g., preparation, modeling, visualization) can be prefixed with increasing numbers.
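As a rough illustration of such a structure, the following sketch creates one folder per processing stage and numbered script files. All folder and file names are hypothetical, and the layout is built in a temporary directory so that it does not touch the tutorial project.

```r
# Hypothetical layout; all folder and file names are illustrative only
project_root <- file.path(tempdir(), "example-project")

# One folder per processing stage: input, intermediate, output
stage_dirs <- file.path(project_root, c("data_raw", "data_processed", "figures"))
for (d in stage_dirs) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# Numeric prefixes make the execution order visible in any file listing
scripts <- file.path(project_root, c("01_prepare.R", "02_model.R", "03_visualize.R"))
created <- file.create(scripts)

# Sorted alphabetically, the scripts appear in the order they should be run
script_names <- sort(list.files(project_root, pattern = "\\.R$"))
print(script_names)
```

Whether separate folders pay off depends on the size of the project; for a small analysis, a flat root folder can be just as clear.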
For the purpose of this tutorial, we will provide you with a data set and a corresponding analysis. They are simple enough to be put together in the root folder of your project.
Add Data
You can now download the data set we have prepared for you and put it into your project folder: data.csv
The data set is from the package palmerpenguins (v0.1.1) and contains the recorded bill lengths and sex of penguins living on three islands in the Palmer Archipelago, Antarctica. It was made available under the license CC0 1.0.
Important 1: Consider Legal Restrictions Before Sharing
Everything you put into the project folder will be shared publicly. For reasons of reproducibility, this should include the data you analyze. Of course, you should only share them to the extent you are allowed to, taking into account:
applicable privacy laws (e.g., the GDPR for European citizens),
contractual obligations (e.g., with your data provider),
copyright of the data and their particular structure, and
any sui generis database right.
Privacy laws and contractual obligations may require you to create a completely anonymized or synthetic data set[1] (if possible), or prohibit any sharing of data, in which case you should provide a reference to a data repository from which they can be obtained. For further information, you can watch the talk “Data anonymity” by Felix Schönbrodt recorded during the LMU Open Science Center Summer School 2023 and have a look at the accompanying slides.
Purely factual data such as measurements are usually not copyrightable, but literary or artistic works that cross the threshold of originality are. Additionally, in some jurisdictions data can be subject to sui generis database rights which prevent extracting substantial parts of a database. As a consequence, you need to ensure that you own or have authority to share the data with respect to copyright and similar rights, and to license it to others (see “Choose a License”).
When distributing a data set, it is important to document the meaning (e.g., units) and valid values of its variables. This is typically done with a data dictionary (also called a codebook). In the following, we will demonstrate how to create a simple data dictionary using the R package pointblank. You can install it now using:
Console
renv::install("pointblank")
You can put the code that follows for creating the data dictionary into a new file called create_data_dictionary.R.
First, we write down everything we know about the data set. This includes:
a general description of the data set
descriptions of all columns
valid values, where applicable
create_data_dictionary.R
table_info <- c(
  title = "palmerpenguins::penguins",
  description = "Size measurements for adult foraging penguins near Palmer Station, Antarctica"
)

descriptions <- c(
  species = "a character string denoting penguin species",
  island = "a character string denoting island in Palmer Archipelago, Antarctica",
  bill_length_mm = "a number denoting bill length (millimeters)",
  bill_depth_mm = "a number denoting bill depth (millimeters)",
  flipper_length_mm = "an integer denoting flipper length (millimeters)",
  body_mass_g = "an integer denoting body mass (grams)",
  sex = "a character string denoting penguin sex",
  year = "an integer denoting the study year"
)

vals <- list(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  island = c("Torgersen", "Biscoe", "Dream"),
  sex = c("male", "female"),
  year = c(2007, 2008, 2009)
)
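Writing down the valid values also makes it possible to check the data against them. The following is a minimal sketch in base R, not part of the tutorial's own workflow: the toy data frame, the list `allowed` (shaped like `vals` above), and the helper `find_invalid()` are all hypothetical.

```r
# Toy data for illustration only; the real data live in data.csv
toy_data <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap", "Adelie"),
  sex = c("male", "female", NA, "unknown")
)

# Allowed values per column, in the same shape as `vals` above
allowed <- list(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  sex = c("male", "female")
)

# For each documented column, collect values that are neither missing nor allowed
find_invalid <- function(data, allowed) {
  lapply(names(allowed), function(col) {
    values <- data[[col]]
    unique(values[!is.na(values) & !(values %in% allowed[[col]])])
  }) |>
    setNames(names(allowed))
}

invalid <- find_invalid(toy_data, allowed)
print(invalid)
```

Here, missing values are deliberately not treated as invalid; whether that is appropriate depends on the data set.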
Depending on the type of data, it may also be necessary to describe measurement instruments, sampling procedures, appropriate weighting, or contact information. In this case, as the data have already been published, we only store a reference to its source:
dat_source <- "Horst A, Hill A, Gorman K (2022). _palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data_. R package version 0.1.1, https://github.com/allisonhorst/palmerpenguins, <https://allisonhorst.github.io/palmerpenguins/>."
Then, we use pointblank to create a data dictionary with this information.
create_data_dictionary.R
vals <- sapply(vals, \(x) {
  paste0(
    "(",
    knitr::combine_words(x, and = " or ", before = "`", after = "`"),
    ")"
  )
})

dat <- read.csv("data.csv")

dict <- pointblank::create_informant(
  dat,
  tbl_name = NA,
  label = table_info[["title"]],
  lang = "en"
) |>
  pointblank::info_tabular(
    Description = table_info[["description"]],
    Source = dat_source
  ) |>
  pointblank::info_columns_from_tbl(stack(descriptions)[2:1]) |>
  pointblank::info_columns_from_tbl(stack(vals)[2:1]) |>
  pointblank::get_informant_report(
    title = "Data Dictionary for `data.csv`"
  )

dict
Data Dictionary for data.csv
palmerpenguins::penguins
data frame · Rows: 344 · Columns: 8
Table
DESCRIPTION: Size measurements for adult foraging penguins near Palmer Station, Antarctica
SOURCE: Horst A, Hill A, Gorman K (2022). _palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data_. R package version 0.1.1, https://github.com/allisonhorst/palmerpenguins, <https://allisonhorst.github.io/palmerpenguins/>.
Columns
species (character): a character string denoting penguin species (`Adelie`, `Gentoo`, or `Chinstrap`)
island (character): a character string denoting island in Palmer Archipelago, Antarctica (`Torgersen`, `Biscoe`, or `Dream`)
bill_length_mm (numeric): a number denoting bill length (millimeters)
bill_depth_mm (numeric): a number denoting bill depth (millimeters)
flipper_length_mm (integer): an integer denoting flipper length (millimeters)
body_mass_g (integer): an integer denoting body mass (grams)
sex (character): a character string denoting penguin sex (`male` or `female`)
year (integer): an integer denoting the study year (`2007`, `2008`, or `2009`)
Finally, we can store the data dictionary inside an HTML file and put the HTML file into the project folder as well.
One could go even further by making the information machine-readable in a standardized way. We provide an optional example of that in Note 2. If you want to learn more about the sharing of research data, have a look at the tutorial “FAIR research data management”.
Note 2
This example demonstrates how the title and description of the data set, the description of the variables and their valid values are stored in a machine-readable way. As before, we also provide a reference to the source.
Generally, metadata are either stored embedded into the data or externally, for example, in a separate file. We will use the “frictionless data” standard, where metadata are stored separately. Another alternative would be RO-Crate.
Specifically, one can use the R package frictionless to create a schema which describes the structure of the data.[2] For the purpose of the following code, it is just a nested list that we edit to include our own information. We also explicitly record in the schema that missing values are stored in the data file as NA and that the data are licensed under CC0 1.0. Finally, the package is used to create a metadata file that contains the schema.
Console
# Read data and create schema
dat_filename <- "data.csv"
dat <- read.csv(dat_filename)
dat_schema <- frictionless::create_schema(dat)

# Add descriptions to the fields
dat_schema$fields <- lapply(dat_schema$fields, \(x) {
  c(x, description = descriptions[[x$name]])
})

# Record valid values
dat_schema$fields <- lapply(dat_schema$fields, \(x) {
  if (x$name %in% names(vals)) {
    modifyList(x, list(constraints = list(enum = vals[[x$name]])))
  } else {
    x
  }
})

# Define missing values
dat_schema$missingValues <- c("", "NA")

# Create package with license info and write it
dat_package <- frictionless::create_package() |>
  frictionless::add_resource(
    resource_name = "penguins",
    data = dat_filename,
    schema = dat_schema,
    title = table_info[["title"]],
    description = table_info[["description"]],
    licenses = list(list(
      name = "CC0-1.0",
      path = "https://creativecommons.org/publicdomain/zero/1.0/",
      title = "CC0 1.0 Universal"
    )),
    sources = list(list(
      title = "CRAN",
      path = dat_source
    ))
  )

frictionless::write_package(dat_package, directory = ".")
This creates the metadata file datapackage.json in the current directory. Make sure it is located in the same folder as data.csv, as together they comprise a data package.
Having added the data and its documentation, one can view and record the utilized packages with renv…
Console
renv::status()
renv::snapshot()
…and go through the commit routine:
Terminal
git status
git add .
git commit -m "Add data"
Add Code
In order to have some code with which you can practice sharing, we have prepared a simple manuscript for you, alongside a bibliography file. The manuscript contains code together with a written narrative. Download the two files to your computer and put them into your project folder.
The manuscript explores differences in bill length between male and female penguins. Feel free to read through it.
Important 2: Take Copyright Seriously
If you include work by others in your project, especially if you intend to make it publicly available, make sure you have the necessary rights to do so. Only build on existing work for which you receive an express grant of relevant rights. How do you know you are allowed to copy, edit, and share the two files linked above?
As the manuscript uses some new packages, install them with:
Console
renv::install()
The manuscript also uses the Quarto extension “apaquarto”, which typesets documents according to the requirements of the American Psychological Association (2020). It can be installed in the project using the following command:
Terminal
quarto add --no-prompt wjschne/apaquarto
Tip 2: Not a Psychologist?
If you are not a psychologist, you can also skip installing apaquarto. If you installed it by accident, run quarto remove wjschne/apaquarto.
Note, however, that the file Manuscript.qmd we prepared for you uses apaquarto by default and you need to set a different format in the YAML header if you decide not to use apaquarto:
Also, you need to have a \(\TeX\) distribution installed on your computer, which is used in the background to typeset PDF documents. A lightweight choice is TinyTeX, which can be installed with Quarto as follows:
Terminal
quarto install tinytex
You should now be able to render the document using Quarto:
Terminal
quarto render Manuscript.qmd
This should create a PDF file called Manuscript.pdf in your project folder.
Tip 3
If the PDF file cannot be created, try updating Quarto. Quarto comes bundled with RStudio, but apaquarto sometimes requires a more recent version.
With the code added, one can use renv again to view and record the new packages:
Console
renv::status()
renv::snapshot()
Tip 4
Always run renv::status() and resolve any inconsistencies before you commit code to your project. This way, every commit represents a working state of your project.
Finally, make your changes known to Git:
Terminal
git status
git add .
git commit -m "Add manuscript"
Warning 1: Beware of Credentials
Sometimes, a data analysis requires interaction with online services:
Data may be collected from social network sites using their APIs[3] or downloaded from a data repository, or
an analysis may be conducted with the help of AI providers.
In these cases, make sure that the code you check in to Git does not contain any credentials that are required for accessing these services. Instead, make use of environment variables which are defined in a location that is excluded from version control. When programming with R, you can define them in a file called .Renviron in the root of your project folder:
When you start a new session from the project root, the file is automatically read by R and the environment variables can be accessed using Sys.getenv():
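A minimal sketch of this pattern follows. The variable name `EXAMPLE_API_KEY` and its value are hypothetical, and the `.Renviron` file is written to a temporary directory and loaded explicitly with `readRenviron()` only so that the example is self-contained; in a real project, R reads the file in the project root on startup.

```r
# Simulate a .Renviron file in a temporary location (for demonstration only;
# in a real project the file sits in the project root and is listed in .gitignore)
renviron_path <- file.path(tempdir(), ".Renviron")
writeLines("EXAMPLE_API_KEY=not-a-real-key", renviron_path)

# R reads .Renviron automatically at startup; here we load it explicitly
readRenviron(renviron_path)

# Access the credential at run time instead of writing it into the code
api_key <- Sys.getenv("EXAMPLE_API_KEY")

# Sys.getenv() returns "" for unset variables, so fail early if it is missing
stopifnot(nzchar(api_key))
```

The analysis code then only ever refers to the variable name, so it can be committed without exposing the secret.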
Make sure that .Renviron is added to your .gitignore file in order to exclude it from the Git repository. If you already committed a file that contains credentials, you can follow Chacon & Straub (2024).
Coding Best Practices
Although we provide the code in this example for you, a few things remain to be said about best practices when it comes to writing code that is readable and maintainable.
Use project-relative paths. When you refer to a file within your project, write paths relative to your project root. For example, don’t write C:/Users/Public/Documents/my_project/images/result.png, instead write images/result.png.
Keep it simple. Add complexity only when you must. Whenever there’s a boring way to do something and a clever way, go for the boring way. If the code grows increasingly complex, refactor it into separate functions and files.
Don’t repeat yourself. Use variables and functions before you start to write (or copy-paste) the same thing twice.
Use comments to explain why you do things. The code already shows what you do. Use comments to summarize it and explain why you do it.
Don’t reinvent the wheel. With R, chances are that what you need to do is greatly facilitated by a package from one of many high-quality collections such as rOpenSci, r-lib, Tidyverse, or fastverse.
Think twice about your dependencies. Every dependency increases the risk of irreproducibility in the future. Prefer packages that are well-maintained and light on dependencies[4]. We also recommend you to read “When should you take a dependency?” by Wickham & Bryan (2023).
Fail early, often, and noisily. Whenever you expect a certain state, use assertions to be sure. In R, you can use stopifnot() to make sure that a condition is actually true.
Test your code. Test your code with scenarios where you know what the result should be. Turn bugs you discovered into test cases. Use linting tools[5] to identify common mistakes in your code, for example, the R package lintr.
Read through a style guide and follow it. A style guide is a set of stylistic conventions that improve the code quality. R users are recommended to read Wickham’s (2022) “Tidyverse style guide” and use the R package styler. Python users may benefit from reading the “Style Guide for Python Code” by Rossum et al. (2013). And even if you don’t follow a style guide, be consistent.
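Several of these practices can be illustrated in a few lines of base R. The following is only a sketch: the helper `summarise_lengths()` and the input values are hypothetical and not taken from the manuscript.

```r
# Don't repeat yourself: wrap a recurring computation in a function
# (summarise_lengths() is a hypothetical helper, not part of the manuscript)
summarise_lengths <- function(x) {
  # Fail early and noisily if the input is not what we expect
  stopifnot(is.numeric(x), length(x) > 0, !anyNA(x))
  c(mean = mean(x), sd = sd(x))
}

# Project-relative path, built without hard-coding a path separator
result_path <- file.path("images", "result.png")

# Test with a scenario where the result is known in advance
res <- summarise_lengths(c(40, 50))
stopifnot(isTRUE(all.equal(res[["mean"]], 45)))

# Invalid input stops with an error instead of silently returning nonsense
failed <- inherits(try(summarise_lengths("forty"), silent = TRUE), "try-error")
stopifnot(failed)
```

For anything beyond a quick check like this, a dedicated testing package is usually the more maintainable choice.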
This is only a brief summary and there is much more to be learned about coding practices. If you want to dive deeper we recommend the following resources:
If you rely on data or software by others in your research, the question arises whether and how to cite it in your publications.
Data
Put simply, all data relied upon should be cited to allow for precise identification and access. From the “eight core principles of data citation” by Starr et al. (2015), licensed under CC0 1.0:
Principle 1 – Importance: “Data should be considered legitimate, citable products of research. Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.”
Principle 3 – Evidence: “In scholarly literature, whenever and wherever a claim relies upon data, the corresponding data should be cited.”
Principle 5 – Access: “Data citations should facilitate access to the data themselves and to such associated metadata, documentation, code, and other materials, as are necessary for both humans and machines to make informed use of the referenced data.”
Principle 7 – Specificity and Verifiability: “Data citations should facilitate identification of, access to, and verification of the specific data that support a claim. Citations or citation metadata should include information about provenance and fixity sufficient to facilitate verifying that the specific time slice, version and/or granular portion of data retrieved subsequently is the same as was originally cited.”
Now, add an appropriate citation for the data set to the manuscript. Does your citation adhere to the principles above?
Note 4: Hint for citing the data set
As the data set is from the R package palmerpenguins, one can use the function citation() to display a suggested citation:
citation("palmerpenguins")
To cite palmerpenguins in publications use:
Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer
Archipelago (Antarctica) penguin data. R package version 0.1.0.
https://allisonhorst.github.io/palmerpenguins/. doi:
10.5281/zenodo.3960218.
A BibTeX entry for LaTeX users is
@Manual{,
title = {palmerpenguins: Palmer Archipelago (Antarctica) penguin data},
author = {Allison Marie Horst and Alison Presmanes Hill and Kristen B Gorman},
year = {2020},
note = {R package version 0.1.0},
doi = {10.5281/zenodo.3960218},
url = {https://allisonhorst.github.io/palmerpenguins/},
}
Copy the BibTeX entry to the file Bibliography.bib and add an identifier between @Manual{ and the comma, such that the entry’s first line reads @Manual{horst2020,. Then, add a sentence to the manuscript such as follows:
The analyzed data are by @horst2020.
Render the document to check that the citation is displayed properly.
Terminal
quarto render Manuscript.qmd
Software
When it comes to software, the answer is a little more nuanced due to the large number of involved dependencies. You can consult Figure 1 for general advice on whether to cite a particular piece of software or not. As with data, citations should allow for exact identification and access. From the six “software citation principles” by Smith et al. (2016), licensed under CC BY 4.0:
1. Importance: Software should be considered a legitimate and citable product of research. Software citations should be accorded the same importance in the scholarly record as citations of other research products, such as publications and data; they should be included in the metadata of the citing work, for example in the reference list of a journal article, and should not be omitted or separated. Software should be cited on the same basis as any other research product such as a paper or a book, that is, authors should cite the appropriate set of software products just as they cite the appropriate set of papers.
5. Accessibility: Software citations should facilitate access to the software itself and to its associated metadata, documentation, data, and other materials necessary for both humans and machines to make informed use of the referenced software.
6. Specificity: Software citations should facilitate identification of, and access to, the specific version of software that was used. Software identification should be as specific as necessary, such as using version numbers, revision numbers, or variants such as platforms.
In practice, the first step is to identify all pieces of software the project relies on. A few of them are obvious, such as R itself, Quarto, and the \(\TeX\) distribution we installed before. Then there are the individual R packages, Quarto extensions, and \(\TeX\) packages. All of them, in turn, may have dependencies and it is up to you to decide when not to dig deeper. For example, some R packages are only thin wrappers around other R packages or around system dependencies which also might deserve credit. A system dependency is additional software that you require on your computer apart from the R package.
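For R itself and for individual R packages, base R already provides the information needed for a specific citation. The sketch below uses the package stats, which ships with R, merely as a stand-in for any package your project depends on.

```r
# Exact version of R in use, e.g. for a methods section
r_version <- R.version.string

# Suggested citation for R itself, also available as a BibTeX entry
r_citation <- citation()
r_bibtex <- toBibtex(r_citation)

# Version of an individual package; "stats" ships with R and serves as a stand-in
pkg_version <- as.character(packageVersion("stats"))

print(r_version)
print(r_bibtex)
```

Calling `sessionInfo()` additionally lists the attached packages with their versions, which can serve as a starting point for deciding what to cite.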
Now, add references for the software you would like to cite to the manuscript. In the following, we will demonstrate this for R and all R packages by using the R package grateful. For arbitrary software, you can use the CiteAs service to create appropriate citations.
Add the following code chunk to the end of the discussion in the manuscript:
This will automatically create a paragraph citing all used packages and generate the bibliography file grateful-refs.bib.[6] Then, in the YAML header, add grateful-refs.bib by setting the bibliography as follows:
Use renv to view, install, and record the newly used package:
Console
renv::status()
renv::install()
renv::snapshot()
Finally, render the document again and commit the changes:
Terminal
quarto render Manuscript.qmd
git status
git add .
git commit -m "Cite data and software"
The Last Mile
renv only records the versions of R packages and of R itself. This means that everything we have not decided to cite in the previous step is not documented anywhere. We will cover system dependencies when creating a README. For now, however, there is one simple step you can take to record the version of Quarto (and a few other dependencies). Run the following:
Terminal
quarto use binder
This will create a few additional files which facilitate reconstructing the computational environment in the future.[7] As always, commit your changes:
Fowler, M., Beck, K., Opdyke, W., & Roberts, D. (1999). Refactoring: Improving the design of existing code. Addison-Wesley.
Marwick, B., Boettiger, C., & Mullen, L. (2018). Packaging data analytical work reproducibly using R (and friends). The American Statistician, 72(1), 80–88. https://doi.org/10.1080/00031305.2017.1375986
Publication manual of the American Psychological Association (7th ed.). (2020). American Psychological Association. https://doi.org/10.1037/0000165-000
Raymond, E. S. (2003). The art of UNIX programming: With contributions from thirteen UNIX pioneers, including its inventor, Ken Thompson. Addison-Wesley. https://www.arp242.net/the-art-of-unix-programming/
Rossum, G. van, Warsaw, B., & Coghlan, A. (2013, August 1). Style guide for Python code. Python enhancement proposals (PEPs). https://peps.python.org/pep-0008/
Smith, A. M., Katz, D. S., Niemeyer, K. E., & FORCE11 Software Citation Working Group. (2016). Software citation principles. PeerJ Computer Science, 2, e86. https://doi.org/10.7717/peerj-cs.86
Starr, J., Castro, E., Crosas, M., Dumontier, M., Downs, R. R., Duerr, R., Haak, L. L., Haendel, M., Herman, I., Hodson, S., Hourclé, J., Kratz, J. E., Lin, J., Nielsen, L. H., Nurnberger, A., Proell, S., Rauber, A., Sacchi, S., Smith, A., … Clark, T. (2015). Achieving human and machine accessibility of cited data in scholarly publications. PeerJ Computer Science, 1, e1. https://doi.org/10.7717/peerj-cs.1
Vuorre, M., & Crump, M. J. C. (2021). Sharing and organizing research products as R packages. Behavior Research Methods, 53(2), 792–802. https://doi.org/10.3758/s13428-020-01436-x
[2] In June 2024, version 2 of the frictionless data standard was released. As of November 2024, the R package frictionless only supports the first version, though support for v2 is planned.
[3] An application programming interface provides the capability to interact with other software using a programming language.
[4] You can use the function pak::pkg_deps() to count the total number of package dependencies in R.
[5] A linting tool analyzes your code without actually running it. Therefore, this process is also called static code analysis.
[6] Note that this automatic detection can miss packages in some circumstances, therefore always verify the rendered result.