We will start by setting up a simple example of a reproducible report.
Create Quarto Project
First, we will need to create a new Quarto project.
If you haven’t already, open RStudio – see Note 1 for how to use the terminal instead. Then, click on File > New Project… to open the New Project Wizard.
Here, select New Directory
And choose the project type Quarto Project.
Finally, enter the name of the directory in which our report will be created, for example code-publishing-exercise.
Also, we will use the package renv to track the R packages our project depends on. Using it makes it easier for others to view and obtain them at a later point in time. Therefore, make sure that the box Use renv with this project is checked. Again, if this is the first time you are hearing about renv, have a look at the tutorial “Introduction to {renv}”.
If you are already familiar with Markdown and Quarto, you can uncheck the box Use visual markdown editor.
Click on Create Project. Your RStudio window should now look similar to this:
If, like in the image, a Quarto file with some demo content was opened automatically, you can close and delete it, for example, using RStudio’s file manager.
Make sure that your project is in a consistent state according to renv by running:
Console
renv::status()
If it reports packages that are not used, synchronize the lock file using:
Console
renv::snapshot()
Note 1: Without RStudio
Without RStudio, one can create a Quarto project with version control and renv enabled by typing the following into a terminal:
Terminal
quarto create project default code-publishing-exercise
cd code-publishing-exercise/
rm code-publishing-exercise.qmd
git init
git checkout -b main
Then, one can open an R session by simply typing R into the terminal. Next, make sure that getwd() indicates that the working directory is code-publishing-exercise. If not, set it using setwd("code-publishing-exercise"). Then, initialize renv:
Console
renv::init()
You are now ready to stage and commit your files. You can either stage files separately or the whole project folder at once. If you do the latter, we recommend inspecting the untracked changes before staging all of them:
Terminal
git status
Since no commits have been made so far, this should include every file that is not covered by the .gitignore file. If everything can be staged for committing, as is the case in this tutorial, you can follow up with:
Terminal
git add .
git commit -m "Initial commit"
In file paths, a period (.) means “the current directory”, while two periods (..) mean “the parent directory”. Therefore, git add . means “stage the current directory for committing”.
If you see a file you’d rather not commit, delete it or add its name to the .gitignore file. If you don’t check your changes before committing, you might accidentally commit something you’d rather not.
Tip 1
If git commit fails with the message Author identity unknown, you need to tell Git who you are. Set your name and email address by running git config --global user.name "Your Name" followed by git config --global user.email "you@example.com", replacing the placeholder values with your own.
Before adding your project files, it is helpful to decide on a directory structure, that is, what to name each file and where to put it. In general, the directory structure should make a project easier to understand by breaking it into logical chunks. There is no single best solution, as a good structure depends on where a project’s complexity lies. However, it is usually helpful if the different files and folders reflect the execution order. For example, if there are multiple data processing stages, one can distinguish input (raw data), intermediate (processed data), and output files (e.g., figures) and put them into separate folders. Similarly, the corresponding code files (e.g., preparation, modeling, visualization) can be prefixed with increasing numbers.
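As an illustration of such a layout, here is a minimal sketch in R. All folder and file names are hypothetical and only one of many reasonable choices; the structure is created in a temporary directory so that no real project is touched.

```r
# Hypothetical layout, created in a temporary folder for illustration only
project_dir <- file.path(tempdir(), "example-project")

# Separate folders for input, intermediate, and output files
folders <- c("data/raw", "data/processed", "output/figures")
for (folder in folders) {
  dir.create(file.path(project_dir, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# Numeric prefixes make the execution order visible in any file listing
scripts <- c("01_prepare.R", "02_model.R", "03_visualize.R")
created <- file.create(file.path(project_dir, scripts))

# Inspect the resulting structure
list.files(project_dir, recursive = TRUE, include.dirs = TRUE)
```

Because the scripts sort alphabetically by their prefix, a newcomer can tell at a glance which one to run first.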
For the purpose of this tutorial, we will provide you with a data set and a corresponding analysis. They are simple enough to be put together in the root folder of your project.
Add Data
You can now download the data set we have prepared for you and put it into your project folder: Data.csv
The data set is from the package palmerpenguins and contains the recorded bill lengths and sex of penguins living on three islands in the Palmer Archipelago, Antarctica. It was made available under the license CC0 1.0.
Warning 1: Consider Legal Restrictions Before Sharing
Everything you put into the project folder will be shared publicly. For reasons of reproducibility, this should include the data you analyze. Of course, you should only share them to the extent you are allowed to, taking into account:
applicable privacy laws (e.g., the GDPR for data subjects in the European Union),
contractual obligations (e.g., with your data provider),
copyright of the data and their particular structure, and
any sui generis database right.
Privacy laws and contractual obligations may require you to create a completely anonymized or synthetic dataset [1] (if possible), or prohibit any sharing of data, in which case you should provide a reference to a data repository from which they can be obtained. For further information, you can watch the talk “Data anonymity” by Felix Schönbrodt recorded during the LMU Open Science Center Summer School 2023 and have a look at the accompanying slides.
Purely factual data such as measurements are usually not copyrightable, but literary or artistic works that cross the threshold of originality are. Additionally, in some jurisdictions data can be subject to sui generis database rights which prevent extracting substantial parts of a database. As a consequence, you need to ensure that you own or have authority to share the data with respect to copyright and similar rights, and to license it to others (see “Choose a License”).
When publishing a data set, it is important to document the meaning (e.g., units) and possible values of its variables. This is typically done with a data dictionary (also called a codebook). In the following, we will demonstrate how to create a simple data dictionary using the R packages tinylabels, datawizard, and tinytable. You can install them now using:
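A minimal sketch of the installation step, assuming the three packages are installed from CRAN through renv (so that they end up in the project library) rather than with install.packages():

```r
# Install the packages used for the data dictionary into the project library.
# Assumption: the project uses renv, as set up earlier in this tutorial.
renv::install(c("tinylabels", "datawizard", "tinytable"))
```

Installing via renv does not yet update the lock file; that happens later when renv::snapshot() is run.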
You can put the code that follows for creating the data dictionary into a new file called create_data_dictionary.R.
Using tinylabels, we can add labels to the variables of a data.frame in R [2]:
create_data_dictionary.R
# Read the data set
dat <- read.csv("Data.csv")

# One description per variable
descriptions <- c(
  species = "a character string denoting penguin species",
  island = "a character string denoting island in Palmer Archipelago, Antarctica",
  bill_length_mm = "a number denoting bill length (millimeters)",
  bill_depth_mm = "a number denoting bill depth (millimeters)",
  flipper_length_mm = "an integer denoting flipper length (millimeters)",
  body_mass_g = "an integer denoting body mass (grams)",
  sex = "a character string denoting penguin sex",
  year = "an integer denoting the study year"
)

# Attach the descriptions as variable labels
tinylabels::variable_label(dat) <- descriptions
Subsequently, datawizard can be employed to create the data dictionary containing the name and label, but also some other information about each variable:
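A minimal sketch of this step, assuming it continues create_data_dictionary.R so that dat already carries the labels added above. The use of datawizard::data_codebook() and the output file name data_dictionary.html are assumptions made for illustration, not necessarily the exact choices of the tutorial.

```r
# `dat` is the labelled data frame from the previous step; fall back to the
# raw file (without labels) only if this snippet is run on its own.
if (!exists("dat")) {
  dat <- read.csv("Data.csv")
}

# Build the data dictionary: name, label, type, missings, and values
dict <- datawizard::data_codebook(dat)
print(dict)

# Assumed output step: render the dictionary as a table and save it.
# The file name is a hypothetical choice.
tinytable::tt(as.data.frame(dict)) |>
  tinytable::save_tt("data_dictionary.html", overwrite = TRUE)
```

Saving the dictionary as a file in the project folder means it is shared together with the data it describes.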
A human-readable data dictionary is necessary for making one’s research reproducible and the example we provided demonstrates only the bare minimum. A full data documentation including measurement instruments, sampling procedures, appropriate weighting, contact information, and more information about the study can be created with the R package pointblank. And one could go even further by making the information machine-readable in a standardized way. We provide an optional example of that in Note 2. If you want to learn more about the sharing of research data, have a look at the tutorial “FAIR research data management”.
This example demonstrates how the title and description of the data set, the description of the variables and their possible values are stored in a machine-readable way.
table_info <- c(
  title = "penguins dataset",
  description = "Size measurements for adult foraging penguins near Palmer Station, Antarctica"
)

descriptions <- c(
  species = "a character string denoting penguin species",
  island = "a character string denoting island in Palmer Archipelago, Antarctica",
  bill_length_mm = "a number denoting bill length (millimeters)",
  bill_depth_mm = "a number denoting bill depth (millimeters)",
  flipper_length_mm = "an integer denoting flipper length (millimeters)",
  body_mass_g = "an integer denoting body mass (grams)",
  sex = "a character string denoting penguin sex",
  year = "an integer denoting the study year"
)

vals <- list(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  island = c("Torgersen", "Biscoe", "Dream"),
  sex = c("male", "female"),
  year = c(2007, 2008, 2009)
)
Generally, metadata are either stored embedded into the data or externally, for example, in a separate file. We will use the “frictionless data” standard, where metadata are stored separately.
Specifically, one can use the R package frictionless to create a schema which describes the structure of the data. For the purpose of the following code, it is just a nested list that we edit to include our own information. We also explicitly record in the schema that missing values are stored in the data file as NA and that the data are licensed under CC0 1.0. Finally, the package is used to create a metadata file that contains the schema.
# Read data and create schema
dat_filename <- "Data.csv"
dat <- read.csv(dat_filename)
dat_schema <- frictionless::create_schema(dat)

# Add descriptions to the fields
dat_schema$fields <- lapply(dat_schema$fields, \(x) {
  c(x, description = descriptions[[x$name]])
})

# Record possible values
dat_schema$fields <- lapply(dat_schema$fields, \(x) {
  if (x$name %in% names(vals)) {
    modifyList(x, list(constraints = list(enum = vals[[x$name]])))
  } else {
    x
  }
})

# Define missing values
dat_schema$missingValues <- c("", "NA")

# Create package with license info and write it
dat_package <- frictionless::create_package() |>
  frictionless::add_resource(
    resource_name = "penguins",
    data = dat_filename,
    schema = dat_schema,
    title = table_info[["title"]],
    description = table_info[["description"]],
    licenses = list(list(
      name = "CC0-1.0",
      path = "https://creativecommons.org/publicdomain/zero/1.0/",
      title = "CC0 1.0 Universal"
    ))
  )
frictionless::write_package(dat_package, directory = ".")
This creates the metadata file datapackage.json in the current directory. Make sure it is located in the same folder as Data.csv, as together they comprise a data package.
Having added the data and its documentation, one can view and record the utilized packages with renv…
Console
renv::status()
renv::snapshot()
…and go through the commit routine:
Terminal
git status
git add .
git commit -m "Add data"
Add Code
In order to have some code with which you can practice sharing, we have prepared a simple manuscript for you, alongside a bibliography file. The manuscript contains code together with a written narrative. Download the two files to your computer and put them into your project folder.
The manuscript explores differences in bill length between male and female penguins; feel free to read through it.
Warning 2: Take Copyright Seriously
When you include work by others in your project, especially if you intend to make it publicly available, make sure you have the necessary rights to do so. Only build on existing work for which you are given an express grant of relevant rights. How do you know you are allowed to copy, edit, and share the two files linked above?
As the manuscript uses some new packages, install them with:
Console
renv::install()
The manuscript also uses the Quarto extension “apaquarto”, which typesets documents according to the requirements of the American Psychological Association (2020). It can be installed in the project using the following command:
Terminal
quarto add --no-prompt wjschne/apaquarto
Tip 2: Not a psychologist?
If you are not a psychologist, you can also skip installing apaquarto. If you installed it by accident, run quarto remove wjschne/apaquarto.
Note, however, that the file Manuscript.qmd we prepared for you uses apaquarto by default and you need to set a different format in the YAML header if you decide not to use apaquarto:
Also, you need to have a TeX distribution installed on your computer, which is used in the background to typeset PDF documents. A lightweight choice is TinyTeX, which can be installed with Quarto as follows:
Terminal
quarto install tinytex
You should now be able to render the document using Quarto:
Terminal
quarto render Manuscript.qmd
This should create a PDF file called Manuscript.pdf in your project folder.
Tip 3
If the PDF file cannot be created, try updating Quarto. Quarto comes bundled with RStudio, but apaquarto sometimes requires a more recent version.
With the code added, one can use renv again to view and record the new packages:
Console
renv::status()
renv::snapshot()
Tip 4
Always run renv::status() and resolve any inconsistencies before you commit code to your project. This way, every commit represents a working state of your project.
Finally, make your changes known to Git:
Terminal
git status
git add .
git commit -m "Add manuscript"
Warning 3: Beware of Credentials
Sometimes, a data analysis requires interaction with online services:
data may be collected from social network sites using their APIs [3] or downloaded from a data repository, or
an analysis may be conducted with the help of AI providers.
In these cases, make sure that the code you check in to Git does not contain any credentials that are required for accessing these services. Instead, make use of environment variables which are defined in a location that is excluded from version control. When programming with R, you can define them in a file called .Renviron in the root of your project folder:
When you start a new session from the project root, the file is automatically read by R and the environment variables can be accessed using Sys.getenv():
Make sure that .Renviron is added to your .gitignore file in order to exclude it from the Git repository. If you already committed a file that contains credentials, you can follow Chacon and Straub (2024).
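The pattern described above can be sketched as follows. A .Renviron file holds one NAME=value pair per line. The variable name MY_SERVICE_API_KEY and its value are hypothetical, and the file is written to a temporary directory and loaded explicitly here, whereas in a real project R would read .Renviron from the project root at startup.

```r
# Hypothetical variable name and dummy value, for illustration only.
# In a real project this line would live in the file .Renviron.
renviron_path <- file.path(tempdir(), ".Renviron")
writeLines("MY_SERVICE_API_KEY=dummy-value-123", renviron_path)

# R reads the project's .Renviron automatically when a session starts;
# here the throwaway file is loaded explicitly instead.
readRenviron(renviron_path)

# Look up the credential at run time instead of hard-coding it
api_key <- Sys.getenv("MY_SERVICE_API_KEY", unset = NA)
if (is.na(api_key)) {
  stop("MY_SERVICE_API_KEY is not set; define it in .Renviron.")
}
```

Checking for a missing variable right away gives collaborators a clear error message telling them which credential they need to supply themselves.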
Coding Best Practices
Although we provide the code in this example for you, a few things remain to be said about best practices when it comes to writing code that is readable and maintainable.
Use project-relative paths. When you refer to a file within your project, write paths relative to your project root. For example, instead of C:/Users/Public/Documents/my_project/images/result.png, write images/result.png.
Keep it simple. Add complexity only when you must. Whenever there’s a boring way to do something and a clever way, go for the boring way. If the code grows increasingly complex, refactor it into separate functions and files.
Don’t repeat yourself. Use variables and functions before you start to write (or copy-paste) the same thing twice.
Use comments to explain why you do things. The code already shows what you do. Use comments to summarize it and explain why you do it.
Don’t reinvent the wheel. With R, chances are that what you need to do is greatly facilitated by a package from one of many high-quality collections such as rOpenSci, r-lib, Tidyverse, or fastverse.
Think twice about your dependencies. Every dependency increases the risk of irreproducibility in the future. Prefer packages that are well-maintained and light on dependencies [4]. We also recommend reading “When should you take a dependency?” by Wickham and Bryan (2023).
Fail early, often and noisily. Whenever you expect a certain state, use assertions to be sure. In R, you can use stopifnot() to make sure that a condition is actually true.
Test your code. Test your code with scenarios where you know what the result should be. Turn bugs you discovered into test cases. Use linting tools [5] to identify common mistakes in your code, for example, the R package lintr.
Read through a style guide and follow it. A style guide is a set of stylistic conventions that improve the code quality. R users are recommended to read Wickham’s (2022) “Tidyverse style guide” and use the R package styler. Python users may benefit from reading the “Style Guide for Python Code” by Rossum, Warsaw, and Coghlan (2013). And even if you don’t follow a style guide, be consistent.
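Two of the practices above, “don’t repeat yourself” and “fail early”, can be sketched together in a few lines of R. The helper mean_by_group() and the toy values are made up for illustration and are not part of the tutorial’s manuscript.

```r
# Hypothetical helper: group means with early checks on the inputs.
# Wrapping the computation in a function avoids copy-pasting it.
mean_by_group <- function(values, groups) {
  # Fail early and noisily if the inputs are not what we expect
  stopifnot(
    is.numeric(values),
    length(values) == length(groups),
    !all(is.na(values))
  )
  tapply(values, groups, mean, na.rm = TRUE)
}

# Toy data (made up for illustration)
bill_length_mm <- c(39.1, 39.5, NA, 46.5, 50.0)
sex <- c("male", "female", "female", "male", "female")

result <- mean_by_group(bill_length_mm, sex)
print(result)
```

Calling mean_by_group("a", "b") stops immediately with an error, instead of silently producing a meaningless result further down the analysis.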
This is only a brief summary and there is much more to be learned about coding practices. If you want to dive deeper, we recommend the following resources:
“Any fool can write code that a computer can understand. Good programmers write code that humans can understand.”
— Martin Fowler, British software engineer
The Last Mile
renv only records the versions of R packages and of R itself. This means that potential system dependencies of R packages and other tools used in the project, including Quarto, are not documented anywhere. [6] We will manually write them down when creating a README. For now, however, there is one simple step you can take to record the version of Quarto (and a few other dependencies). Run the following:
Terminal
quarto use binder
This will create a few additional files which facilitate reconstructing the computational environment in the future. [7] As always, commit your changes by running git status, git add ., and git commit with a descriptive message.
Marwick, Ben, Carl Boettiger, and Lincoln Mullen. 2018. “Packaging Data Analytical Work Reproducibly Using R (and Friends).” The American Statistician 72 (1): 80–88. https://doi.org/10.1080/00031305.2017.1375986.
Publication Manual of the American Psychological Association. 2020. 7th ed. Washington: American Psychological Association. https://doi.org/10.1037/0000165-000.
Raymond, Eric S. 2003. The Art of UNIX Programming: With Contributions from Thirteen UNIX Pioneers, Including Its Inventor, Ken Thompson. Addison-Wesley Professional Computing Series. Boston: Addison-Wesley. https://www.arp242.net/the-art-of-unix-programming/.
Rossum, Guido van, Barry Warsaw, and Alyssa Coghlan. 2013. “Style Guide for Python Code. Python Enhancement Proposals (PEPs).” August 1, 2013. https://peps.python.org/pep-0008/.
Vuorre, Matti, and Matthew J. C. Crump. 2021. “Sharing and Organizing Research Products as R Packages.” Behavior Research Methods 53 (2): 792–802. https://doi.org/10.3758/s13428-020-01436-x.
[2] Note that the code provided does not alter the data file: no labels will be added to Data.csv. The labels are only added to a (temporary) copy of the data set within R in order to create the data dictionary.
[3] An application programming interface provides the capability to interact with other software using a programming language.
[4] You can use the function pak::pkg_deps() to count the total number of package dependencies in R.
[5] A linting tool analyzes your code without actually running it. This process is called static code analysis.
[6] As of August 2024, a proposal to record the version of Quarto has not been implemented, see rstudio/renv#1143.