3. Analyze & Collaborate
Create reproducible analyses and collaborate effectively
3.1 Data Processing & Analysis
Data processing and analysis should be reproducible, independent of software, programming language, or operating system. This is best achieved by favoring automated script-based workflows (over manual point-and-click procedures), and supported by adequate documentation and shared code that allows others to regenerate results. Because analysis scripts are inherently iterative and evolve through corrections and refinements, automation is essential for both reproducibility and efficiency. This section outlines a recommended workflow for R users.
- Create a self-contained project folder. Include data, code, documentation, and outputs in a single structured environment, ensuring the project remains understandable, reproducible, and portable across systems and collaborators. If you use the free and open-source software RStudio to manage your R project, your project directory (or folder) should contain a .Rproj file (see R tutorial). Use relative paths (i.e. “./subfolder”, where . represents the root of your .Rproj directory) or the library here, so the project stays portable to another environment.
- Use a standard folder structure. Your code repository should include a standard folder structure that makes sense for your type of research, ideally shared across your team members. You can for instance use our research project template.
- Stop clicking, start coding. Automate all possible steps, including data acquisition (see 2.1. Data Collection), data processing and transformation, data analysis, data visualization, and results reporting (see 3.2. Reporting Results).
- Structure, comment, and standardize your scripts. R scripts should follow current standards to increase their readability (see Readable Code Lecture). Use meaningful names for variables, functions, and scripts. Add comments to your code explaining why you made a decision, any known limitations of your code, and citations of methods. Do not include sensitive information, such as credentials or the names of excluded patients, as comments in your code!
- Define your own functions rather than copy-pasting pieces of code, which makes them hard to maintain error-free. Functions are ‘self-contained’ sets of commands that accomplish a specific task. They usually ‘take in’ data or parameter values (these inputs are called ‘function arguments’), process them, and ‘return’ a result. See our R tutorial and data simulation tutorial for examples.
- Set seeds for random processes to enable exact replication. A seed is a number used to initialize a pseudorandom number generator algorithm. It serves as the starting point for a sequence of numbers that appear random but are actually produced by a deterministic, fixed algorithm. See e.g. our data simulation tutorial for examples.
- Follow a style guide to increase readability. Use automated styling tools (e.g. styler, lintr).
- Use the LRZ Compute Cloud for data-intensive analyses. LRZ Supercomputing provides virtual machines, high-performance computing, and storage to researchers of LMU Munich.
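The practices above can be sketched in a short R script. The file names and folder layout below are hypothetical examples, not prescribed paths:

```r
# A minimal sketch of a reproducible analysis step.
# Paths are relative to the project root (the folder containing the .Rproj file).
library(here)  # install.packages("here") if needed

# Read input data with a project-relative path instead of an absolute one
raw <- read.csv(here("data", "raw", "measurements.csv"))  # hypothetical file

# Define a function instead of copy-pasting the same transformation
standardize <- function(x) {
  # Center and scale a numeric vector, ignoring missing values
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Set a seed so any random step (sampling, simulation) is exactly replicable
set.seed(2024)
bootstrap_idx <- sample(nrow(raw), replace = TRUE)

# Write outputs back into the project, again with relative paths
write.csv(raw[bootstrap_idx, ], here("outputs", "bootstrap_sample.csv"),
          row.names = FALSE)
```

Because the seed is fixed and all paths are project-relative, rerunning this script on another machine from the project root reproduces the same bootstrap sample.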
Version control tracks changes to files over time. You can see what changed, when, and why. You can revert to previous versions. Collaborators can work without overwriting each other.
In a version controlled workflow, you back up your local Git repositories on the cloud-based platforms GitHub or LRZ GitLab and share access to the online version of your repositories with your collaborators.
Learn to create Git branches, i.e. temporary copies where you can work without breaking the original source code and which you later merge back into the main branch, to collaborate on the same piece of code in a single repository (see Advanced Git tutorial).
Maintain efficient communication to coordinate collaborative work. GitHub and GitLab facilitate collaboration through “issues” (a precise description of something to fix), “discussions” (asynchronous thinking through how to resolve a problem), and tools to resolve “conflicts” (i.e. collaborators wanting to merge changes on the exact same line of a script). To complement this, LMU Munich offers LMU Chat (Matrix), a secure open-source chat service for all LMU members, to which you can also invite external collaborators.
- Git is a version control system that tracks changes in text files (e.g. CSV, plain text, R, Python). The Git software and your Git repositories should be, respectively, installed and located in your local environment (i.e. on your computer, not on a drive, see Git tutorial).
- GitHub is the most popular cloud-based platform for software development with Git; it is free to use but proprietary and US-based, and provides collaboration features such as pull requests and issues (see GitHub tutorial). You should not keep any sensitive information on GitHub, even in a private repository.
- LRZ GitLab is a cloud-based hosting platform that works much like GitHub but is free and open source. It is installed on the LRZ servers for LMU Munich and can therefore be considered secure when the repository is private.
While your LRZ GitLab account is associated with your LMU Munich affiliation, your GitHub account can be associated with your private email, be included in your CV, and be used for public sharing of your data and code (see 4. Preserve & Share).
In a version-controlled workflow, you back up your local Git repositories on either GitHub or LRZ GitLab through a secure SSH connection (see GitHub tutorial) and share access to your repositories with your collaborators through that platform.
If you work with sensitive data, you must not include the raw or processed data in the version-controlled repository that will end up being shared publicly.
Instead, explicitly exclude the data directory using the .gitignore file from the start, or, at the time of sharing, create a new local repository that contains all project files except the data.
Importantly, if data are removed from an existing repository, they may still remain accessible in the repository’s history, since previous states of the project can be restored. If sensitive data are accidentally committed and pushed, it is possible to rewrite the repository history to remove them retrospectively. However, this process is complex and error-prone, so it is best avoided by ensuring that sensitive data are excluded from version control from the outset.
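As a sketch, a minimal .gitignore that excludes a hypothetical data/ directory from version control could look like the following (adjust the paths to your own folder structure; in the .gitignore format, comments must sit on their own lines):

```
# .gitignore -- keep sensitive data out of version control from the start

# Raw and processed data (hypothetical folder name):
data/

# Local credentials and environment variables:
.Renviron
```

Because ignored files are never committed, they never enter the repository history, which avoids the error-prone history rewriting described above.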
Create an LRZ GitLab “organization” for the team. This allows repositories, permissions, and project resources to be managed centrally rather than under individual accounts. It also ensures continuity when team members leave, as ownership can be transferred to the PI or other administrators of the organization.
Manage your computational environment by explicitly recording the software, package versions, and dependencies required for your analyses, ensuring results can be reproduced across systems and over time. Tools such as package managers (e.g. renv for R packages, Conda for Python packages) or broader containers (e.g. Docker or Binder) help stabilize workflows and prevent inconsistencies caused by package or software updates.
For an R project repository:
- Activate renv to keep track of all package versions (see our renv tutorial). This way, you or someone else can reproduce your results on another computer or at a later time using the same R package versions.
Before publishing your project (see 4. Preserve & Share):
- Record your dependencies in your README file for possible reconstruction with repo2docker or Binder (see Code Publishing tutorial).
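In practice, the renv workflow boils down to a few commands, sketched below (see the renv tutorial for details):

```r
# One-time setup in a new project: creates a project-private package
# library and an renv.lock file recording exact package versions
renv::init()

# After installing or updating packages, update the lockfile
renv::snapshot()

# On another machine (or at a later time), reinstall the exact
# package versions recorded in renv.lock
renv::restore()
```

The renv.lock file is a plain-text file, so it should be committed to your Git repository along with your code.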
As with all documentation, your project repository’s documentation should be written early - initially for your near-future self to support efficient re-engagement after interruptions, then revised for internal team review, and ultimately expanded and refined for public sharing (see 4.2. Open Source Code).
- Create a README (e.g. a .md or .txt file) early and update it as you go. Your README is the entry point to your project. A good README answers the essential questions: who created the project, what it contains, how the scripts relate to one another and in which order they should be run, what the dependencies of the project are, how to obtain/access the input data, and whether the code can be reused.
- Annotate your code to explain why you made each decision. All parameter values used as input for a function, and other decisions, should be justified at least minimally as comments in your code, to later be included in your manuscript. Do not include sensitive information, such as credentials or the names of excluded patients, as comments in your code!
- Update your data dictionary and README files. Your data documentation should be updated to include all data exclusions, changes in the range of possible values, etc. (see 2.2.4. Documentation).
Example README to allow team members to review your code:

```markdown
# Analysis of Treatment Effects

## Requirements

- R version 4.3+
- Packages listed in renv.lock

## Running the Analysis

1. Install dependencies: `renv::restore()`
2. Run scripts in order: 01_preprocessing.R, 02_analysis.R
```
3.2 Reporting Results
Your results should be computationally reproducible to ensure that the analysis can be independently verified. They should also be reported comprehensively and in line with reporting guidelines to facilitate statistical checks and future meta-analyses.
A practical way to make results and analysis decisions transparent and traceable is to use literate programming, which combines narrative text explaining the logic of the analysis with executable code that performs data processing and statistical procedures. When the document is compiled, outputs such as tables, figures, and references are generated directly from the code and update automatically whenever the code changes, ensuring that the reported methods and results remain consistent with the analysis. This eliminates repeated copy-and-paste and reduces the risk of uncertainty about which version of, for example, a figure is actually included in the manuscript.
For users of R, Python, and Julia, Quarto has become the standard tool for creating reproducible reports, superseding R Markdown and enabling rendering to formats such as HTML, PDF, and Word (see our Quarto tutorial).
Reproducible Analysis Report: The Minimum Standard for Computational Reproducibility
- Create an analysis report with Quarto. To make your results reproducible, provide the analysis code that generates the results of the manuscript, with explanations for each step. Quarto is an ideal tool for this purpose (see Quarto tutorial). Your analysis report should contain your research question, data loading instructions, preprocessing steps, statistical procedures, tables and figures, and session info with software versions (in the report itself or in your README).
This will facilitate your own revisions, code-review by your team, reproducibility checks that some journals conduct as part of peer-review, and verification of results reproducibility by other researchers. The code repository can be published with simulated, synthetic, or real data, depending on the project, together with the respective article (see 4.2. Open Source Code).
Creating a reproducible analysis report in Quarto currently represents state-of-the-art practice for demonstrating the computational reproducibility of a study, and we recommend adopting this approach as a minimum standard.
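A minimal analysis report can be sketched as a single .qmd file like the following. The title, file paths, and variable names are hypothetical placeholders:

````markdown
---
title: "Analysis Report: Treatment Effects"
format: html
---

## Research question

Does treatment X improve outcome Y?

## Data and preprocessing

```{r}
library(here)
data <- read.csv(here("data", "clean", "trial.csv"))  # hypothetical file
```

## Statistical analysis

```{r}
model <- lm(outcome ~ treatment, data = data)
summary(model)
```

## Session info

```{r}
sessionInfo()
```
````

Rendering this file (e.g. with `quarto render`) executes every code chunk and embeds the resulting tables and figures directly in the output, so the report always reflects the current state of the code and data.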
Quarto can also be used to generate additional reporting outputs, such as manuscripts, presentations, and websites, in a fully reproducible manner. However, adopting a fully reproducible workflow for all outputs will require additional coordination and time, particularly when working with collaborators who use different tools or workflows.
Reproducible Manuscript: The Ultimate Reproducible Report
If you do not want to break the reproducibility chain by copying and pasting new results, tables, and figures into e.g. a Word document, you can write your entire manuscript in Quarto.
Include bibliographic references in your Quarto document. You can directly source .bib files into your Quarto document or use the open-source reference management software Zotero, integrated in RStudio, to cite articles and have them automatically formatted to any journal's standards (see Quarto tutorial and Zotero tutorial).
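For example, the YAML header of your Quarto document can point to a .bib file and a citation style (the file names below are hypothetical):

```yaml
---
title: "My Manuscript"
bibliography: references.bib   # exported from Zotero or maintained by hand
csl: apa.csl                   # citation style file; many journal styles exist
---
```

In the body of the document, a citation key such as `[@smith2020]` inserts a formatted in-text citation, and the bibliography is generated automatically when the document is rendered.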
Use a template document for major formatting aspects. You can, for example, use a Word template with specific header and legend formatting to style your output. Some journals offer Quarto templates for further formatting requirements (see the list of Quarto extensions); others offer LaTeX or Word templates, which you can adapt into your own Quarto template by modifying the YAML header and markdown content.
Adapt your workflow to your collaborators. Ideally, all your teammates have also adopted Git with GitHub or LRZ GitLab and the use of issues and branches to work collaboratively (see 3.1.2. Version Control). A compromise with collaborators who do not use Git is to render your draft manuscript to e.g. Word and share it through cloud-based collaborative document editing, e.g. LRZ Sync & Share or Google Docs, with collaborators using the suggestion / track changes mode. Then, the lead author transcribes all edits manually, or with the help of R packages such as trackdown or officer, and addresses all comments in their Quarto version before re-rendering for a second round of revisions, potentially with additional analyses.
Reproducible Presentations & Websites: Further Reproducible Outputs for Outreach
Using the same Quarto environment, you can also:
Convert your minimal analysis report into a reproducible presentation. To get feedback on your analyses, you can output a report as a slide presentation with presenter notes. Slides remain dynamically linked to the underlying analyses: any modification to the data or code automatically propagates to tables, visualizations, and reported results, eliminating inconsistencies between presentation and computation.
Create a reproducible website to support your research or teaching. Quarto also enables the creation of reproducible and interactive websites by linking content directly to data, code, and analyses within a unified publishing workflow. Pages, figures, tables, and teaching materials update automatically when the underlying sources change, ensuring consistency across outputs. This approach simplifies maintenance and allows websites to evolve as living, version-controlled resources rather than static collections of files. Examples of websites, interactive reports, and other formats can be explored in the Quarto Gallery. This website and our self-paced tutorials are all Quarto websites.
Transparent and complete reporting is essential for the interpretation, verification, and reuse of scientific results. To support this, many disciplines have developed reporting guidelines that specify the minimum information that should be included when describing study design, data collection, analysis, and results. These guidelines help ensure that studies can be critically evaluated, replicated, and included in evidence syntheses such as systematic reviews and meta-analyses.
A central resource for such guidelines is the EQUATOR Network, an international initiative collecting and promoting reporting standards across many study types and disciplines.
Typical items that should be reported include:
- Study design and setting – the type of study (e.g., randomized trial, cohort study), where and when it was conducted, and relevant contextual factors.
- Participants or data sources – eligibility criteria, recruitment or sampling procedures, sample size, and reasons for exclusions or dropouts.
- Variables and measurements – definitions of exposures, outcomes, predictors, and covariates, including how and when they were measured.
- Sample size justification – power calculations or other reasoning behind the chosen sample size.
- Statistical methods – the models used, assumptions checked, handling of missing data, variable transformations, and any sensitivity or robustness analyses.
- Data preprocessing and analysis workflow – steps such as cleaning, filtering, or derived variables that affect the final dataset used for analysis.
- Results with appropriate uncertainty – effect estimates, confidence intervals, p-values, and clear descriptions of the comparisons performed.
- Participant flow and descriptive statistics – numbers of observations at each stage and summary statistics of the analyzed sample.
- Limitations and potential sources of bias – issues such as confounding, measurement error, or selection bias.
- Data, code, and materials availability – where readers can access datasets, analysis scripts, or supplementary materials to reproduce the analysis.
Many of these elements are also part of a preregistration (see 1.4.1. Pre-analysis Planning). Describe them in your methods section, noting that they were planned a priori in your preregistration and citing its DOI, and separate your results into e.g. preregistered confirmatory analyses and non-preregistered exploratory analyses.
Some of these elements will be refined depending on your study type. Widely used examples of guidelines include CONSORT for randomized controlled trials, ARRIVE for animal studies, and PRISMA for systematic reviews.
Analyze & Collaborate Checklist
This assumes a reproducible workflow using R. Not all items are relevant for all fields of research or study types.
1. During data analyses: create and maintain a repository
Repository Setup
Repository Content
R Scripts
2. Before presenting results to the research group: conduct internal code peer review
3. Before sharing outside the group