

2. Collect & Manage


Build quality into your data from the start

Data Collection · Data Management · Ethics & Privacy · Quality Control
Checkpoint: data quality validation meeting

2.1 Data Collection

Your data acquisition procedures must be documented in sufficient detail to allow replication by another researcher (see LMU Guidelines for Safeguarding Good Scientific Practice). Establishing reproducible data collection processes strengthens the quality and consistency of both your own data and your team's.

State-of-the-art practices for reproducible data acquisition include:

  • creating standard operating procedures
  • recording metadata as data collection is taking place
  • building in automation through programming
Note: Definition

Metadata are data about data; they provide context to your data. Metadata such as equipment settings, environmental conditions, software versions, and calibration records should be recorded contemporaneously, not reconstructed afterward. Electronic lab notebooks, instrument logs, and automated logging all help to document your metadata.

Tip: for research groups, to streamline this process
  • Create standard data acquisition procedures within the team. From step-by-step wet-lab procedures to measuring-device settings and script-based data pre-processing, every regularly repeated step should be documented and standardized so that all team members can replicate it precisely.
  • Share standard operating procedures through, e.g., common server space such as LRZ Sync & Share, specialized online tools like protocols.io, electronic lab notebooks, or LRZ GitLab for scripts.
  • 2.1.1. Lab Protocols
  • 2.1.2. Questionnaires
  • 2.1.3. Software-based data acquisition

Your protocol should specify materials with identifying details (lot numbers, versions, sources), equipment settings, step-by-step instructions with timing, and expected outcomes at each stage. What counts as “materials” varies by field: reagent concentrations in wet lab work, scanner parameters in neuroimaging, sampling coordinates in field ecology. But the principle is the same: enough detail that someone else could replicate your procedure exactly.


  • Write detailed methods and reusable protocols. Draft your protocol before you start, with all the details needed for an exact replication (see the ReproducibiliTeach lecture on reusable protocols).

  • Track deviations in real time. Follow your protocol precisely, and record any deviations as they happen. When you need to adapt, note it immediately. These deviations often explain unexpected results and guide protocol improvements. Electronic lab notebooks (ELNs) make this easier by creating version-controlled, timestamped records automatically, providing an audit trail that paper cannot match.

  • Publish your protocols. A detailed, tested protocol is a contribution to your field. Publishing establishes priority, enables citation, and makes your methods reusable. Platforms like protocols.io provide version control and DOI assignment.

LEARN MORE

reproducibiliteach icon

Write reusable protocols

Understand the level of detail needed for your protocols.

TOOLS & RESOURCES

eLabFTW icon
Supported at LMU

eLabFTW

Electronic Lab Notebook hosted by LMU Munich.

Chemotion icon
Supported at LMU

Chemotion

ELN hosted by the Faculty for Chemistry and Pharmacy.

Protocols.io icon

Protocols.io

Share, discover, cite, and improve research protocols.

Document both the instrument and the administration procedure completely. Use validated instruments when possible, pilot test before deployment, and archive the exact version participants see in an open format.

When data comes from instruments, sensors, or APIs, scripting the acquisition creates a reproducible record of exactly what was collected and how. Programming languages like R or Python work well for straightforward pipelines. For more complex multi-step workflows, common in, e.g., bioinformatics and neuroimaging, workflow managers like Snakemake ensure steps run in the correct order and can resume after failures.


  • Structure data correctly from the start. Variables in columns, observations in rows. This makes your data immediately interoperable with analysis tools rather than requiring cleanup later. Scripts can also automate organization, file renaming, and conversion to open formats. See 2.2 Data Management for guidelines.

  • Keep records of what ran and when. Include error handling so failures are recorded rather than silently corrupting data. When something fails months later, you need to know what happened. Always test acquisition scripts on sample data before production runs. A bug in your collection pipeline can invalidate an entire dataset.

  • Version control your code and data. This makes your methods reproducible and shareable. See 2.2.6. Version Control for details.
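The record-keeping and error-handling advice above can be sketched as a minimal acquisition script. This is an illustration only: `read_sensor` is a hypothetical stand-in for your actual instrument, sensor, or API call, and the CSV layout is an assumed example.

```python
import csv
import logging
from datetime import datetime, timezone

# Keep a persistent record of what ran and when
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def read_sensor(sample_id):
    """Hypothetical placeholder for a real instrument or API call."""
    return {"sample_id": sample_id, "value": 42.0}

def acquire(sample_ids, outfile):
    """Collect one record per sample; log failures instead of losing them silently."""
    n_ok = 0
    with open(outfile, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["timestamp", "sample_id", "value"])
        writer.writeheader()
        for sid in sample_ids:
            try:
                record = read_sensor(sid)
                # Record metadata (here: a timestamp) as collection happens
                record["timestamp"] = datetime.now(timezone.utc).isoformat()
                writer.writerow(record)
                n_ok += 1
            except Exception:
                # The failure is documented with a traceback, not silently skipped
                logging.exception("acquisition failed for sample %s", sid)
    logging.info("acquired %d of %d samples", n_ok, len(sample_ids))
    return n_ok
```

Running such a script on sample data first, as recommended above, lets you confirm the output file and log look right before a production run.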

LEARN MORE

LMU OSC logo
R logo
OSC Tutorial

Introduction to R

Programming fundamentals for research data processing (3h)

LMU OSC logo
Git logo
OSC Tutorial

Introduction to Git

Learn Git basics integrated with RStudio (2h).

2.2 Data Management

In 1. Plan & Design you created a Research Data Management Plan. It’s now time to put this plan into practice, refining it as you learn what actually works for your project.

  • Keep your raw data as read-only files. The unmodified output of your instruments, surveys, or observations should never be modified directly. You should provide enough documentation to recreate your processed datasets and results from your raw data (see 3.1. Data Processing & Analyses).

  • Organize, document, and store your research files so they remain usable, and become FAIR upon sharing. Beyond raw data, you will generate processed data, code, documentation, and metadata. How you organize, describe, and store these determines whether your work remains usable and reproducible. The FAIR principles guide these decisions: making outputs Findable, Accessible, Interoperable, and Reusable.

What follows are general practices. Your domain has specific conventions for file formats, folder structure, and metadata. RDMkit provides detailed guidance organized by research area.

Tip: for research groups, to streamline this process
  • Create standard operating procedures within the team. For instance: where and when data backups should be made and in which file format, what project folder structure and file naming conventions should be adopted, which metadata should routinely be acquired, what documentation should be created and when, and how and where a history of versions should be preserved.
  • Assign responsibilities for developing such SOPs for a specific kind of project, and for checking that all members have implemented them at a specified point in their project (e.g. hold a data quality validation meeting prior to starting data analyses; see the Collect & Manage Checklist).
  • 2.2.1. Storage
  • 2.2.2. Organization
  • 2.2.3. File Formats
  • 2.2.4. Documentation
  • 2.2.5. Standards
  • 2.2.6. Version Control

This section covers storage for data you are actively collecting. Long-term archiving for sharing is covered in 4. Preserve & Share.


  • Use institutional storage. LMU Munich provides storage such as LRZ Sync & Share or LRZ DSS with automated backups, access controls, and GDPR compliance. Additional options vary by department. Contact the Research Data Management team of the University Library to find what is available to you. When choosing, consider how much data you will generate, who needs access, and whether your data includes personal information requiring stricter controls.
  • Follow the 3-2-1 backup rule. Keep three copies on two media types with one off-site. Designate one location as the master copy, the authoritative version everything else syncs from. Working with multiple “equal” copies creates version conflicts. Remember that syncing is not backup: if you delete a file from a synced folder, the deletion propagates everywhere. True backups preserve previous versions independently.
  • Control access from the start. Grant access only to those who need it. Use institutional sharing tools, not email attachments or personal cloud links. For collaborations, agree at the start who can read, who can edit, and who manages permissions. When team members leave, remove their access promptly.
  • Test your backups. A backup you cannot restore is not a backup. Test restoration at least once. Archive inactive data periodically and review access lists when team composition changes.
Important: Avoid for research data

Personal laptops as primary storage, external drives as the only copy, consumer cloud services (Dropbox, Google Drive) for sensitive data, and USB drives except for temporary transport.

LEARN MORE

LRZ logo
Supported at LMU

LRZ Sync & Share

Cloud storage service for LMU Munich

LRZ logo
Supported at LMU

LRZ DSS

Long-term archival storage for LMU Munich

OSF logo

OSF

Research project management platform including storage.

Your folder structure and file naming conventions determine whether you and others can navigate your project months or years later. Establish these conventions at the start of your project and document them. When collaborating, ensure everyone follows the same system.


  • Separate raw from processed data. Raw data should not be modified: once collected, these files should never be touched. All cleaning, data processing, transformations, and analyses happen on copies or through scripts in a separate folder. This preserves your ability to verify results or reprocess from the original source.

  • Develop a file naming convention. Good file names identify contents at a glance and sort correctly. Balance specificity with readability: too many elements make names unwieldy, too few make them ambiguous. Order elements from general to specific.

    • Use underscores or hyphens to separate elements, never spaces or special characters (? ! & * % # @)
    • Use ISO 8601 dates (YYYYMMDD) so files sort chronologically
    • Include version numbers with leading zeros (v01, v02) so v10 sorts after v09
    • Use meaningful abbreviations and document what they mean

A pattern like YYYYMMDD_project_condition_type_v01.ext places files in chronological order while preserving context. For example, 20240315_sleep-study_control_survey_v02.csv immediately tells you when it was created, which project it belongs to, the experimental condition, data type, and revision. Document your convention in a README file stored next to your data files so collaborators can parse filenames without asking.
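The convention above is mechanical enough to encode in a few lines. This sketch (hypothetical helper names, not an OSC tool) builds names following the `YYYYMMDD_project_condition_type_vNN.ext` pattern and parses them back, which is also a cheap way to check that all files in a folder conform:

```python
import re
from datetime import date

def make_filename(project, condition, dtype, version, ext, day=None):
    """Build a name following YYYYMMDD_project_condition_type_vNN.ext."""
    day = day or date.today()
    # Leading zeros on the version (v02) keep v10 sorting after v09
    return f"{day:%Y%m%d}_{project}_{condition}_{dtype}_v{version:02d}.{ext}"

# Elements are separated by underscores, never spaces or special characters
PATTERN = re.compile(
    r"(?P<date>\d{8})_(?P<project>[^_]+)_(?P<condition>[^_]+)"
    r"_(?P<type>[^_]+)_v(?P<version>\d{2})\.(?P<ext>\w+)$"
)

def parse_filename(name):
    """Return the convention's elements, or None if the name does not conform."""
    m = PATTERN.match(name)
    return m.groupdict() if m else None
```

Documenting the same pattern (and the regular expression, if you use one) in your README lets collaborators both read and validate filenames.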


  • Follow domain standards where they exist. Many fields have established organizational conventions that tools and collaborators expect. Using these means your data can be integrated immediately into existing analysis pipelines, and reviewers will recognize the structure. Search RDMkit for standards in your domain.

LEARN MORE

LMU OSC logo
OSC Tutorial

Data Organization

Folder structure and naming conventions. (30 min)

TOOLS & RESOURCES

LMU OSC logo
GitHub logo
OSC Tool

Research Project Template

Project folder separating raw & processed data, code, and outputs.

RDMkit logo

RDMkit

Domain-specific data management standards

File format choices affect who can work with your data now and whether it remains readable in the future. Open formats have publicly documented specifications that anyone can implement, so many programs can read them and they remain accessible even if the original software disappears. Proprietary formats lock you into specific tools, complicate collaboration, and risk becoming unreadable if the company stops supporting them.


  • Keep raw data in its original format. Whatever your instrument or source produces, preserve that original as your ground truth. Even if it is proprietary, you need it for verification and potential reprocessing.

  • Work in open formats. For analysis, convert to open formats like CSV, JSON, or plain text. This makes your workflow reproducible, enables collaboration across different tools, and ensures your data can be shared. If conversion loses important information (metadata, precision, structure), document what is lost and keep both versions.

  • Be careful with spreadsheets. Excel is convenient for data entry but causes real problems. It silently converts data: gene names like MARCH1 become dates, leading zeros in IDs disappear, and long numbers lose precision. Formatting (colors, merged cells) breaks machine-readability since scripts cannot see it. If you use spreadsheets for entry, keep them simple (one header row, one observation per row, no merged cells) and export to CSV immediately. Save CSVs with UTF-8 encoding to avoid character corruption when sharing across systems. For more guidance on spreadsheet best practices, see The Turing Way and UC Davis DataLab.

  • Check domain recommendations. Your field likely has established conventions balancing openness with practical needs like performance or metadata preservation. Consult the RDMKit to find conventions for your field.


Format issues often surface during quality control. The 2.4. Quality Control panel below covers validation checks that can catch encoding problems, unexpected conversions, and structural inconsistencies early.

TOOLS & RESOURCES

RDMkit logo

RDMkit

Domain-specific file formats and conventions

Without documentation, a dataset is just a collection of files. Six months from now, you will not remember what each column means, why certain values are missing, or how files relate to each other. Documentation makes your data usable by your future self, your collaborators, and any other researchers.


  • Create a README file (as .md or .txt) early and update it as you go. Your README is the entry point to your project. Start it when you begin, not when preparing to publish. A good README answers the essential questions: who created the data, what it contains, when and where it was collected, why it was generated, how it was produced, and whether it can be reused. These answers let someone unfamiliar with your project understand and work with your data.
  • Create a data dictionary defining every variable. A data dictionary (or “codebook”) makes your dataset self-explanatory. For each variable, document what it measures, its data type, valid values, units of measurement, and how missing data is coded. Use appropriate missing codes to distinguish why data is absent (declined to answer, not applicable, technical failure) since this distinction matters for analysis.
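A data dictionary can itself be machine-readable, which lets you check it against your dataset automatically. A sketch with two hypothetical variables (names, units, and missing codes are illustrative assumptions, not a standard):

```python
# A hypothetical data dictionary: one entry per variable, documenting
# type, valid values, units, missing-data coding, and meaning
DATA_DICTIONARY = {
    "age": {"type": "integer", "unit": "years", "valid": range(0, 120),
            "missing_code": -999, "description": "Age at enrolment"},
    "smoker": {"type": "categorical", "valid": {"yes", "no"},
               "missing_code": "declined", "description": "Current smoker"},
}

def undocumented_columns(columns, dictionary=DATA_DICTIONARY):
    """Flag dataset columns that the data dictionary does not define."""
    return sorted(set(columns) - set(dictionary))
```

Running such a check whenever the dataset changes keeps the dictionary from drifting out of date.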

LEARN MORE

LMU OSC logo
OSC Tutorial

Principles of Data Documentation

Principles of README files and data dictionaries.

LMU OSC logo
OSC Tutorial

Data Documentation & Validation

Create READMEs, data dictionaries, and validation checks for your data. (1h)

Standards are community agreements on how to organize and describe research data. Using them means others in your field immediately understand your data. Three types of standards matter here:


  • Organizational standards specify how to structure files and folders. Some fields have well-established conventions, like BIDS for neuroimaging data. When such standards exist, use them. Your data will work immediately with existing tools, and collaborators will recognize the structure without explanation. If no standard exists for your domain, create a consistent structure and document it in your README. See for instance our research project template.

  • Reporting guidelines specify what methodological details to document for different study types. The EQUATOR Network maintains a searchable database of guidelines for clinical trials, observational studies, animal research, and many other study types. Following these ensures you capture everything others need to understand or replicate your work.

  • Metadata standards define what descriptive information to record and how to structure it. Scientific metadata describes how your data was produced: equipment specifications, acquisition parameters, protocols followed. This is distinct from discovery metadata (titles, keywords, descriptions) which you will prepare when sharing in 4. Preserve & Share. Your field has conventions for which parameters matter. FAIRsharing catalogs metadata standards by discipline.


Think of your data as a first-class research output. Comprehensive metadata transforms a project artifact into a reusable resource. Someone reanalyzing your data years later needs to understand exactly how it was produced.

TOOLS & RESOURCES

EQUATOR logo

EQUATOR Network

Comprehensive database of reporting guidelines.

FAIRsharing logo

FAIRsharing

Search by discipline to find metadata standards, reporting guidelines, and data policies for your field.

RDMkit logo

RDMkit

Domain-specific metadata standards

LMU OSC logo
GitHub logo
OSC Tool

Research Project Template

Project folder separating raw & processed data, code, and outputs.

A version control system like Git tracks changes to files over time. You can see what changed, when, and why. You can revert to previous versions. Collaborators can work without overwriting each other.


  • Git usually suffices for data files. Text-based formats (CSV, JSON, plain text) and smaller binary files work well in standard Git repositories. You get a complete history of changes and can share easily via GitHub or GitLab.

  • Use specialized tools for large or frequently changing binary files. Standard Git stores each version in full, so repositories become unwieldy with large datasets. Git LFS (Large File Storage) stores large files separately while keeping them tracked. Git-annex manages files across multiple storage locations. DataLad builds on git-annex and works with standard Git workflows.

Note: Difference between Git, GitHub, and GitLab
  • Git is a version control system that tracks changes in text files (e.g. CSV, plain text, R, Python). The Git software is installed, and your Git repositories live, in your local environment (i.e. on your computer, not on a network drive; see Git tutorial).

  • GitHub is the most popular cloud-based platform for software development with Git; it is free to use but proprietary and US-based, and provides collaboration features like pull requests and issues (see GitHub tutorial). You should not store any sensitive data on GitHub, even in a private repository.

  • LRZ GitLab is a cloud-based hosting platform that works much like GitHub but is open source and runs on the LRZ servers for LMU Munich; it can therefore be considered secure when the repository is private.

While your LRZ GitLab account is associated with your LMU Munich affiliation, your GitHub account can be associated with your private email, be included in your CV, and be used for public sharing of your data and code (see 3. Analyze & Collaborate and 4. Preserve & Share).


In a version-controlled workflow, you back up your local Git repositories to either GitHub or LRZ GitLab over a secure SSH connection (see GitHub tutorial) and share access to your repositories with your collaborators through that cloud-based platform.

LEARN MORE

LMU OSC logo
Git logo
OSC Tutorial

Introduction to Git

Learn Git basics integrated with RStudio (2h).

LMU OSC logo
GitHub logo
OSC Tutorial

Introduction to GitHub

Connect to GitHub from Git within RStudio (1h).

TOOLS & RESOURCES

LMU OSC logo
GitLab logo
Supported at LMU

LRZ GitLab

Institutional Git hosting for LMU Munich.

GitLab icon

DataLad

Version control for large datasets

2.3 Ethics & Privacy

Research involving human participants requires ethics approval and data protection compliance. In your ethics proposal (see 1.2.3. Ethics) you planned for safeguards for the people contributing to your research and you now need to implement them before or while collecting your data.

  • 2.3.1. Informed Consent
  • 2.3.2. Anonymization
  • 2.3.3. GDPR Compliance

Participants have the right to understand what they are agreeing to. Your consent form should explain the research purpose in plain language, describe what data you will collect and how you will protect it, specify who will have access and for how long, and make clear that participation is voluntary. See “How to write an informed consent form” from the University of Utrecht and an example template from the LMU Psychology Department.


  • Use tiered consent when you plan to share data. Some participants may consent to their data being used for your study but not shared publicly. Others may be comfortable with broader sharing. Giving options respects autonomy while maximizing what you can eventually share.

  • Store consent forms separately from data. The consent form links a name to participation. Keeping it with your data undermines any pseudonymization you apply.

Tip for research groups:

Maintain a centralized log of all ethics approvals, consent forms, and compliance certifications for easy reference by team members.

TOOLS & RESOURCES

University of Utrecht icon

How to write an informed consent form

University of Utrecht RDM guide.

Anonymization protects privacy and determines what you can share.


  • Remove direct identifiers during collection. Names, addresses, ID numbers, photographs, email addresses. Replace these with codes.

  • Assess indirect identifiers carefully. A combination of age, location, profession, and a rare condition might identify someone even without their name. Timestamps reveal patterns. Free-text responses often contain identifying details participants did not intend to share. Follow our anonymization tutorial (TBA) to learn to evaluate and implement anonymization techniques in R.

  • Generate synthetic data when full anonymization or real data sharing is not possible. Synthetic data is artificially generated data that can serve as a privacy-preserving alternative to sensitive datasets, enabling researchers to reproduce analyses, verify findings, and initiate model development when access to real data is restricted. Sharing synthetic data together with code, rather than sharing no data at all, increases the utility of your research (see 4. Preserve & Share).

Note: Distinction between pseudonymization and anonymization
  • Pseudonymization replaces identifiers with codes while retaining a key that links back to individuals. Pseudonymized data is still personal data under GDPR because re-identification is possible.

  • Anonymization removes all possibility of re-identification. Only truly anonymized data falls outside GDPR scope. Achieving this is harder than it appears, especially with rich datasets.
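The distinction above can be made concrete with a minimal pseudonymization sketch (a hypothetical helper, not an OSC tool): identifiers are replaced with random codes, and a key retains the link back to individuals. As long as that key exists, the data remains personal data under GDPR, which is why the key must be stored separately from the data.

```python
import secrets

def pseudonymize(names):
    """Replace names with random codes; return (coded_list, key).

    The key maps each name to its code. Store it separately from the
    data (see 2.3.1); while it exists the data is only pseudonymized,
    not anonymized, because re-identification remains possible.
    """
    key = {}
    coded = []
    for name in names:
        if name not in key:
            # secrets gives cryptographically strong, non-guessable codes
            key[name] = f"P{secrets.token_hex(4)}"
        coded.append(key[name])
    return coded, key
```

Note that deleting the key is necessary but usually not sufficient for anonymization: indirect identifiers in the data itself must still be assessed, as described above.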

LEARN MORE

LMU OSC logo
OSC Tutorial

TBA: Data Anonymization

Implement data anonymization techniques in R. (X h)

LMU OSC logo
OSC Tutorial

Synthetic Data

Synthetic data creation in R to balance utility and privacy when sharing data. (3h)

Research at LMU Munich must comply with EU data protection regulations. The core principles: have a lawful basis for processing personal data (usually consent or legitimate research interest), use data only for stated purposes, collect only what you need, delete data when you no longer need it, and protect it against unauthorized access.


In practice: document your lawful basis, include data protection language in consent forms, use institutional storage rather than personal cloud services, restrict access to those who need it, and plan when and how you will delete data.


For data protection guidance, contact the LMU Data Protection Officer or the Research Data Management team of the University Library.

2.4 Quality Control

Quality control catches problems before they propagate into your analysis. The practices here ensure your data is trustworthy and your exclusions are defensible.

Define criteria before looking at your data. This prevents unconscious bias in what you keep and exclude, and demonstrates that your decisions are principled rather than convenient (see 1.4. Study Design & Analysis Plan).

  • 2.4.1. Validation
  • 2.4.2. Cleaning
  • 2.4.3. Exclusions

Validation checks whether your data meets specifications. For example, values can be verified to fall within valid ranges (e.g., age ≥ 0), required fields can be checked for missing entries, categorical variables can be restricted to allowed levels (e.g., “yes/no”), dates can be validated for proper format, and cross-variable constraints can be enforced (e.g., discharge date ≥ admission date). Run checks during collection to catch problems immediately, after collection for systematic review, and after any processing to verify transformations worked correctly.


  • Automate what you can. Check that data types are correct, values fall within expected ranges, required fields are populated, and formats are consistent. These checks should run automatically and flag problems for review.

  • Catch what automation misses with manual review. Sample your data and verify it against the source. Inspect outliers to determine whether they are errors or genuine extreme values. Look for suspicious patterns: survey responses that alternate predictably, reaction times that are impossibly fast.
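The automated checks described above are typically a small set of rules run over every record. A sketch for a hypothetical patient record (field names and ranges are illustrative assumptions), covering range, allowed-level, and cross-variable constraints from the examples earlier in this section:

```python
from datetime import date

def validate_record(rec):
    """Return a list of rule violations for one record (empty list = valid)."""
    problems = []
    # Required field present, and value within a valid range (age >= 0)
    if rec.get("age") is None:
        problems.append("age missing")
    elif not (0 <= rec["age"] <= 120):
        problems.append("age out of range")
    # Categorical variable restricted to allowed levels
    if rec.get("consent") not in {"yes", "no"}:
        problems.append("consent not in allowed levels")
    # Cross-variable constraint: discharge date >= admission date
    adm, dis = rec.get("admission"), rec.get("discharge")
    if adm and dis and dis < adm:
        problems.append("discharge before admission")
    return problems
```

Running such checks during collection, after collection, and after each processing step, as recommended above, flags problems while they are still cheap to fix.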

LEARN MORE

LMU OSC logo
OSC Tutorial

Data Documentation & Validation

Create validation rules and automated checks for your research data. (1h)

Data cleaning handles errors, inconsistencies, and missing values. The cardinal rule: never modify your raw data. All cleaning happens on copies.


  • Correct unambiguous errors. Clear typos, obvious data entry mistakes. For ambiguous cases, flag them for review rather than making assumptions. Document your reasoning for every judgment call.

  • Handle missing data consistently. Decide on a coding scheme (NA, -999, blank) and apply it uniformly. When you know why data is missing, record that information. It may matter for analysis.

  • Investigate outliers before acting. An extreme value might be an error, or it might be genuine. Understand the cause before deciding whether to remove, transform, or retain it.

  • Write cleaning as a script. A script documents exactly what you did and lets you reproduce it. Keep a decision log for choices that cannot be automated.
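A cleaning script per the advice above might look like this sketch (hypothetical missing-value codes): it operates on copies of the rows, never the raw input, applies one uniform missing-data coding, and appends every change to a decision log.

```python
# Hypothetical codes that all mean "missing" in the incoming raw data
MISSING_CODES = {"-999", "", "NA", "n/a"}

def clean_rows(raw_rows, log):
    """Return cleaned copies of raw_rows; record every change in `log`."""
    cleaned = []
    for i, row in enumerate(raw_rows):
        row = dict(row)  # work on a copy; the raw input stays untouched
        for col, value in row.items():
            if str(value).strip() in MISSING_CODES:
                row[col] = None  # uniform missing-value coding
                log.append(f"row {i}: {col} set to missing (was {value!r})")
        cleaned.append(row)
    return cleaned
```

Because the script plus its log fully documents the transformation, anyone can rerun it on the read-only raw data and obtain the same cleaned dataset.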

Exclusion criteria specify which data points will be removed from analysis and why. Define these before you see your results (see 1.4. Study Design & Analysis Plan).


  • Review common exclusion criteria. Technical failures (equipment malfunction, incomplete recording), protocol violations (wrong procedure followed, participant did not comply), quality thresholds (too much missing data, failed attention checks), and participant criteria (did not meet stated inclusion criteria).

  • Document everything. Record criteria before analysis begins. Report how many data points were excluded for each criterion in a log with the date and reviewer’s identity. Plan sensitivity analyses comparing results with and without exclusions to show your findings are robust.

Collect & Manage Checklist

Complete this checklist before holding a data validation meeting with members of the project. Not all items are relevant to every field of research or study type.

Throughout Data Collection

Before Moving to Analysis

Download checklist

Ludwig-Maximilians-Universität
LMU Open Science Center

Leopoldstr. 13
80802 München

Contact

  • Prof. Dr. Felix Schönbrodt (Managing Director)
  • Dr. Malika Ihle (Coordinator)
  • OSC team

Join Us

  • Subscribe to our announcement list
  • Become a member
  • Chat with us on Matrix

Imprint | Privacy Policy | Accessibility