

2. Collect & Manage


Build quality into your data from the start

Data Collection · Data Management · Ethics & Privacy · Quality Control
Checkpoint: data quality validation meeting

2.1 Data Collection

Your data acquisition procedures must be documented in sufficient detail to allow replication by another researcher (see LMU Guidelines for Safeguarding Good Scientific Practice). Establishing reproducible data collection processes strengthens the quality and consistency of both your own data and your team's.

State-of-the-art practices for reproducible data acquisition include:

  • creating standard operating procedures
  • recording metadata as data collection is taking place
  • building in automation through programming
Note: Definition

Metadata are data about data; they provide context to your data. Metadata such as equipment settings, environmental conditions, software versions, and calibration records should be recorded contemporaneously, not reconstructed afterward. Electronic lab notebooks, instrument logs, and automated logging all help to document your metadata.

Tip: for research groups, to streamline this process
  • Create standard data acquisition procedures within the team. From step-by-step wet-lab procedures to measuring-device settings and script-based data pre-processing, every regularly repeated step should be documented and standardized so that all team members can replicate it precisely.
  • Share standard operating procedures through, e.g., common server space such as LRZ Sync & Share, specialized online tools like protocols.io, electronic lab notebooks, or LRZ GitLab for scripts.
  • 2.1.1. Lab Protocols
  • 2.1.2. Questionnaires
  • 2.1.3. Software-based data acquisition

Your protocol should specify materials with identifying details (lot numbers, versions, sources), equipment settings, step-by-step instructions with timing, and expected outcomes at each stage. What counts as “materials” varies by field: reagent concentrations in wet lab work, scanner parameters in neuroimaging, sampling coordinates in field ecology. But the principle is the same: enough detail that someone else could replicate your procedure exactly.


  • Write detailed methods and reusable protocols. Draft your protocol before you start, with all the details needed for an exact replication (see the ReproducibiliTeach lecture on reusable protocols).

  • Track deviations in real time. Follow your protocol precisely, and record any deviations as they happen. When you need to adapt, note it immediately. These deviations often explain unexpected results and guide protocol improvements. Electronic lab notebooks (ELNs) make this easier by creating version-controlled, timestamped records automatically, providing an audit trail that paper cannot match.

  • Publish your protocols. A detailed, tested protocol is a contribution to your field. Publishing establishes priority, enables citation, and makes your methods reusable. Platforms like protocols.io provide version control and DOI assignment.

LEARN MORE

reproducibiliteach icon

Write reusable protocols

Understand the level of detail needed for your protocols.

TOOLS & RESOURCES

eLabFTW icon
Supported at LMU

eLabFTW

Electronic Lab Notebook hosted by LMU Munich.

Chemotion icon
Supported at LMU

Chemotion

ELN hosted by the Faculty for Chemistry and Pharmacy.

Protocols.io icon

Protocols.io

Share, discover, cite, and improve research protocols.

Document both the instrument and the administration procedure completely. Use validated instruments when possible, pilot test before deployment, and archive the exact version participants see in an open format.

When data comes from instruments, sensors, or APIs, scripting the acquisition creates a reproducible record of exactly what was collected and how. Programming languages like R or Python work well for straightforward pipelines. For more complex multi-step workflows, common in, e.g., bioinformatics and neuroimaging, workflow managers like Snakemake ensure steps run in the correct order and can resume after failures.


  • Structure data correctly from the start. Variables in columns, observations in rows. This makes your data immediately interoperable with analysis tools rather than requiring cleanup later. Scripts can also automate organization, file renaming, and conversion to open formats. See 2.2 Data Management for guidelines.

  • Keep records of what ran and when. Include error handling so failures are recorded rather than silently corrupting data. When something fails months later, you need to know what happened. Always test acquisition scripts on sample data before production runs. A bug in your collection pipeline can invalidate an entire dataset.

  • Version control your code and data. This makes your methods reproducible and shareable. See 2.2.6. Version Control for details.
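The record-keeping and error-handling advice above can be sketched as a minimal acquisition script. This is an illustration only: `read_sensor` is a hypothetical stand-in for your actual instrument, sensor, or API call, and the CSV layout is an assumed example.

```python
import csv
import logging
from datetime import datetime, timezone

# Keep a persistent record of what ran and when
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def read_sensor(sample_id):
    """Hypothetical placeholder for a real instrument or API call."""
    return {"sample_id": sample_id, "value": 42.0}

def acquire(sample_ids, outfile):
    """Collect one record per sample; log failures instead of losing them silently."""
    n_ok = 0
    with open(outfile, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["timestamp", "sample_id", "value"])
        writer.writeheader()
        for sid in sample_ids:
            try:
                record = read_sensor(sid)
                # Record metadata (here: a timestamp) as collection happens
                record["timestamp"] = datetime.now(timezone.utc).isoformat()
                writer.writerow(record)
                n_ok += 1
            except Exception:
                # The failure is documented with a traceback, not silently skipped
                logging.exception("acquisition failed for sample %s", sid)
    logging.info("acquired %d of %d samples", n_ok, len(sample_ids))
    return n_ok
```

Running such a script on sample data first, as recommended above, lets you confirm the output file and log look right before a production run.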

LEARN MORE

LMU OSC logo
R logo
OSC Tutorial

Introduction to R

Programming fundamentals for research data processing (3h)

LMU OSC logo
Git logo
OSC Tutorial

Introduction to Git

Learn Git basics integrated with RStudio (2h).

2.2 Data Management

In 1. Plan & Design you created a Research Data Management Plan. It’s now time to put this plan into practice, refining it as you learn what actually works for your project.

  • Keep your raw data as read-only files. The unmodified output of your instruments, surveys, or observations should never be modified directly. You should provide enough documentation to recreate your processed datasets and results from your raw data (see 3.1. Data Processing & Analyses).

  • Organize, document, and store your research files so they remain usable, and become FAIR upon sharing. Beyond raw data, you will generate processed data, code, documentation, and metadata. How you organize, describe, and store these determines whether your work remains usable and reproducible. The FAIR principles guide these decisions: making outputs Findable, Accessible, Interoperable, and Reusable.

What follows are general practices. Your domain has specific conventions for file formats, folder structure, and metadata. RDMkit provides detailed guidance organized by research area.

Tip: for research groups, to streamline this process
  • Create standard operating procedures within the team. For instance: where and when data backups should be made and in which file format, what project folder structure and file naming conventions should be adopted, which metadata should routinely be acquired, what documentation should be created and when, and how and where a history of versions should be preserved.
  • Assign responsibilities for developing such SOPs for a specific kind of project, and for checking that all members have implemented them at a specified point in their project (e.g. hold a data quality validation meeting prior to starting data analyses; see the Collect & Manage Checklist).
  • 2.2.1. Storage
  • 2.2.2. Organization
  • 2.2.3. File Formats
  • 2.2.4. Documentation
  • 2.2.5. Standards
  • 2.2.6. Version Control

This section covers storage for data you are actively collecting. Long-term archiving for sharing is covered in 4. Preserve & Share.


  • Use institutional storage. LMU Munich provides storage such as LRZ Sync & Share or LRZ DSS with automated backups, access controls, and GDPR compliance. Additional options vary by department. Contact the Research Data Management team of the University Library to find what is available to you. When choosing, consider how much data you will generate, who needs access, and whether your data includes personal information requiring stricter controls.
  • Follow the 3-2-1 backup rule. Keep three copies on two media types with one off-site. Designate one location as the master copy, the authoritative version everything else syncs from. Working with multiple “equal” copies creates version conflicts. Remember that syncing is not backup: if you delete a file from a synced folder, the deletion propagates everywhere. True backups preserve previous versions independently.
  • Control access from the start. Grant access only to those who need it. Use institutional sharing tools, not email attachments or personal cloud links. For collaborations, agree at the start who can read, who can edit, and who manages permissions. When team members leave, remove their access promptly.
  • Test your backups. A backup you cannot restore is not a backup. Test restoration at least once. Archive inactive data periodically and review access lists when team composition changes.
Important: Avoid for research data

Personal laptops as primary storage, external drives as the only copy, consumer cloud services (Dropbox, Google Drive) for sensitive data, and USB drives except for temporary transport.

LEARN MORE

LRZ logo
Supported at LMU

LRZ Sync & Share

Cloud storage service for LMU Munich

LRZ logo
Supported at LMU

LRZ DSS

Long-term archival storage for LMU Munich

OSF logo

OSF

Research project management platform including storage.

Your folder structure and file naming conventions determine whether you and others can navigate your project months or years later. Establish these conventions at the start of your project and document them. When collaborating, ensure everyone follows the same system.


  • Separate raw from processed data. Raw data should not be modified: once collected, these files should never be touched. All cleaning, data processing, transformations, and analyses happen on copies or through scripts in a separate folder. This preserves your ability to verify results or reprocess from the original source.

  • Develop a file naming convention. Good file names identify contents at a glance and sort correctly. Balance specificity with readability: too many elements make names unwieldy, too few make them ambiguous. Order elements from general to specific.

    • Use underscores or hyphens to separate elements, never spaces or special characters (? ! & * % # @)
    • Use ISO 8601 dates (YYYYMMDD) so files sort chronologically
    • Include version numbers with leading zeros (v01, v02) so v10 sorts after v09
    • Use meaningful abbreviations and document what they mean

A pattern like YYYYMMDD_project_condition_type_v01.ext places files in chronological order while preserving context. For example, 20240315_sleep-study_control_survey_v02.csv immediately tells you when it was created, which project it belongs to, the experimental condition, data type, and revision. Document your convention in a README file stored next to your data files so collaborators can parse filenames without asking.
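The convention above is mechanical enough to encode in a few lines. This sketch (hypothetical helper names, not an OSC tool) builds names following the `YYYYMMDD_project_condition_type_vNN.ext` pattern and parses them back, which is also a cheap way to check that all files in a folder conform:

```python
import re
from datetime import date

def make_filename(project, condition, dtype, version, ext, day=None):
    """Build a name following YYYYMMDD_project_condition_type_vNN.ext."""
    day = day or date.today()
    # Leading zeros on the version (v02) keep v10 sorting after v09
    return f"{day:%Y%m%d}_{project}_{condition}_{dtype}_v{version:02d}.{ext}"

# Elements are separated by underscores, never spaces or special characters
PATTERN = re.compile(
    r"(?P<date>\d{8})_(?P<project>[^_]+)_(?P<condition>[^_]+)"
    r"_(?P<type>[^_]+)_v(?P<version>\d{2})\.(?P<ext>\w+)$"
)

def parse_filename(name):
    """Return the convention's elements, or None if the name does not conform."""
    m = PATTERN.match(name)
    return m.groupdict() if m else None
```

Documenting the same pattern (and the regular expression, if you use one) in your README lets collaborators both read and validate filenames.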


  • Follow domain standards where they exist. Many fields have established organizational conventions that tools and collaborators expect. Using these means your data can be integrated immediately into existing analysis pipelines, and reviewers will recognize the structure. Search RDMkit for standards in your domain.

LEARN MORE

LMU OSC logo
OSC Tutorial

Data Organization

Folder structure and naming conventions. (30 min)

TOOLS & RESOURCES

LMU OSC logo
GitHub logo
OSC Tool

Research Project Template

Project folder separating raw & processed data, code, and outputs.

RDMkit logo

RDMkit

Domain-specific data management standards

File format choices affect who can work with your data now and whether it remains readable in the future. Open formats have publicly documented specifications that anyone can implement, so many programs can read them and they remain accessible even if the original software disappears. Proprietary formats lock you into specific tools, complicate collaboration, and risk becoming unreadable if the company stops supporting them.


  • Keep raw data in its original format. Whatever your instrument or source produces, preserve that original as your ground truth. Even if it is proprietary, you need it for verification and potential reprocessing.

  • Work in open formats. For analysis, convert to open formats like CSV, JSON, or plain text. This makes your workflow reproducible, enables collaboration across different tools, and ensures your data can be shared. If conversion loses important information (metadata, precision, structure), document what is lost and keep both versions.

  • Be careful with spreadsheets. Excel is convenient for data entry but causes real problems. It silently converts data: gene names like MARCH1 become dates, leading zeros in IDs disappear, and long numbers lose precision. Formatting (colors, merged cells) breaks machine-readability since scripts cannot see it. If you use spreadsheets for entry, keep them simple (one header row, one observation per row, no merged cells) and export to CSV immediately. Save CSVs with UTF-8 encoding to avoid character corruption when sharing across systems. For more guidance on spreadsheet best practices, see The Turing Way and UC Davis DataLab.

  • Check domain recommendations. Your field likely has established conventions balancing openness with practical needs like performance or metadata preservation. Consult the RDMKit to find conventions for your field.


Format issues often surface during quality control. The 2.4. Quality Control panel below covers validation checks that can catch encoding problems, unexpected conversions, and structural inconsistencies early.

TOOLS & RESOURCES

RDMkit logo

RDMkit

Domain-specific file formats and conventions

Without documentation, a dataset is just a collection of files. Six months from now, you will not remember what each column means, why certain values are missing, or how files relate to each other. Documentation makes your data usable by your future self, your collaborators, and any other researchers.


  • Create a README file (as .md or .txt) early and update it as you go. Your README is the entry point to your project. Start it when you begin, not when preparing to publish. A good README answers the essential questions: who created the data, what it contains, when and where it was collected, why it was generated, how it was produced, and whether it can be reused. These answers let someone unfamiliar with your project understand and work with your data.
  • Create a data dictionary defining every variable. A data dictionary (or “codebook”) makes your dataset self-explanatory. For each variable, document what it measures, its data type, valid values, units of measurement, and how missing data is coded. Use appropriate missing codes to distinguish why data is absent (declined to answer, not applicable, technical failure) since this distinction matters for analysis.
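A data dictionary can itself be machine-readable, which lets you check it against your dataset automatically. A sketch with two hypothetical variables (names, units, and missing codes are illustrative assumptions, not a standard):

```python
# A hypothetical data dictionary: one entry per variable, documenting
# type, valid values, units, missing-data coding, and meaning
DATA_DICTIONARY = {
    "age": {"type": "integer", "unit": "years", "valid": range(0, 120),
            "missing_code": -999, "description": "Age at enrolment"},
    "smoker": {"type": "categorical", "valid": {"yes", "no"},
               "missing_code": "declined", "description": "Current smoker"},
}

def undocumented_columns(columns, dictionary=DATA_DICTIONARY):
    """Flag dataset columns that the data dictionary does not define."""
    return sorted(set(columns) - set(dictionary))
```

Running such a check whenever the dataset changes keeps the dictionary from drifting out of date.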

LEARN MORE

LMU OSC logo
OSC Tutorial

Principles of Data Documentation

Principles of README files and data dictionaries.

LMU OSC logo
OSC Tutorial

Data Documentation & Validation

Create READMEs, data dictionaries, and validation checks for your data. (1h)

Standards are community agreements on how to organize and describe research data. Using them means others in your field immediately understand your data. Three types of standards matter here:


  • Organizational standards specify how to structure files and folders. Some fields have well-established conventions, like BIDS for neuroimaging data. When such standards exist, use them. Your data will work immediately with existing tools, and collaborators will recognize the structure without explanation. If no standard exists for your domain, create a consistent structure and document it in your README. See for instance our research project template.

  • Reporting guidelines specify what methodological details to document for different study types. The EQUATOR Network maintains a searchable database of guidelines for clinical trials, observational studies, animal research, and many other study types. Following these ensures you capture everything others need to understand or replicate your work.

  • Metadata standards define what descriptive information to record and how to structure it. Scientific metadata describes how your data was produced: equipment specifications, acquisition parameters, protocols followed. This is distinct from discovery metadata (titles, keywords, descriptions) which you will prepare when sharing in 4. Preserve & Share. Your field has conventions for which parameters matter. FAIRsharing catalogs metadata standards by discipline.


Think of your data as a first-class research output. Comprehensive metadata transforms a project artifact into a reusable resource. Someone reanalyzing your data years later needs to understand exactly how it was produced.

TOOLS & RESOURCES

EQUATOR logo

EQUATOR Network

Comprehensive database of reporting guidelines.

FAIRsharing logo

FAIRsharing

Search by discipline to find metadata standards, reporting guidelines, and data policies for your field.

RDMkit logo

RDMkit

Domain-specific metadata standards

LMU OSC logo
GitHub logo
OSC Tool

Research Project Template

Project folder separating raw & processed data, code, and outputs.

A version control system like Git tracks changes to files over time. You can see what changed, when, and why. You can revert to previous versions. Collaborators can work without overwriting each other.


  • Git usually suffices for data files. Text-based formats (CSV, JSON, plain text) and smaller binary files work well in standard Git repositories. You get a complete history of changes and can share easily via GitHub or GitLab.

  • Use specialized tools for large or frequently changing binary files. Standard Git stores each version in full, so repositories become unwieldy with large datasets. Git LFS (Large File Storage) stores large files separately while keeping them tracked. Git-annex manages files across multiple storage locations. DataLad builds on git-annex and works with standard Git workflows.

Note: Difference between Git, GitHub, and GitLab
  • Git is a version control system that tracks changes in text files (e.g. CSV, plain text, R, Python). The Git software is installed, and your Git repositories live, in your local environment (i.e. on your computer, not on a network drive; see Git tutorial).

  • GitHub is the most popular cloud-based platform for software development with Git; it is free to use but proprietary and US-based, and provides collaboration features like pull requests and issues (see GitHub tutorial). You should not store any sensitive data on GitHub, even in a private repository.

  • LRZ GitLab is a cloud-based hosting platform that works much like GitHub but is open source and runs on the LRZ servers for LMU Munich; it can therefore be considered secure when the repository is private.

While your LRZ GitLab account is associated with your LMU Munich affiliation, your GitHub account can be associated with your private email, be included in your CV, and be used for public sharing of your data and code (see 3. Analyze & Collaborate and 4. Preserve & Share).


In a version-controlled workflow, you back up your local Git repositories to either GitHub or LRZ GitLab over a secure SSH connection (see GitHub tutorial) and share access to your repositories with your collaborators through that cloud-based platform.

LEARN MORE

LMU OSC logo
Git logo
OSC Tutorial

Introduction to Git

Learn Git basics integrated with RStudio (2h).

LMU OSC logo
GitHub logo
OSC Tutorial

Introduction to GitHub

Connect to GitHub from Git within RStudio (1h).

TOOLS & RESOURCES

LMU OSC logo
GitLab logo
Supported at LMU

LRZ GitLab

Institutional Git hosting for LMU Munich.

GitLab icon

DataLad

Version control for large datasets

2.3 Ethics & Privacy

Research involving human participants requires ethics approval and data protection compliance. In your ethics proposal (see 1.2.3. Ethics) you planned for safeguards for the people contributing to your research and you now need to implement them before or while collecting your data.

  • 2.3.1. Informed Consent
  • 2.3.2. Anonymization
  • 2.3.3. GDPR Compliance

Participants have the right to understand what they are agreeing to. Your consent form should explain the research purpose in plain language, describe what data you will collect and how you will protect it, specify who will have access and for how long, and make clear that participation is voluntary. See “How to write an informed consent form” from the University of Utrecht and an example template from the LMU Psychology Department.


  • Use tiered consent when you plan to share data. Some participants may consent to their data being used for your study but not shared publicly. Others may be comfortable with broader sharing. Giving options respects autonomy while maximizing what you can eventually share.

  • Store consent forms separately from data. The consent form links a name to participation. Keeping it with your data undermines any pseudonymization you apply.

Tip for research groups:

Maintain a centralized log of all ethics approvals, consent forms, and compliance certifications for easy reference by team members.

TOOLS & RESOURCES

University of Utrecht icon

How to write an informed consent form

University of Utrecht RDM guide.

Anonymization protects privacy and determines what you can share.


  • Remove direct identifiers during collection. Names, addresses, ID numbers, photographs, email addresses. Replace these with codes.

  • Assess indirect identifiers carefully. A combination of age, location, profession, and a rare condition might identify someone even without their name. Timestamps reveal patterns. Free-text responses often contain identifying details participants did not intend to share. Follow our anonymization tutorial (TBA) to learn to evaluate and implement anonymization techniques in R.

  • Generate synthetic data when full anonymization or real data sharing is not possible. Synthetic data is artificially generated data that can serve as a privacy-preserving alternative to sensitive datasets, enabling researchers to reproduce analyses, verify findings, and initiate model development when access to real data is restricted. Sharing synthetic data together with code, rather than sharing no data at all, increases the utility of your research (see 4. Preserve & Share).

Note: Distinction between pseudonymization and anonymization
  • Pseudonymization replaces identifiers with codes while retaining a key that links back to individuals. Pseudonymized data is still personal data under GDPR because re-identification is possible.

  • Anonymization removes all possibility of re-identification. Only truly anonymized data falls outside GDPR scope. Achieving this is harder than it appears, especially with rich datasets.
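The distinction above can be made concrete with a minimal pseudonymization sketch (a hypothetical helper, not an OSC tool): identifiers are replaced with random codes, and a key retains the link back to individuals. As long as that key exists, the data remains personal data under GDPR, which is why the key must be stored separately from the data.

```python
import secrets

def pseudonymize(names):
    """Replace names with random codes; return (coded_list, key).

    The key maps each name to its code. Store it separately from the
    data (see 2.3.1); while it exists the data is only pseudonymized,
    not anonymized, because re-identification remains possible.
    """
    key = {}
    coded = []
    for name in names:
        if name not in key:
            # secrets gives cryptographically strong, non-guessable codes
            key[name] = f"P{secrets.token_hex(4)}"
        coded.append(key[name])
    return coded, key
```

Note that deleting the key is necessary but usually not sufficient for anonymization: indirect identifiers in the data itself must still be assessed, as described above.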

LEARN MORE

LMU OSC logo
OSC Tutorial

TBA: Data Anonymization

Implement data anonymization techniques in R. (X h)

LMU OSC logo
OSC Tutorial

Synthetic Data

Synthetic data creation in R to balance utility and privacy when sharing data. (3h)

Research at LMU Munich must comply with EU data protection regulations. The core principles: have a lawful basis for processing personal data (usually consent or legitimate research interest), use data only for stated purposes, collect only what you need, delete data when you no longer need it, and protect it against unauthorized access.


In practice: document your lawful basis, include data protection language in consent forms, use institutional storage rather than personal cloud services, restrict access to those who need it, and plan when and how you will delete data.


For data protection guidance, contact the LMU Data Protection Officer or the Research Data Management team of the University Library.

2.4 Quality Control

Quality control catches problems before they propagate into your analysis. The practices here ensure your data is trustworthy and your exclusions are defensible.

Define criteria before looking at your data. This prevents unconscious bias in what you keep and exclude, and demonstrates that your decisions are principled rather than convenient (see 1.4. Study Design & Analysis Plan).

  • 2.4.1. Validation
  • 2.4.2. Cleaning
  • 2.4.3. Exclusions

Validation checks whether your data meets specifications. For example, values can be verified to fall within valid ranges (e.g., age ≥ 0), required fields can be checked for missing entries, categorical variables can be restricted to allowed levels (e.g., “yes/no”), dates can be validated for proper format, and cross-variable constraints can be enforced (e.g., discharge date ≥ admission date). Run checks during collection to catch problems immediately, after collection for systematic review, and after any processing to verify transformations worked correctly.


  • Automate what you can. Check that data types are correct, values fall within expected ranges, required fields are populated, and formats are consistent. These checks should run automatically and flag problems for review.

  • Catch what automation misses with manual review. Sample your data and verify it against the source. Inspect outliers to determine whether they are errors or genuine extreme values. Look for suspicious patterns: survey responses that alternate predictably, reaction times that are impossibly fast.
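The automated checks described above are typically a small set of rules run over every record. A sketch for a hypothetical patient record (field names and ranges are illustrative assumptions), covering range, allowed-level, and cross-variable constraints from the examples earlier in this section:

```python
from datetime import date

def validate_record(rec):
    """Return a list of rule violations for one record (empty list = valid)."""
    problems = []
    # Required field present, and value within a valid range (age >= 0)
    if rec.get("age") is None:
        problems.append("age missing")
    elif not (0 <= rec["age"] <= 120):
        problems.append("age out of range")
    # Categorical variable restricted to allowed levels
    if rec.get("consent") not in {"yes", "no"}:
        problems.append("consent not in allowed levels")
    # Cross-variable constraint: discharge date >= admission date
    adm, dis = rec.get("admission"), rec.get("discharge")
    if adm and dis and dis < adm:
        problems.append("discharge before admission")
    return problems
```

Running such checks during collection, after collection, and after each processing step, as recommended above, flags problems while they are still cheap to fix.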

LEARN MORE

LMU OSC logo
OSC Tutorial

Data Documentation & Validation

Create validation rules and automated checks for your research data. (1h)

Data cleaning handles errors, inconsistencies, and missing values. The cardinal rule: never modify your raw data. All cleaning happens on copies.


  • Correct unambiguous errors. Clear typos, obvious data entry mistakes. For ambiguous cases, flag them for review rather than making assumptions. Document your reasoning for every judgment call.

  • Handle missing data consistently. Decide on a coding scheme (NA, -999, blank) and apply it uniformly. When you know why data is missing, record that information. It may matter for analysis.

  • Investigate outliers before acting. An extreme value might be an error, or it might be genuine. Understand the cause before deciding whether to remove, transform, or retain it.

  • Write cleaning as a script. A script documents exactly what you did and lets you reproduce it. Keep a decision log for choices that cannot be automated.
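A cleaning script per the advice above might look like this sketch (hypothetical missing-value codes): it operates on copies of the rows, never the raw input, applies one uniform missing-data coding, and appends every change to a decision log.

```python
# Hypothetical codes that all mean "missing" in the incoming raw data
MISSING_CODES = {"-999", "", "NA", "n/a"}

def clean_rows(raw_rows, log):
    """Return cleaned copies of raw_rows; record every change in `log`."""
    cleaned = []
    for i, row in enumerate(raw_rows):
        row = dict(row)  # work on a copy; the raw input stays untouched
        for col, value in row.items():
            if str(value).strip() in MISSING_CODES:
                row[col] = None  # uniform missing-value coding
                log.append(f"row {i}: {col} set to missing (was {value!r})")
        cleaned.append(row)
    return cleaned
```

Because the script plus its log fully documents the transformation, anyone can rerun it on the read-only raw data and obtain the same cleaned dataset.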

Exclusion criteria specify which data points will be removed from analysis and why. Define these before you see your results (see 1.4. Study Design & Analysis Plan).


  • Review common exclusion criteria. Technical failures (equipment malfunction, incomplete recording), protocol violations (wrong procedure followed, participant did not comply), quality thresholds (too much missing data, failed attention checks), and participant criteria (did not meet stated inclusion criteria).

  • Document everything. Record criteria before analysis begins. Report how many data points were excluded for each criterion in a log with the date and reviewer’s identity. Plan sensitivity analyses comparing results with and without exclusions to show your findings are robust.

Collect & Manage Checklist

Complete this checklist before holding a data validation meeting with members of the project. Not all items are relevant to every field of research or study type.

Throughout Data Collection

Before Moving to Analysis

Download checklist

Ludwig-Maximilians-Universität
LMU Open Science Center

Leopoldstr. 13
80802 München

Contact

  • Prof. Dr. Felix Schönbrodt (Managing Director)
  • Dr. Malika Ihle (Coordinator)
  • OSC team

Join Us

  • Subscribe to our announcement list
  • Become a member
  • Chat with us on Matrix

Imprint | Privacy Policy | Accessibility