Data Documentation and Validation using R

Research data without proper documentation becomes a barrier to reproducibility and collaboration. This tutorial teaches you to document, summarize, and validate your research data using R, focusing on practical skills that make your work more transparent and reusable.

Why Document Your Data?

Good documentation enables reproducibility, facilitates collaboration, and meets the growing requirements of funding agencies and journals for open research practices. It is an essential pillar of good research practice and helps your research have greater impact.

What You’ll Learn

By the end of this tutorial, you will be able to:

Create data dictionaries that clearly describe your variables and datasets (both manually and automatically)
Use summary statistics to identify data quality issues and understand your data’s characteristics
Implement automated validation workflows to catch errors systematically
Generate professional reports that combine documentation, validation results, and code

We’ll use real research data (Palmer Penguins) and modern R packages to build a complete data documentation workflow.

Prerequisites

This tutorial assumes you have completed (or are familiar with) the following LMU OSC tutorials:

You should also have:

Basic R and RStudio familiarity
Experience working with data frames
Understanding of variables, observations, and data types

Tutorial Structure

1. Tutorial Setup

Set up your R project environment with Quarto. Install required packages and get familiar with the Palmer Penguins dataset.

2. Data Dictionaries

Learn to create comprehensive variable documentation:

Concepts & Manual Creation: Understand what makes a good data dictionary and create one by hand
Automated Creation with R: Use the {datawizard} package to generate dictionaries automatically
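As a rough preview of the idea, a data dictionary can be sketched in base R as one row per variable, combining properties read from the data with descriptions written by hand. This is only an illustration, not the {datawizard} workflow the tutorial covers: the helper `make_dictionary()` and the `toy_penguins` data frame are hypothetical, and the values are made up (only the column names are modeled on Palmer Penguins).

```r
# Toy data: column names modeled on Palmer Penguins, values invented for illustration
toy_penguins <- data.frame(
  species = factor(c("Adelie", "Gentoo", "Chinstrap", "Adelie")),
  bill_length_mm = c(39.1, 46.5, NA, 40.3),
  body_mass_g = c(3750L, 5200L, 3800L, 3250L)
)

# Hand-written descriptions: the part of a dictionary that cannot be derived from the data
descriptions <- c(
  species = "Penguin species",
  bill_length_mm = "Bill length in millimeters",
  body_mass_g = "Body mass in grams"
)

# Hypothetical helper: build one dictionary row per variable
make_dictionary <- function(data, descriptions) {
  data.frame(
    variable = names(data),
    type = vapply(data, function(x) class(x)[1], character(1)),
    n_missing = vapply(data, function(x) sum(is.na(x)), integer(1)),
    n_unique = vapply(data, function(x) length(unique(x[!is.na(x)])), integer(1)),
    description = unname(descriptions[names(data)]),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

dictionary <- make_dictionary(toy_penguins, descriptions)
print(dictionary)
```

Keeping the descriptions in a named vector means a variable without a description shows up as `NA` in the dictionary, which makes undocumented columns easy to spot.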

3. Data Validation

Implement systematic quality checks to catch errors early:

Data Quality Concepts: Understand types of data issues and validation approaches
Summary Statistics: Use base R functions to identify problems through descriptive statistics, create validation functions, and integrate checks into living documentation
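To give a flavor of what such checks might look like, the sketch below uses base R summaries to surface problems and then wraps a few rules in a validation function. Everything here is an assumption for illustration: the `messy` data frame contains invented values, `validate_penguins()` is a hypothetical helper, and the body mass bounds (2000 to 7000 g) are placeholder thresholds, not limits taken from the tutorial.

```r
# Toy "messy" data: values invented to show typical data quality problems
messy <- data.frame(
  species = c("Adelie", "adelie", "Gentoo", "Chinstrap"),
  flipper_length_mm = c(181, 195, -210, NA),
  body_mass_g = c(3750, 3800, 5000, 37000),
  stringsAsFactors = FALSE
)

# Descriptive statistics often reveal problems at a glance
summary(messy$flipper_length_mm)  # a negative minimum is suspicious
colSums(is.na(messy))             # missing values per column
table(messy$species)              # inconsistent spelling shows up as extra levels

# Hypothetical validation function: one logical vector per rule, TRUE means "passes"
validate_penguins <- function(data) {
  allowed_species <- c("Adelie", "Chinstrap", "Gentoo")
  checks <- list(
    species_known = data$species %in% allowed_species,
    flipper_not_missing = !is.na(data$flipper_length_mm),
    # Missing values are handled by the rule above, so they pass here
    flipper_positive = is.na(data$flipper_length_mm) | data$flipper_length_mm > 0,
    # Assumed plausibility bounds, chosen only for this example
    mass_plausible = is.na(data$body_mass_g) |
      (data$body_mass_g >= 2000 & data$body_mass_g <= 7000)
  )
  data.frame(
    check = names(checks),
    n_failed = vapply(checks, function(ok) sum(!ok), integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

report <- validate_penguins(messy)
print(report)
```

Returning the results as a data frame, rather than stopping at the first failure, lets the same report be printed in a rendered document each time the data changes.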

Example Dataset

We’ll use the Palmer Penguins dataset throughout this tutorial. This dataset contains observations of three penguin species collected from islands in the Palmer Archipelago, Antarctica, including measurements of bill dimensions, flipper length, body mass, and other characteristics.

You’ll work with both a clean version (high-quality data) and a messy version (with realistic data quality issues) to practice identifying and fixing common problems.

Tutorial Navigation

Each topic is accompanied by boxes that are color-coded by content type:

Orange boxes contain information crucial for that topic

These highlight key concepts, important warnings, or critical information you need to understand.

Blue boxes contain excursions to related topics

These provide additional context, connections to other concepts, or deeper explanations for the curious learner.

Green boxes contain practical tips and guidance

These offer helpful advice, best practices, and shortcuts to make your work easier.

Red boxes contain solutions and are collapsed by default

Only open these if you want to see the solution! Try exercises on your own first.

Getting Started

Ready to begin? Start with Tutorial Setup to configure your R environment.