Data Documentation and Validation using R

Research data without proper documentation becomes a barrier to reproducibility and collaboration. This tutorial teaches you to document, summarize, and validate your research data using R, focusing on practical skills that make your work more transparent and reusable.

Why Document Your Data?

Good documentation makes your analyses reproducible, eases collaboration, and helps you meet the growing open-research requirements of funding agencies and journals. Well-documented data is also easier for others, and for your future self, to check and reuse.

What You’ll Learn

By the end of this tutorial, you will be able to:

  • Create data dictionaries that clearly describe your variables and datasets (both manually and automatically)
  • Use summary statistics to identify data quality issues and understand your data’s characteristics
  • Implement automated validation workflows to catch errors systematically
  • Generate professional reports that combine documentation, validation results, and code

We’ll use real research data (Palmer Penguins) and modern R packages to build a complete data documentation workflow.

Prerequisites

This tutorial builds on earlier LMU OSC tutorials and assumes you have completed them or are familiar with their content.

You should also have:

  • Basic R and RStudio familiarity
  • Experience working with data frames
  • Understanding of variables, observations, and data types

Key R Packages

Throughout this tutorial, you’ll work with:

  • {datawizard} - Automated data dictionary creation
  • {pointblank} - Comprehensive data validation
  • {dplyr} - Data manipulation
  • {readr} - Reading CSV data files
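Before you start, it can help to check which of these packages are already available in your R library. The snippet below is a minimal sketch and not the setup procedure from Tutorial Setup: it only reports missing packages and installs nothing. The variable names are illustrative.

```r
# Packages named in the list above
pkgs <- c("datawizard", "pointblank", "dplyr", "readr")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All tutorial packages are available.")
}
```

If anything is reported as missing, install it the way Tutorial Setup describes so that your project environment stays consistent.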

Tutorial Structure

1. Tutorial Setup

Set up your R project environment with Quarto, Git, and renv. Install required packages and get familiar with the Palmer Penguins dataset.

2. Data Dictionaries

Learn to create comprehensive variable documentation:

  • Concepts & Manual Creation: Understand what makes a good data dictionary and create one by hand
  • Automated Creation with R: Use the {datawizard} package to generate dictionaries automatically
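To give a first impression of both approaches, here is a small sketch. It assumes a toy data frame with made-up values (`penguins_small` is a hypothetical stand-in, not the tutorial's data file) and hand-written descriptions. The manual part uses only base R; the automated part calls `data_codebook()` from {datawizard} and runs only if that package is installed.

```r
# Toy data with penguin-like columns; the values are illustrative only
penguins_small <- data.frame(
  species        = factor(c("Adelie", "Gentoo", "Chinstrap")),
  bill_length_mm = c(39.1, 46.1, 49.5),
  body_mass_g    = c(3750L, 5000L, NA)
)

# Manual dictionary: one row per variable, descriptions written by hand
dictionary <- data.frame(
  variable    = names(penguins_small),
  type        = vapply(penguins_small, function(x) class(x)[1], character(1)),
  n_missing   = vapply(penguins_small, function(x) sum(is.na(x)), integer(1)),
  description = c(
    "Penguin species",
    "Bill length in millimetres",
    "Body mass in grams"
  ),
  row.names = NULL
)
print(dictionary)

# Automated overview with {datawizard}, if the package is available
if (requireNamespace("datawizard", quietly = TRUE)) {
  print(datawizard::data_codebook(penguins_small))
}
```

The automated codebook can report types, missing values, and value summaries, but it cannot know what a variable means. The `description` column is the part you still have to write yourself.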

3. Data Validation

Implement systematic quality checks to catch errors early:

  • Data Quality Concepts: Understand types of data issues and validation approaches
  • Summary Statistics: Use base R functions to identify problems through descriptive statistics
  • Validation with R Packages: Automate quality checks with {pointblank}
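The two practical steps above can be sketched on a toy example. The data frame `messy`, its values, and the plausible body mass range of 2000 to 7000 g are assumptions made for illustration; they are not taken from the tutorial's messy dataset. The first part uses base R only. The second part expresses the same rules with {pointblank} and runs only if that package is installed (it also uses the native pipe `|>`, which needs R 4.1 or later).

```r
# Toy data with two deliberate problems: an inconsistent label and an outlier
messy <- data.frame(
  species     = c("Adelie", "adelie", "Gentoo", "Chinstrap"),
  body_mass_g = c(3750, 37500, NA, 3800)
)
expected_species <- c("Adelie", "Chinstrap", "Gentoo")

# Step 1: summary statistics with base R
print(summary(messy$body_mass_g))             # extreme maximum and an NA stand out
print(table(messy$species, useNA = "ifany"))  # reveals the lower-case label

# Rows outside the assumed plausible range (which() drops NA comparisons)
implausible <- which(messy$body_mass_g < 2000 | messy$body_mass_g > 7000)
unexpected_species <- setdiff(unique(messy$species), expected_species)

# Step 2: the same rules as an automated {pointblank} validation plan
if (requireNamespace("pointblank", quietly = TRUE)) {
  agent <- pointblank::create_agent(tbl = messy) |>
    pointblank::col_vals_in_set(columns = "species", set = expected_species) |>
    pointblank::col_vals_between(
      columns = "body_mass_g", left = 2000, right = 7000, na_pass = TRUE
    ) |>
    pointblank::col_vals_not_null(columns = "body_mass_g") |>
    pointblank::interrogate()

  # TRUE only if every validation step passed for every row
  print(pointblank::all_passed(agent))
}
```

Summary statistics are good for spotting problems you did not anticipate, while a validation plan re-checks the rules you already know every time the data changes.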

Example Dataset

We’ll use the Palmer Penguins dataset throughout this tutorial. This dataset contains observations of three penguin species collected from islands in the Palmer Archipelago, Antarctica, including measurements of bill dimensions, flipper length, body mass, and other characteristics.

You’ll work with both a clean version (high-quality data) and a messy version (with realistic data quality issues) to practice identifying and fixing common problems.

Tutorial Navigation

Throughout the tutorial, color-coded boxes mark different kinds of content:

Caution: Orange boxes contain information crucial for that topic

These highlight key concepts, important warnings, or critical information you need to understand.

Note: Blue boxes contain excursions to related topics

These provide additional context, connections to other concepts, or deeper explanations for the curious learner.

Tip: Green boxes contain practical tips and guidance

These offer helpful advice, best practices, and shortcuts to make your work easier.

Important: Red boxes contain solutions and are collapsed

Only open these if you want to see the solution! Try exercises on your own first.

Getting Started

Ready to begin? Start with Tutorial Setup to configure your R environment.
