Data Documentation and Validation using R
Research data without proper documentation becomes a barrier to reproducibility and collaboration. This tutorial teaches you to document, summarize, and validate your research data using R, focusing on practical skills that make your work more transparent and reusable.
Why Document Your Data?
Good documentation enables reproducibility, facilitates collaboration, and meets increasing requirements from funding agencies and journals for open research practices. Well-documented data is simply better science.
What You’ll Learn
By the end of this tutorial, you will be able to:
- Create data dictionaries that clearly describe your variables and datasets (both manually and automatically)
- Use summary statistics to identify data quality issues and understand your data’s characteristics
- Implement automated validation workflows to catch errors systematically
- Generate professional reports that combine documentation, validation results, and code
We’ll use real research data (Palmer Penguins) and modern R packages to build a complete data documentation workflow.
Prerequisites
This tutorial assumes you have completed (or are familiar with) the following LMU OSC tutorials:
You should also have:
- Basic R and RStudio familiarity
- Experience working with data frames
- Understanding of variables, observations, and data types
Key R Packages
Throughout this tutorial, you’ll work with:
- {datawizard}: Automated data dictionary creation
- {pointblank}: Comprehensive data validation
- {dplyr}: Data manipulation
- {readr}: Reading CSV data files
Tutorial Structure
1. Tutorial Setup
Set up your R project environment with Quarto, Git, and renv. Install required packages and get familiar with the Palmer Penguins dataset.
2. Data Dictionaries
Learn to create comprehensive variable documentation:
- Concepts & Manual Creation: Understand what makes a good data dictionary and create one by hand
- Automated Creation with R: Use the {datawizard} package to generate dictionaries automatically
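To give a feel for the two approaches, here is a minimal sketch rather than the tutorial's actual code. It assumes a small made-up data frame, `penguins_toy`, with penguin-like columns (not the real dataset), and the variable descriptions are illustrative. The manual dictionary uses only base R. The automated part calls `data_codebook()` from {datawizard} and is skipped if the package is not installed.

```r
# Hypothetical toy data with penguin-like columns (not the real dataset)
penguins_toy <- data.frame(
  species = factor(c("Adelie", "Gentoo", "Chinstrap", "Adelie")),
  bill_length_mm = c(39.1, 46.1, NA, 40.3),
  body_mass_g = c(3750L, 5000L, 3800L, 3250L)
)

# Manual approach: one row per variable, with hand-written descriptions
manual_dictionary <- data.frame(
  variable = names(penguins_toy),
  type = unname(vapply(penguins_toy, function(x) class(x)[1], character(1))),
  description = c(
    "Penguin species",
    "Bill length in millimetres",
    "Body mass in grams"
  ),
  n_missing = unname(vapply(penguins_toy, function(x) sum(is.na(x)), integer(1)))
)
print(manual_dictionary)

# Automated approach: only runs if {datawizard} is installed
if (requireNamespace("datawizard", quietly = TRUE)) {
  print(datawizard::data_codebook(penguins_toy))
}
```

The manual table is where you record meaning (units, definitions) that no function can infer. The automated codebook adds types, missing-value counts, and value summaries that are tedious to maintain by hand.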
3. Data Validation
Implement systematic quality checks to catch errors early:
- Data Quality Concepts: Understand types of data issues and validation approaches
- Summary Statistics: Use base R functions to identify problems through descriptive statistics
- Validation with R Packages: Automate quality checks with {pointblank}
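A rule-based check might look roughly like the sketch below. The toy data, the planted errors, and the 2000 to 7000 gram range are all assumptions made for illustration, not values from the tutorial. The first half expresses two rules in base R. The second half states the same rules with {pointblank} and is skipped if the package is not installed.

```r
# Hypothetical toy data; the negative mass and the lowercase species are planted errors
penguins_toy <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap", "adelie"),
  body_mass_g = c(3750, 5000, -3800, 3250),
  stringsAsFactors = FALSE
)

valid_species <- c("Adelie", "Gentoo", "Chinstrap")

# Base R version: each rule yields one TRUE/FALSE per row
rule_species <- penguins_toy$species %in% valid_species
rule_mass <- !is.na(penguins_toy$body_mass_g) &
  penguins_toy$body_mass_g >= 2000 &
  penguins_toy$body_mass_g <= 7000  # assumed plausible range, in grams

# Rows that break at least one rule
failing_rows <- which(!(rule_species & rule_mass))
print(penguins_toy[failing_rows, ])

# The same two rules with {pointblank}; only runs if the package is installed
if (requireNamespace("pointblank", quietly = TRUE)) {
  agent <- pointblank::create_agent(tbl = penguins_toy)
  agent <- pointblank::col_vals_in_set(agent, columns = "species", set = valid_species)
  agent <- pointblank::col_vals_between(
    agent, columns = "body_mass_g", left = 2000, right = 7000
  )
  agent <- pointblank::interrogate(agent)
  print(pointblank::all_passed(agent))
}
```

Writing the rules down once, instead of eyeballing the data, means the same checks can be rerun every time the data changes.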
Example Dataset
We’ll use the Palmer Penguins dataset throughout this tutorial. This dataset contains observations of three penguin species collected from islands in the Palmer Archipelago, Antarctica, including measurements of bill dimensions, flipper length, body mass, and other characteristics.
You’ll work with both a clean version (high-quality data) and a messy version (with realistic data quality issues) to practice identifying and fixing common problems.
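As a rough illustration of what a messy dataset can look like and how base R summaries expose the problems, consider the sketch below. The `messy` data frame and its issues (inconsistent capitalization, an implausible flipper length, missing values) are invented for this example and may differ from the issues in the tutorial's messy file.

```r
# Hypothetical messy data; every issue here is planted for illustration
messy <- data.frame(
  species = c("Adelie", "adelie", "Gentoo", "Gentoo", NA),
  flipper_length_mm = c(181, 186, 2170, NA, 210),
  stringsAsFactors = FALSE
)

# Descriptive statistics: an extreme maximum hints at a data entry error
print(summary(messy$flipper_length_mm))

# Missing values per column
missing_per_column <- colSums(is.na(messy))
print(missing_per_column)

# Frequency table: inconsistent spellings appear as separate categories
species_counts <- table(messy$species, useNA = "ifany")
print(species_counts)
```

None of these functions decide what counts as an error. They make unusual values visible so that you can judge them against what you know about the data.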
Getting Started
Ready to begin? Start with Tutorial Setup to configure your R environment.