Example Data & Tools

In this tutorial, we’ll work with real research data to learn data documentation principles. You’ll also get familiar with the key R packages that make data validation and documentation efficient.

Our Example Dataset: Palmer Penguins

We’ll use the Palmer Penguins dataset throughout this tutorial. This dataset contains observations of three penguin species collected from three islands in the Palmer Archipelago, Antarctica.

The dataset includes measurements like:

  • Bill length and depth (millimeters)
  • Flipper length (millimeters)
  • Body mass (grams)
  • Species, island, and sex information

This dataset is perfect for learning because it:

  • Contains different data types (numeric, categorical)
  • Has some missing values (realistic!)
  • Is scientifically meaningful and well-documented
  • Is small enough to understand completely
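
If you'd like a quick look at the data before downloading the CSV files, the same dataset also ships with the {palmerpenguins} package. This step is optional and not required for the rest of the tutorial:

# Optional: preview the built-in copy of the data
# install.packages("palmerpenguins")   # if not already installed
library(palmerpenguins)

head(penguins)      # first few rows
str(penguins)       # variable types and dimensions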

Key R Packages We’ll Use

Throughout this tutorial, you’ll work with three main packages:

  • {datawizard}: Automates creation of data dictionaries
  • {pointblank}: Performs systematic data validation
  • {skimr}: Generates comprehensive data summaries
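
To give you a sense of how these packages fit together, here is a minimal sketch of the workflow we'll build up in later sections. Treat it as a preview rather than something to run right now: it assumes the penguins_clean data frame we load further down, and exact argument names may vary slightly between package versions.

library(datawizard)
library(pointblank)
library(skimr)

# Data dictionary: one row per variable with types, values, and missingness
data_codebook(penguins_clean)

# Comprehensive summary: completeness, distributions, and mini histograms
skim(penguins_clean)

# Validation: declare rules, then interrogate the data against them
create_agent(penguins_clean) |>
  col_vals_in_set(columns = species,
                  set = c("Adelie", "Chinstrap", "Gentoo")) |>
  col_vals_between(columns = body_mass_g, left = 2000, right = 7000) |>
  interrogate()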

Installing Required Packages

Install the packages you’ll need:

# Install core packages
install.packages(c(
  "datawizard",      # Data dictionaries
  "pointblank",      # Data validation
  "skimr",           # Summary statistics
  "dplyr",           # Data manipulation
  "readr"            # Reading CSV files
))

# Update renv lockfile
renv::snapshot()
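
After installation, confirm everything loads cleanly. If any of these fail, re-run install.packages() for the package in question:

# Each library() call should complete without errors
library(datawizard)
library(pointblank)
library(skimr)
library(dplyr)
library(readr)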

Download the Example Datasets

For this tutorial, you’ll work with two versions of the Palmer Penguins data:

  1. Clean data: High-quality dataset for learning documentation
  2. Messy data: Version with realistic data quality issues for practicing validation

The messy dataset contains 30 intentional data quality problems (~9% error rate) including:

  • Typos in species and island names
  • Inconsistent formatting (e.g., “M” vs “male”, extra spaces)
  • Impossible values (negative measurements, placeholder values like 999)
  • Out-of-range values (body mass of 15,000g when max should be ~6,300g)
  • Data entry errors (wrong years, missing digits)

Download Both Files

Click the buttons below to download both datasets:

Download Clean Data

Download Messy Data

Important: Save both files in a folder called data/ inside your project directory. If the data/ folder doesn’t exist yet, create it first.
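
If you prefer to set up the folder from R, you can create it and check for the files directly. Note that the messy file name used below (penguins_messy.csv) is an assumption; match it to the name of the file you actually downloaded:

# Create the data/ folder if it doesn't exist yet (run from your project root)
dir.create("data", showWarnings = FALSE)

# Both should return TRUE once the files are saved
file.exists("data/penguins_clean.csv")
file.exists("data/penguins_messy.csv")   # file name assumed; adjust if yours differs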

Load and Verify the Data

Now load the clean data and verify it worked:

library(readr)
library(dplyr)

# Load the clean penguins data
penguins_clean <- read_csv("data/penguins_clean.csv", show_col_types = FALSE)

# Take a first look
glimpse(penguins_clean)

You should see 344 observations of 8 variables.

Checkpoint: Verify Your Data

Before continuing, verify the data loaded correctly:

# Should show 344 rows and 8 columns
dim(penguins_clean)

# Should display: species, island, bill_length_mm, bill_depth_mm,
#                 flipper_length_mm, body_mass_g, sex, year
names(penguins_clean)

# Check first few rows
head(penguins_clean)

If you see errors:

  • Make sure both CSV files are in the data/ folder
  • Check that your working directory is your project folder (use getwd() to verify, or run the quick check below)
  • Try restarting R
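
A quick way to diagnose path problems is to ask R where it is looking and what it can see:

# Where is R looking for files?
getwd()

# The two CSV files should be listed here
list.files("data")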

A Closer Look at the Messy Data

We introduced 30 realistic data quality issues across the 344 observations (~9% error rate, typical for real-world field data):

  • Species: 4 issues (typos like “Adelei”, case errors, extra text like “Gentoo penguin”)
  • Island: 2 issues (typos like “Torgerson”, case errors like “biscoe”)
  • Bill length: 3 issues (negative -5.2, impossible 250.5, placeholder 999)
  • Bill depth: 2 issues (negative -2.1, placeholder 99.9)
  • Flipper length: 1 issue (zero value)
  • Body mass: 3 issues (extreme outliers: 15000g, 500g, 10000g)
  • Sex: 11 issues (inconsistent coding: M/F/male/Male/MALE/m/Female)
  • Year: 4 issues (2020, 2006, 207, 20009)

These mirror common problems in real research data: data entry mistakes, sensor errors, placeholder values not removed, and transcription errors. The generation script is in _scripts/create_messy_data.R.
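
If you want to see some of these problems first-hand, load the messy file and inspect a few columns. As above, the file name penguins_messy.csv is an assumption; use the name of the file you downloaded:

# Load the messy version of the data
penguins_messy <- read_csv("data/penguins_messy.csv", show_col_types = FALSE)

# Inconsistent sex coding shows up immediately
table(penguins_messy$sex, useNA = "ifany")

# Typos and case errors in species and island names
unique(penguins_messy$species)
unique(penguins_messy$island)

# Impossible, placeholder, and out-of-range measurement values
summary(penguins_messy$bill_length_mm)
range(penguins_messy$body_mass_g, na.rm = TRUE)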

Next Steps

You now have your environment set up with the necessary packages and both datasets. In the next section, you’ll learn how to create data dictionaries that document what each variable means and how it should be interpreted.
