```r
# Install core packages
install.packages(c(
  "datawizard", # Data dictionaries
  "pointblank", # Data validation
  "skimr",      # Summary statistics
  "dplyr",      # Data manipulation
  "readr"       # Reading CSV files
))

# Update renv lockfile
renv::snapshot()
```
## Example Data & Tools
In this tutorial, we’ll work with real research data to learn data documentation principles. You’ll also get familiar with the key R packages that make data validation and documentation efficient.
### Our Example Dataset: Palmer Penguins
We’ll use the Palmer Penguins dataset throughout this tutorial. This dataset contains observations of three penguin species collected from three islands in the Palmer Archipelago, Antarctica.
The dataset includes measurements like:
- Bill length and depth (millimeters)
- Flipper length (millimeters)
- Body mass (grams)
- Species, island, and sex information
This dataset is perfect for learning because it:
- Contains different data types (numeric, categorical)
- Has some missing values (realistic!)
- Is scientifically meaningful and well-documented
- Is small enough to understand completely
### Key R Packages We’ll Use
Throughout this tutorial, you’ll work with three main packages:
{datawizard}
: Automates creation of data dictionaries

{pointblank}
: Performs systematic data validation

{skimr}
: Generates comprehensive data summaries
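For a quick taste of the first two packages before we dive in, here is a minimal sketch run on a toy data frame (this assumes `datawizard::data_codebook()` and `skimr::skim()` are available in your installed versions; `{pointblank}` gets a fuller treatment later):

```r
# A toy data frame with a missing value, mimicking the penguins data
toy <- data.frame(
  species     = c("Adelie", "Gentoo", NA),
  body_mass_g = c(3750, 5200, 4100)
)

# Data dictionary: one entry per variable, with types and missingness
datawizard::data_codebook(toy)

# Summary statistics: one row per variable, grouped by type
skimr::skim(toy)
```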
### Installing Required Packages

Install the packages you’ll need by running the `install.packages()` call shown at the start of this section, then update your `renv` lockfile with `renv::snapshot()`.
### Download the Example Datasets
For this tutorial, you’ll work with two versions of the Palmer Penguins data:
- Clean data: High-quality dataset for learning documentation
- Messy data: Version with realistic data quality issues for practicing validation
The messy dataset contains 30 intentional data quality problems (~9% error rate) including:
- Typos in species and island names
- Inconsistent formatting (e.g., “M” vs “male”, extra spaces)
- Impossible values (negative measurements, placeholder values like 999)
- Out-of-range values (body mass of 15,000g when max should be ~6,300g)
- Data entry errors (wrong years, missing digits)
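As a preview of the validation workflow covered later, here is a hedged sketch of how `{pointblank}` (recent versions) could catch two of these problem types. The file name `penguins_messy.csv` and the standard palmerpenguins column names (`species`, `body_mass_g`) are assumptions, and the plausible-mass range is taken from the ~6,300 g maximum mentioned above:

```r
library(pointblank)
library(readr)

penguins_messy <- read_csv("data/penguins_messy.csv", show_col_types = FALSE)

# Two checks: flag typos in species names and impossible body masses
agent <- create_agent(tbl = penguins_messy) |>
  col_vals_in_set(columns = species,
                  set = c("Adelie", "Chinstrap", "Gentoo")) |>
  col_vals_between(columns = body_mass_g, left = 2700, right = 6300) |>
  interrogate()

agent # prints a validation report with pass/fail counts per check
```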
#### Download Both Files
Click the buttons below to download both datasets:
Important: Save both files in a folder called `data/` inside your project directory. If the `data/` folder doesn’t exist yet, create it first.
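If you prefer to create the folder from R, a one-liner does it. (The file name `penguins_messy.csv` below is an assumption; `penguins_clean.csv` is the name used later in this tutorial.)

```r
# Create the data/ folder if it doesn't already exist
dir.create("data", showWarnings = FALSE)

# Check that both downloaded files landed in the right place
file.exists(file.path("data", c("penguins_clean.csv", "penguins_messy.csv")))
```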
### Load and Verify the Data
Now load the clean data and verify it worked:
```r
library(readr)
library(dplyr)

# Load the clean penguins data
penguins_clean <- read_csv("data/penguins_clean.csv", show_col_types = FALSE)

# Take a first look
glimpse(penguins_clean)
```
You should see 344 observations of 8 variables.
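To turn that eyeball check into a programmatic one, a short sketch (assuming `penguins_clean` was loaded as above):

```r
# Stop early if the data didn't load with the expected shape
stopifnot(
  nrow(penguins_clean) == 344,
  ncol(penguins_clean) == 8
)
```

If the dimensions are wrong, `stopifnot()` halts with an informative error instead of letting a bad load propagate through the rest of your analysis.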
The messy dataset introduces 30 realistic data quality issues across the same 344 observations (~9% error rate, typical of real-world field data):
- Species: 4 issues (typos like “Adelei”, case errors, extra text like “Gentoo penguin”)
- Island: 2 issues (typos like “Torgerson”, case errors like “biscoe”)
- Bill length: 3 issues (negative -5.2, impossible 250.5, placeholder 999)
- Bill depth: 2 issues (negative -2.1, placeholder 99.9)
- Flipper length: 1 issue (zero value)
- Body mass: 3 issues (extreme outliers: 15000g, 500g, 10000g)
- Sex: 11 issues (inconsistent coding: M/F/male/Male/MALE/m/Female)
- Year: 4 issues (2020, 2006, 207, 20009)
These mirror common problems in real research data: data entry mistakes, sensor errors, placeholder values not removed, and transcription errors. The generation script is in `_scripts/create_messy_data.R`.
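A simple way to surface several of these issues at once is to tally the categorical columns; inconsistent codings and impossible values stand out immediately in a sorted count. (A sketch: the file name `penguins_messy.csv` and the column names `sex` and `year` are assumptions based on the standard palmerpenguins schema.)

```r
library(dplyr)
library(readr)

penguins_messy <- read_csv("data/penguins_messy.csv", show_col_types = FALSE)

# Inconsistent codings like "M", "male", "Male", "MALE" appear side by side
count(penguins_messy, sex, sort = TRUE)

# Impossible years (e.g. 207, 20009) stand out in a tally
count(penguins_messy, year, sort = TRUE)
```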
## Next Steps
You now have your environment set up with the necessary packages and both datasets. In the next section, you’ll learn how to create data dictionaries that document what each variable means and how it should be interpreted.