## Summary Statistics

Summary statistics are your first line of defense in data validation. They provide a systematic way to spot quality issues, understand your data's characteristics, and verify that reality matches your documented expectations.

By the end of this section, you will:

- Use `summary()` to get a quick overview of data quality
- Apply base R functions to check numeric and categorical variables
- Compare clean vs. messy data to spot problems
- Identify errors in the messy Palmer Penguins dataset

### How Summary Statistics Reveal Problems

Summary statistics help you verify that your actual data matches what you documented in your data dictionary:

- Range checks: Are values within expected bounds?
- Missing data: How much data is absent?
- Category checks: Are there unexpected values or typos?

For example, if your data dictionary says bill length should be 30-60mm, but `min()` returns -5.2mm, you've found a data quality problem.

### Load and Compare the Data

Let's load both the clean and messy Palmer Penguins datasets:

```r
library(readr)

# Load clean penguins data from CSV
penguins_clean <- read_csv("data/penguins_clean.csv", show_col_types = FALSE)

# Load messy penguins data (we'll compare them later)
penguins_messy <- read_csv("data/penguins_messy.csv", show_col_types = FALSE)
```
### Using `summary()` to Spot Issues

The `summary()` function provides a quick overview of your entire dataset. Let's use it to compare clean and messy data and see what issues we can spot:

```r
# Compare clean and messy data
cat("=== CLEAN DATA ===\n")
summary(penguins_clean)

cat("\n\n=== MESSY DATA ===\n")
summary(penguins_messy)
```

**What `summary()` Shows**
For numeric variables, summary() displays:
- Minimum and maximum values
- Quartiles (25th, 50th/median, 75th percentiles)
- Mean
- Count of missing values (NA’s)
For character/categorical variables, it shows:
- Length and class information
Compare the clean and messy outputs. Look for:
- Negative values where they shouldn’t exist (e.g., negative bill lengths)
- Extremely large or small numbers (outliers or data entry errors)
- Unexpected NA counts
- Minimum and maximum values that don't make biological sense
What issues can you spot in the messy data?
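If the `summary()` output looks suspicious, you can pull the offending rows out directly with base R subsetting. A minimal sketch, assuming the messy data really does contain negative bill lengths and an implausibly heavy penguin or two (the thresholds are illustrative):

```r
# Rows with impossible (negative) bill lengths
penguins_messy[which(penguins_messy$bill_length_mm < 0), ]

# Rows with implausibly large body mass (example cut-off of 10,000 g)
penguins_messy[which(penguins_messy$body_mass_g > 10000), ]
```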
### Check Categorical Variables

Now check for typos and unexpected values in categorical variables:

```r
# Examine categorical variables in messy data
cat("=== Species ===\n")
table(penguins_messy$species, useNA = "always")

cat("\n=== Island ===\n")
table(penguins_messy$island, useNA = "always")

cat("\n=== Sex ===\n")
table(penguins_messy$sex, useNA = "always")

cat("\n=== Year ===\n")
table(penguins_messy$year, useNA = "always")
```

**What to look for:**
- Are all values ones you expected? (3 species, 3 islands, 2 sexes, 3 years)
- Are there typos or variant spellings?
- Are there unexpected missing values?
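You can also compare the observed categories against the documented ones programmatically. A small sketch using `setdiff()` (the expected sets here follow the standard Palmer Penguins coding; adjust them to match your own data dictionary):

```r
# Values present in the data but missing from the documented set
setdiff(unique(penguins_messy$species), c("Adelie", "Chinstrap", "Gentoo"))
setdiff(unique(penguins_messy$island), c("Biscoe", "Dream", "Torgersen"))
setdiff(unique(penguins_messy$sex), c("female", "male", NA))
```

Anything returned here is a typo, a variant spelling, or an undocumented category worth investigating.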
### Check Numeric Variables in Detail

While `summary()` is great for a quick overview, sometimes you will want to focus on specific variables. Let's inspect individual variables with base R functions, printing the statistics that make sense for your data:

```r
# Detailed examination of a numeric variable
cat("=== Bill Length (detailed) ===\n")
cat("Range:", min(penguins_messy$bill_length_mm, na.rm = TRUE), "to",
    max(penguins_messy$bill_length_mm, na.rm = TRUE), "\n")
cat("Mean:", round(mean(penguins_messy$bill_length_mm, na.rm = TRUE), 2), "\n")
cat("Median:", round(median(penguins_messy$bill_length_mm, na.rm = TRUE), 2), "\n")
cat("Standard deviation:", round(sd(penguins_messy$bill_length_mm, na.rm = TRUE), 2), "\n")
cat("Missing values:", sum(is.na(penguins_messy$bill_length_mm)), "\n")
```

### Check Missing Values
```r
# How many missing values in each variable?
colSums(is.na(penguins_messy))
```

Compare to your data dictionary: `sex` can have NAs (it couldn't always be determined in the field), but `species` should never be missing.
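If you prefer proportions to raw counts, the same idea works with `colMeans()`. A quick sketch:

```r
# Percentage of missing values per variable
round(colMeans(is.na(penguins_messy)) * 100, 1)
```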
For numeric variables:

- `min()` / `max()`: Find range boundaries
- `mean()`: Average (affected by outliers)
- `median()`: Middle value (robust to outliers)
- `sd()`: Standard deviation (variability)
- `sum(is.na())`: Count missing values

For categorical variables:

- `table()`: Frequency counts for each category
- `unique()`: List all distinct values
- `length(unique())`: Count how many distinct values

**Important:** Always use `na.rm = TRUE` with numeric functions to handle missing values!
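To see why this matters, here is a tiny illustration with a made-up vector: a single missing value silently turns the result into `NA` unless you drop it.

```r
x <- c(39.1, 41.3, NA, 38.9)

mean(x)               # NA: the missing value propagates
mean(x, na.rm = TRUE) # 39.77: the missing value is dropped first
```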
### Summary Statistics Workflow

When validating new data, work through these steps (strung together in the sketch below):

- Use `summary()` for a quick overview
- Check categorical variables with `table()` and `unique()`
- Check numeric ranges with `min()`, `max()`, `range()`
- Check missing values with `colSums(is.na())`
- Compare findings to your data dictionary and expectations
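One pass over a new dataset then looks roughly like this (a sketch using the messy data loaded earlier):

```r
summary(penguins_messy)                             # 1. quick overview
table(penguins_messy$species, useNA = "always")     # 2. categorical checks
range(penguins_messy$bill_length_mm, na.rm = TRUE)  # 3. numeric ranges
colSums(is.na(penguins_messy))                      # 4. missing values
# 5. compare each result against your data dictionary
```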
## Automating Validation with Functions

You can wrap these checks into a function that automatically validates data based on your documented rules:

```r
# Function to validate penguins data based on data dictionary rules
validate_penguins <- function(data) {
  cat("=== VALIDATION REPORT ===\n\n")

  # Structural check: Valid species (from data dictionary)
  valid_species <- c("Adelie", "Chinstrap", "Gentoo")
  if (all(data$species %in% c(valid_species, NA))) {
    cat("Species: All values valid\n")
  } else {
    cat("Species: Invalid values found\n")
  }

  # Content check: Bill length range (from data dictionary: 30-60mm)
  if (all(data$bill_length_mm >= 30 & data$bill_length_mm <= 60, na.rm = TRUE)) {
    cat("Bill length: All values in valid range (30-60mm)\n")
  } else {
    cat("Bill length: Out-of-range values found\n")
  }

  # Domain check: Body mass biologically plausible (2500-6500g)
  if (all(data$body_mass_g >= 2500 & data$body_mass_g <= 6500, na.rm = TRUE)) {
    cat("Body mass: All values biologically plausible\n")
  } else {
    cat("Body mass: Implausible values found\n")
  }
}

# Test with clean data
cat("CLEAN DATA:\n")
validate_penguins(penguins_clean)

# Test with messy data
cat("\n\nMESSY DATA:\n")
validate_penguins(penguins_messy)
```

**How this function works:**
- `validate_penguins()` takes a data frame as input and checks it against rules from your data dictionary
- `all()` checks if every value meets a condition (returns TRUE only if all values pass)
- `%in%` checks if values are in the allowed list (e.g., species must be Adelie, Chinstrap, or Gentoo)
- `na.rm = TRUE` ignores missing values when checking numeric ranges
- The function prints different messages depending on whether data passes or fails each check
Three types of validation demonstrated:
- Structural: Are species names valid categories? (Adelie, Chinstrap, Gentoo)
- Content: Are bill lengths within the documented range? (30-60mm)
- Domain: Are body masses biologically plausible? (2500-6500g)
When you run this function on the clean data, all checks pass. When you run it on the messy data, multiple checks fail and the report tells you exactly which rules were violated.
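If a check fails and you want to see which observations are responsible, the same rules can be used to subset the data. A sketch reusing the thresholds from `validate_penguins()`:

```r
# Rows whose species is not one of the documented categories
valid_species <- c("Adelie", "Chinstrap", "Gentoo")
penguins_messy[!penguins_messy$species %in% c(valid_species, NA), ]

# Rows with bill lengths outside the documented 30-60mm range
penguins_messy[which(penguins_messy$bill_length_mm < 30 |
                     penguins_messy$bill_length_mm > 60), ]
```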
## Integrating Validation into Your Documentation

Remember the `penguins_documentation.qmd` file you created in the data dictionaries section? Let's add validation checks to it! To keep things simple, we'll add a new section that summarizes key validation statistics.

Update your `penguins_documentation.qmd` file to include validation checks. Add this section after the automated codebook:
````markdown
## Data Validation

Below are summary statistics to check data quality:

```{r}
#| echo: false
#| message: false

# Quick validation checks
cat("=== Data Quality Summary ===\n\n")

# Check for data quality issues in numeric variables
cat("Numeric Variables - Range Checks:\n")
cat("Bill length range:", range(penguins$bill_length_mm, na.rm = TRUE), "mm\n")
cat("Bill depth range:", range(penguins$bill_depth_mm, na.rm = TRUE), "mm\n")
cat("Body mass range:", range(penguins$body_mass_g, na.rm = TRUE), "g\n\n")

# Check categorical variables for unexpected values
cat("Categorical Variables - Unique Value Counts:\n")
cat("Species (expect 3):", length(unique(penguins$species)), "\n")
cat("Islands (expect 3):", length(unique(penguins$island)), "\n")
cat("Years (expect 3):", length(unique(penguins$year)), "\n\n")

# Check missing values
cat("Missing Values by Variable:\n")
print(colSums(is.na(penguins)))
```
````

Then render the document to see your complete data dictionary with built-in validation!
Why this is powerful:
- Data dictionary + validation in one place gives complete documentation
- Updates automatically when data changes
- Easy to share with collaborators
- Documents both what data should be and what it actually is
As you become more comfortable with data validation, you may want to explore specialized R packages that automate quality checks and generate professional reports:
{pointblank} (Iannone, Vargas, and Choe 2024) - Comprehensive data validation toolkit
- Define validation rules (e.g., “bill length must be 30-60mm”)
- Generate HTML validation reports with pass/fail results
- Set quality thresholds and get alerts when data fails checks
- Best for: Projects with many datasets or frequent data updates
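For a flavor of how this looks in practice, here is a rough sketch of a {pointblank} agent encoding the rules from `validate_penguins()` (based on the package's documented verbs; check the current API before relying on it):

```r
library(pointblank)

agent <- create_agent(tbl = penguins_clean) |>
  col_vals_in_set(columns = vars(species),
                  set = c("Adelie", "Chinstrap", "Gentoo")) |>
  col_vals_between(columns = vars(bill_length_mm),
                   left = 30, right = 60, na_pass = TRUE) |>
  interrogate()

agent  # printing the agent displays the validation report
```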
{validate} (van der Loo and de Jonge 2021) - Rule-based data validation
- Create validation rules in plain language syntax
- Check data against your rules and get detailed reports
- Track which rows fail which rules
- Best for: Complex validation logic across multiple variables
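The same rules could be expressed with {validate} roughly like this (a sketch; see the package vignettes for the full syntax):

```r
library(validate)

rules <- validator(
  species %in% c("Adelie", "Chinstrap", "Gentoo"),
  bill_length_mm >= 30,
  bill_length_mm <= 60,
  body_mass_g >= 2500,
  body_mass_g <= 6500
)

summary(confront(penguins_messy, rules))  # pass/fail counts per rule
```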
{assertr} (Fischetti 2023) - Defensive programming for data pipelines
- Add validation checks directly into your data processing code
- Stop execution if data fails quality checks
- Best for: Automated workflows and data pipelines
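A rough {assertr} equivalent, dropped straight into a pipeline, might look like this sketch; each `assert()` call stops with an error if any row fails:

```r
library(assertr)

penguins_clean |>
  assert(in_set("Adelie", "Chinstrap", "Gentoo"), species) |>
  assert(within_bounds(30, 60), bill_length_mm) |>
  assert(within_bounds(2500, 6500), body_mass_g)
```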
{skimr} (Waring et al. 2025) - Enhanced summary statistics
- More informative summaries than base R `summary()`
- Includes histograms and additional statistics
- Best for: Quick exploratory data validation
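Getting started with {skimr} is a one-liner (a sketch):

```r
library(skimr)

skim(penguins_clean)  # per-variable summaries, including inline histograms
```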
These tools build on the concepts you’ve learned in this tutorial. Start with the base R functions you now know, and explore these packages when your validation needs grow more complex.
## Conclusion

Congratulations! You've completed this tutorial on data documentation and validation in R. You now know how to:

- Create data dictionaries manually and automatically with `{datawizard}`
- Identify data quality issues using summary statistics
- Integrate documentation and validation into reproducible Quarto documents
- Build a complete data quality workflow for your research
These skills form the foundation for reproducible, transparent, and trustworthy research. By documenting and validating your data systematically, you’re contributing to better science and making your work more accessible to others.
Your next steps: Apply these techniques to your own research data. Start small with one dataset, create its documentation, validate it, and build from there. Good data practices become easier with practice!