Summary Statistics

Summary statistics are your first line of defense in data validation. They provide a systematic way to spot quality issues, understand your data’s characteristics, and verify that reality matches your documented expectations.

What You’ll Learn

By the end of this section, you will:

  • Use summary() to get a quick overview of data quality
  • Apply base R functions to check numeric and categorical variables
  • Compare clean vs. messy data to spot problems
  • Identify errors in the messy Palmer Penguins dataset

How Summary Statistics Reveal Problems

Summary statistics help you verify that your actual data matches what you documented in your data dictionary:

  • Range checks: Are values within expected bounds?
  • Missing data: How much data is absent?
  • Category checks: Are there unexpected values or typos?

For example, if your data dictionary says bill length should be 30-60mm, but min() returns -5.2mm, you’ve found a data quality problem.
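
A range check can be a single comparison against the documented bounds. Here is a minimal sketch using a made-up vector rather than the real data:

# Hypothetical bill lengths, including one impossible value
bill_lengths <- c(39.1, 44.5, -5.2, 48.7)

# Any values outside the documented 30-60mm range?
any(bill_lengths < 30 | bill_lengths > 60, na.rm = TRUE)  # TRUE, because of -5.2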

Load and Compare the Data

Let’s load both the clean and messy Palmer Penguins datasets:

library(readr)

# Load clean penguins data from CSV
penguins_clean <- read_csv("data/penguins_clean.csv", show_col_types = FALSE)

# Load messy penguins data (we'll compare them later)
penguins_messy <- read_csv("data/penguins_messy.csv", show_col_types = FALSE)
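
Before comparing summaries, it can help to confirm that both files loaded with the same structure (this assumes the messy file keeps the clean file's columns):

# Same dimensions and same column names?
dim(penguins_clean)
dim(penguins_messy)
identical(names(penguins_clean), names(penguins_messy))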

Using summary() to Spot Issues

The summary() function provides a quick overview of your entire dataset. Let’s use it to compare clean and messy data and see what issues we can spot:

# Compare clean and messy data
cat("=== CLEAN DATA ===\n")
summary(penguins_clean)

cat("\n\n=== MESSY DATA ===\n")
summary(penguins_messy)

What summary() Shows

For numeric variables, summary() displays:

  • Minimum and maximum values
  • Quartiles (25th, 50th/median, 75th percentiles)
  • Mean
  • Count of missing values (NA’s)

For character/categorical variables, it shows:

  • Length and class information
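
Because summary() says so little about character columns, a handy trick is to convert them to factors first; summary() then reports a count for each category:

# Convert to factor so summary() shows per-category counts
summary(as.factor(penguins_messy$species))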

Exercise: Spot the Differences

Compare the clean and messy outputs. Look for:

  • Negative values where they shouldn’t exist (e.g., negative bill lengths)
  • Extremely large or small numbers (outliers or data entry errors)
  • Unexpected NA counts
  • Minimum and maximum values that don’t make biological sense

What issues can you spot in the messy data?

Check Categorical Variables

Now check for typos and unexpected values in categorical variables:

# Examine categorical variables in messy data
cat("=== Species ===\n")
table(penguins_messy$species, useNA = "always")

cat("\n=== Island ===\n")
table(penguins_messy$island, useNA = "always")

cat("\n=== Sex ===\n")
table(penguins_messy$sex, useNA = "always")

cat("\n=== Year ===\n")
table(penguins_messy$year, useNA = "always")

What to look for:

  • Are all values ones you expected? (3 species, 3 islands, 2 sexes, 3 years)
  • Are there typos or variant spellings?
  • Are there unexpected missing values?
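
If a table is long, sort(unique()) gives a compact alphabetical list of the distinct values, which makes variant spellings easy to spot:

# Distinct species values, sorted; typos cluster next to the real names
sort(unique(penguins_messy$species))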

Check Numeric Variables in Detail

While summary() is great for a quick overview, sometimes you want to focus on specific variables. Let’s inspect individual variables using base R functions, printing only the statistics that make sense for your data:

# Detailed examination of a numeric variable
cat("=== Bill Length (detailed) ===\n")
cat("Range:", min(penguins_messy$bill_length_mm, na.rm = TRUE), "to",
    max(penguins_messy$bill_length_mm, na.rm = TRUE), "\n")
cat("Mean:", round(mean(penguins_messy$bill_length_mm, na.rm = TRUE), 2), "\n")
cat("Median:", round(median(penguins_messy$bill_length_mm, na.rm = TRUE), 2), "\n")
cat("Standard deviation:", round(sd(penguins_messy$bill_length_mm, na.rm = TRUE), 2), "\n")
cat("Missing values:", sum(is.na(penguins_messy$bill_length_mm)), "\n")

Check Missing Values

# How many missing values in each variable?
colSums(is.na(penguins_messy))

Compare to your data dictionary: sex can have NAs (couldn’t be determined in field), but species should never be missing.
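
Raw counts can be hard to judge on their own; converting them to percentages makes it easier to see whether missingness is at a worrying level:

# Percentage of missing values per variable
round(colMeans(is.na(penguins_messy)) * 100, 1)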

For numeric variables:

  • min() / max(): Find range boundaries
  • mean(): Average (affected by outliers)
  • median(): Middle value (robust to outliers)
  • sd(): Standard deviation (variability)
  • sum(is.na()): Count missing values

For categorical variables:

  • table(): Frequency counts for each category
  • unique(): List all distinct values
  • length(unique()): Count how many distinct values

Important: Always use na.rm = TRUE with numeric functions to handle missing values!
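
To see why, compare the two calls below: without na.rm = TRUE, a single missing value makes the whole result NA.

# One NA poisons the result unless it is removed first
mean(c(1, 2, NA))                # NA
mean(c(1, 2, NA), na.rm = TRUE)  # 1.5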

Checkpoint: What Did You Find?

After running the code above on the messy data, you should have spotted:

Numeric problems:

  • Negative bill lengths
  • Impossibly large or small values
  • Placeholder values (999)

Categorical problems:

  • Typos in species names (“Adelei”, “ADELIE”, “Gentoo penguin”)
  • Typos in island names (“Torgerson”, “biscoe”)
  • Inconsistent sex coding (“M”, “F”, “male”, “Male”, “MALE”)
  • Invalid years (2020, 2006, 207, 20009)

Did you find these issues? If so, you’ve successfully used summary statistics for data validation!

Summary Statistics Workflow

When validating new data, work through these steps (sketched in code after the list):

  1. Use summary() for a quick overview
  2. Check categorical variables with table() and unique()
  3. Check numeric ranges with min(), max(), range()
  4. Check missing values with colSums(is.na())
  5. Compare findings to your data dictionary and expectations
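
In code, the workflow looks roughly like this (one call per step, run on the messy data):

# 1. Quick overview
summary(penguins_messy)

# 2. Categorical checks
table(penguins_messy$species, useNA = "always")

# 3. Numeric range checks
range(penguins_messy$bill_length_mm, na.rm = TRUE)

# 4. Missing-value counts
colSums(is.na(penguins_messy))

# 5. Compare each result against your data dictionary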

Automating Validation with Functions

You can wrap these checks into a function that automatically validates data based on your documented rules:

# Function to validate penguins data based on data dictionary rules
validate_penguins <- function(data) {
  cat("=== VALIDATION REPORT ===\n\n")

  # Structural check: Valid species (from data dictionary)
  valid_species <- c("Adelie", "Chinstrap", "Gentoo")
  if (all(data$species %in% c(valid_species, NA))) {
    cat("Species: All values valid\n")
  } else {
    cat("Species: Invalid values found\n")
  }

  # Content check: Bill length range (from data dictionary: 30-60mm)
  if (all(data$bill_length_mm >= 30 & data$bill_length_mm <= 60, na.rm = TRUE)) {
    cat("Bill length: All values in valid range (30-60mm)\n")
  } else {
    cat("Bill length: Out-of-range values found\n")
  }

  # Domain check: Body mass biologically plausible (2500-6500g)
  if (all(data$body_mass_g >= 2500 & data$body_mass_g <= 6500, na.rm = TRUE)) {
    cat("Body mass: All values biologically plausible\n")
  } else {
    cat("Body mass: Implausible values found\n")
  }
}

# Test with clean data
cat("CLEAN DATA:\n")
validate_penguins(penguins_clean)

# Test with messy data
cat("\n\nMESSY DATA:\n")
validate_penguins(penguins_messy)

How this function works:

  • validate_penguins() takes a data frame as input and checks it against rules from your data dictionary
  • all() checks if every value meets a condition (returns TRUE only if all values pass)
  • %in% checks if values are in the allowed list (e.g., species must be Adelie, Chinstrap, or Gentoo)
  • na.rm = TRUE ignores missing values when checking numeric ranges
  • The function prints different messages depending on whether data passes or fails each check

Three types of validation demonstrated:

  • Structural: Are species names valid categories? (Adelie, Chinstrap, Gentoo)
  • Content: Are bill lengths within the documented range? (30-60mm)
  • Domain: Are body masses biologically plausible? (2500-6500g)

When you run this function on the clean data, all checks pass. When you run it on the messy data, several checks fail, telling you which validation rules the data violates.
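
The function reports which checks failed but not which rows are responsible. One way to locate them is which() on the inverted condition, sketched here for the species check:

# Row numbers whose species is not in the allowed set (ignoring NAs)
valid_species <- c("Adelie", "Chinstrap", "Gentoo")
which(!(penguins_messy$species %in% valid_species) &
        !is.na(penguins_messy$species))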

Integrating Validation into Your Documentation

Remember the penguins_documentation.qmd file you created in the data dictionaries section? Let’s add validation checks to it! To keep things simple, we’ll add a new section that summarizes key validation statistics.

Exercise: Add Validation to Your Living Document

Update your penguins_documentation.qmd file to include validation checks.

Add this section after the automated codebook:

## Data Validation

Below are summary statistics to check data quality:

```{r}
#| echo: false
#| message: false

# Quick validation checks
cat("=== Data Quality Summary ===\n\n")

# Check for data quality issues in numeric variables
cat("Numeric Variables - Range Checks:\n")
cat("Bill length range:", range(penguins$bill_length_mm, na.rm = TRUE), "mm\n")
cat("Bill depth range:", range(penguins$bill_depth_mm, na.rm = TRUE), "mm\n")
cat("Body mass range:", range(penguins$body_mass_g, na.rm = TRUE), "g\n\n")

# Check categorical variables for unexpected values
cat("Categorical Variables - Unique Value Counts:\n")
cat("Species (expect 3):", length(unique(penguins$species)), "\n")
cat("Islands (expect 3):", length(unique(penguins$island)), "\n")
cat("Years (expect 3):", length(unique(penguins$year)), "\n\n")

# Check missing values
cat("Missing Values by Variable:\n")
print(colSums(is.na(penguins)))
```

Then render the document to see your complete data dictionary with built-in validation!

Why this is powerful:

  • Data dictionary + validation in one place gives complete documentation
  • Updates automatically when data changes
  • Easy to share with collaborators
  • Documents both what data should be and what it actually is

As you become more comfortable with data validation, you may want to explore specialized R packages that automate quality checks and generate professional reports:

{pointblank} (Iannone, Vargas, and Choe 2024) - Comprehensive data validation toolkit

  • Define validation rules (e.g., “bill length must be 30-60mm”)
  • Generate HTML validation reports with pass/fail results
  • Set quality thresholds and get alerts when data fails checks
  • Best for: Projects with many datasets or frequent data updates
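
As a rough sketch of the {pointblank} workflow (treat the argument names as assumptions and check the package documentation before relying on them):

library(pointblank)

# Build an agent with rules from the data dictionary, then interrogate the data
agent <- create_agent(tbl = penguins_messy) %>%
  col_vals_between(columns = vars(bill_length_mm),
                   left = 30, right = 60, na_pass = TRUE) %>%
  col_vals_in_set(columns = vars(species),
                  set = c("Adelie", "Chinstrap", "Gentoo")) %>%
  interrogate()

agent  # printing the agent shows a pass/fail validation report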

{validate} (van der Loo and de Jonge 2021) - Rule-based data validation

  • Create validation rules in plain language syntax
  • Check data against your rules and get detailed reports
  • Track which rows fail which rules
  • Best for: Complex validation logic across multiple variables
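
A comparable sketch with {validate} (again, verify the exact syntax against the package docs):

library(validate)

# Express the data dictionary as rules in plain R syntax
rules <- validator(
  bill_length_mm >= 30,
  bill_length_mm <= 60,
  species %in% c("Adelie", "Chinstrap", "Gentoo")
)

# Confront the data with the rules and summarize pass/fail counts per rule
out <- confront(penguins_messy, rules)
summary(out)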

{assertr} (Fischetti 2023) - Defensive programming for data pipelines

  • Add validation checks directly into your data processing code
  • Stop execution if data fails quality checks
  • Best for: Automated workflows and data pipelines
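
A minimal {assertr} sketch, assuming the same documented range:

library(assertr)

# Halts with an informative error if any bill length leaves the 30-60mm range
assert(penguins_messy, within_bounds(30, 60), bill_length_mm)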

{skimr} (Waring et al. 2025) - Enhanced summary statistics

  • More informative summaries than base R summary()
  • Includes histograms and additional statistics
  • Best for: Quick exploratory data validation
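
And {skimr} needs just one call:

library(skimr)

# Richer replacement for summary(): counts, quantiles, and inline histograms
skim(penguins_messy)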

These tools build on the concepts you’ve learned in this tutorial. Start with the base R functions you now know, and explore these packages when your validation needs grow more complex.

Conclusion

Congratulations! You’ve completed this tutorial on data documentation and validation in R. You now know how to:

  • Create data dictionaries manually and automatically with {datawizard}
  • Identify data quality issues using summary statistics
  • Integrate documentation and validation into reproducible Quarto documents
  • Build a complete data quality workflow for your research

These skills form the foundation for reproducible, transparent, and trustworthy research. By documenting and validating your data systematically, you’re contributing to better science and making your work more accessible to others.

Your next steps: Apply these techniques to your own research data. Start small with one dataset, create its documentation, validate it, and build from there. Good data practices become easier with practice!


References

Fischetti, Tony. 2023. Assertr: Assertive Programming for R Analysis Pipelines. https://CRAN.R-project.org/package=assertr.
Iannone, Richard, Mauricio Vargas, and June Choe. 2024. Pointblank: Data Validation and Organization of Metadata for Local and Remote Tables. https://CRAN.R-project.org/package=pointblank.
van der Loo, Mark P. J., and Edwin de Jonge. 2021. “Data Validation Infrastructure for R.” Journal of Statistical Software 97 (10): 1–31. https://doi.org/10.18637/jss.v097.i10.
Waring, Elin, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2025. Skimr: Compact and Flexible Summaries of Data. https://CRAN.R-project.org/package=skimr.