Summary Statistics

Summary statistics are your first line of defense in data validation. They provide a systematic way to spot quality issues, understand your data’s characteristics, and verify that reality matches your documented expectations.

What You’ll Learn

By the end of this section, you will:

  • Use summary() to get a quick overview of data quality
  • Apply base R functions to check numeric and categorical variables
  • Compare clean vs. messy data to spot problems
  • Identify errors in the messy Palmer Penguins dataset

How Summary Statistics Reveal Problems

Summary statistics help you verify that your actual data matches what you documented in your data dictionary:

  • Range checks: Are values within expected bounds?
  • Missing data: How much data is absent?
  • Category checks: Are there unexpected values or typos?

For example, if your data dictionary says bill length should be 30-60mm, but min() returns -5.2mm, you’ve found a data quality problem.
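
A range check can be a single comparison against the documented bounds. Here is a minimal sketch using a made-up vector rather than the real data:

# Hypothetical bill lengths, including one impossible value
bill_lengths <- c(39.1, 44.5, -5.2, 48.7)

# Any values outside the documented 30-60mm range?
any(bill_lengths < 30 | bill_lengths > 60, na.rm = TRUE)  # TRUE, because of -5.2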

Load and Compare the Data

Let’s load both the clean and messy Palmer Penguins datasets:

library(readr)

# Load clean penguins data from CSV
penguins_clean <- read_csv("data/penguins_clean.csv", show_col_types = FALSE)

# Load messy penguins data (we'll compare them later)
penguins_messy <- read_csv("data/penguins_messy.csv", show_col_types = FALSE)
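
Before comparing summaries, it can help to confirm that both files loaded with the same structure (this assumes the messy file keeps the clean file's columns):

# Same dimensions and same column names?
dim(penguins_clean)
dim(penguins_messy)
identical(names(penguins_clean), names(penguins_messy))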

Using summary() to Spot Issues

The summary() function provides a quick overview of your entire dataset. Let’s use it to compare clean and messy data and see what issues we can spot:

# Compare clean and messy data
cat("=== CLEAN DATA ===\n")
summary(penguins_clean)

cat("\n\n=== MESSY DATA ===\n")
summary(penguins_messy)

What summary() Shows

For numeric variables, summary() displays:

  • Minimum and maximum values
  • Quartiles (25th, 50th/median, 75th percentiles)
  • Mean
  • Count of missing values (NA’s)

For character/categorical variables, it shows:

  • Length and class information
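
Because summary() says so little about character columns, a handy trick is to convert them to factors first; summary() then reports a count for each category:

# Convert to factor so summary() shows per-category counts
summary(as.factor(penguins_messy$species))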

Exercise: Spot the Differences

Compare the clean and messy outputs. Look for:

  • Negative values where they shouldn’t exist (e.g., negative bill lengths)
  • Extremely large or small numbers (outliers or data entry errors)
  • Unexpected NA counts
  • Minimum and maximum values that don’t make biological sense

What issues can you spot in the messy data?

Check Categorical Variables

Now check for typos and unexpected values in categorical variables:

# Examine categorical variables in messy data
cat("=== Species ===\n")
table(penguins_messy$species, useNA = "always")

cat("\n=== Island ===\n")
table(penguins_messy$island, useNA = "always")

cat("\n=== Sex ===\n")
table(penguins_messy$sex, useNA = "always")

cat("\n=== Year ===\n")
table(penguins_messy$year, useNA = "always")

What to look for:

  • Are all values ones you expected? (3 species, 3 islands, 2 sexes, 3 years)
  • Are there typos or variant spellings?
  • Are there unexpected missing values?
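
If a table is long, sort(unique()) gives a compact alphabetical list of the distinct values, which makes variant spellings easy to spot:

# Distinct species values, sorted; typos cluster next to the real names
sort(unique(penguins_messy$species))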

Check Numeric Variables in Detail

While summary() is great for a quick overview, sometimes you want to focus on specific variables. Let’s inspect individual variables using base R functions, printing only the statistics that make sense for your data:

# Detailed examination of a numeric variable
cat("=== Bill Length (detailed) ===\n")
cat("Range:", min(penguins_messy$bill_length_mm, na.rm = TRUE), "to",
    max(penguins_messy$bill_length_mm, na.rm = TRUE), "\n")
cat("Mean:", round(mean(penguins_messy$bill_length_mm, na.rm = TRUE), 2), "\n")
cat("Median:", round(median(penguins_messy$bill_length_mm, na.rm = TRUE), 2), "\n")
cat("Standard deviation:", round(sd(penguins_messy$bill_length_mm, na.rm = TRUE), 2), "\n")
cat("Missing values:", sum(is.na(penguins_messy$bill_length_mm)), "\n")

Check Missing Values

# How many missing values in each variable?
colSums(is.na(penguins_messy))

Compare to your data dictionary: sex can have NAs (couldn’t be determined in field), but species should never be missing.
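
Raw counts can be hard to judge on their own; converting them to percentages makes it easier to see whether missingness is at a worrying level:

# Percentage of missing values per variable
round(colMeans(is.na(penguins_messy)) * 100, 1)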

For numeric variables:

  • min() / max(): Find range boundaries
  • mean(): Average (affected by outliers)
  • median(): Middle value (robust to outliers)
  • sd(): Standard deviation (variability)
  • sum(is.na()): Count missing values

For categorical variables:

  • table(): Frequency counts for each category
  • unique(): List all distinct values
  • length(unique()): Count how many distinct values

Important: Always use na.rm = TRUE with numeric functions to handle missing values!
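
To see why, compare the two calls below: without na.rm = TRUE, a single missing value makes the whole result NA.

# One NA poisons the result unless it is removed first
mean(c(1, 2, NA))                # NA
mean(c(1, 2, NA), na.rm = TRUE)  # 1.5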

Checkpoint: What Did You Find?

After running the code above on the messy data, you should have spotted:

Numeric problems:

  • Negative bill lengths
  • Impossibly large or small values
  • Placeholder values (999)

Categorical problems:

  • Typos in species names (“Adelei”, “ADELIE”, “Gentoo penguin”)
  • Typos in island names (“Torgerson”, “biscoe”)
  • Inconsistent sex coding (“M”, “F”, “male”, “Male”, “MALE”)
  • Invalid years (2020, 2006, 207, 20009)

Did you find these issues? If so, you’ve successfully used summary statistics for data validation!

Summary Statistics Workflow

When validating new data, work through these steps (sketched in code after the list):

  1. Use summary() for a quick overview
  2. Check categorical variables with table() and unique()
  3. Check numeric ranges with min(), max(), range()
  4. Check missing values with colSums(is.na())
  5. Compare findings to your data dictionary and expectations
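
In code, the workflow looks roughly like this (one call per step, run on the messy data):

# 1. Quick overview
summary(penguins_messy)

# 2. Categorical checks
table(penguins_messy$species, useNA = "always")

# 3. Numeric range checks
range(penguins_messy$bill_length_mm, na.rm = TRUE)

# 4. Missing-value counts
colSums(is.na(penguins_messy))

# 5. Compare each result against your data dictionary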

Automating Validation with Functions

You can wrap these checks into a function that automatically validates data based on your documented rules:

# Function to validate penguins data based on data dictionary rules
validate_penguins <- function(data) {
  cat("=== VALIDATION REPORT ===\n\n")

  # Structural check: Valid species (from data dictionary)
  valid_species <- c("Adelie", "Chinstrap", "Gentoo")
  if (all(data$species %in% c(valid_species, NA))) {
    cat("Species: All values valid\n")
  } else {
    cat("Species: Invalid values found\n")
  }

  # Content check: Bill length range (from data dictionary: 30-60mm)
  if (all(data$bill_length_mm >= 30 & data$bill_length_mm <= 60, na.rm = TRUE)) {
    cat("Bill length: All values in valid range (30-60mm)\n")
  } else {
    cat("Bill length: Out-of-range values found\n")
  }

  # Domain check: Body mass biologically plausible (2500-6500g)
  if (all(data$body_mass_g >= 2500 & data$body_mass_g <= 6500, na.rm = TRUE)) {
    cat("Body mass: All values biologically plausible\n")
  } else {
    cat("Body mass: Implausible values found\n")
  }
}

# Test with clean data
cat("CLEAN DATA:\n")
validate_penguins(penguins_clean)

# Test with messy data
cat("\n\nMESSY DATA:\n")
validate_penguins(penguins_messy)

How this function works:

  • validate_penguins() takes a data frame as input and checks it against rules from your data dictionary
  • all() checks if every value meets a condition (returns TRUE only if all values pass)
  • %in% checks if values are in the allowed list (e.g., species must be Adelie, Chinstrap, or Gentoo)
  • na.rm = TRUE ignores missing values when checking numeric ranges
  • The function prints different messages depending on whether data passes or fails each check

Three types of validation demonstrated:

  • Structural: Are species names valid categories? (Adelie, Chinstrap, Gentoo)
  • Content: Are bill lengths within the documented range? (30-60mm)
  • Domain: Are body masses biologically plausible? (2500-6500g)

When you run this function on the clean data, all checks pass. When you run it on the messy data, several checks fail, telling you which validation rules the data violates.
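
The function reports which checks failed but not which rows are responsible. One way to locate them is which() on the inverted condition, sketched here for the species check:

# Row numbers whose species is not in the allowed set (ignoring NAs)
valid_species <- c("Adelie", "Chinstrap", "Gentoo")
which(!(penguins_messy$species %in% valid_species) &
        !is.na(penguins_messy$species))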

Integrating Validation into Your Documentation

Remember the penguins_documentation.qmd file you created in the data dictionaries section? Let’s add validation checks to it! To keep things simple, we’ll add a new section that summarizes key validation statistics.

Exercise: Add Validation to Your Living Document

Update your penguins_documentation.qmd file to include validation checks.

Add this section after the automated codebook:

## Data Validation

Below are summary statistics to check data quality:

```{r}
#| echo: false
#| message: false

# Quick validation checks
cat("=== Data Quality Summary ===\n\n")

# Check for data quality issues in numeric variables
cat("Numeric Variables - Range Checks:\n")
cat("Bill length range:", range(penguins$bill_length_mm, na.rm = TRUE), "mm\n")
cat("Bill depth range:", range(penguins$bill_depth_mm, na.rm = TRUE), "mm\n")
cat("Body mass range:", range(penguins$body_mass_g, na.rm = TRUE), "g\n\n")

# Check categorical variables for unexpected values
cat("Categorical Variables - Unique Value Counts:\n")
cat("Species (expect 3):", length(unique(penguins$species)), "\n")
cat("Islands (expect 3):", length(unique(penguins$island)), "\n")
cat("Years (expect 3):", length(unique(penguins$year)), "\n\n")

# Check missing values
cat("Missing Values by Variable:\n")
print(colSums(is.na(penguins)))
```

Then render the document to see your complete data dictionary with built-in validation!

Why this is powerful:

  • Data dictionary + validation in one place gives complete documentation
  • Updates automatically when data changes
  • Easy to share with collaborators
  • Documents both what data should be and what it actually is

As you become more comfortable with data validation, you may want to explore specialized R packages that automate quality checks and generate professional reports:

{pointblank} (Iannone, Vargas, and Choe 2024) - Comprehensive data validation toolkit

  • Define validation rules (e.g., “bill length must be 30-60mm”)
  • Generate HTML validation reports with pass/fail results
  • Set quality thresholds and get alerts when data fails checks
  • Best for: Projects with many datasets or frequent data updates
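
As a rough sketch of the {pointblank} workflow (treat the argument names as assumptions and check the package documentation before relying on them):

library(pointblank)

# Build an agent with rules from the data dictionary, then interrogate the data
agent <- create_agent(tbl = penguins_messy) %>%
  col_vals_between(columns = vars(bill_length_mm),
                   left = 30, right = 60, na_pass = TRUE) %>%
  col_vals_in_set(columns = vars(species),
                  set = c("Adelie", "Chinstrap", "Gentoo")) %>%
  interrogate()

agent  # printing the agent shows a pass/fail validation report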

{validate} (van der Loo and de Jonge 2021) - Rule-based data validation

  • Create validation rules in plain language syntax
  • Check data against your rules and get detailed reports
  • Track which rows fail which rules
  • Best for: Complex validation logic across multiple variables
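
A comparable sketch with {validate} (again, verify the exact syntax against the package docs):

library(validate)

# Express the data dictionary as rules in plain R syntax
rules <- validator(
  bill_length_mm >= 30,
  bill_length_mm <= 60,
  species %in% c("Adelie", "Chinstrap", "Gentoo")
)

# Confront the data with the rules and summarize pass/fail counts per rule
out <- confront(penguins_messy, rules)
summary(out)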

{assertr} (Fischetti 2023) - Defensive programming for data pipelines

  • Add validation checks directly into your data processing code
  • Stop execution if data fails quality checks
  • Best for: Automated workflows and data pipelines
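
A minimal {assertr} sketch, assuming the same documented range:

library(assertr)

# Halts with an informative error if any bill length leaves the 30-60mm range
assert(penguins_messy, within_bounds(30, 60), bill_length_mm)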

{skimr} (Waring et al. 2025) - Enhanced summary statistics

  • More informative summaries than base R summary()
  • Includes histograms and additional statistics
  • Best for: Quick exploratory data validation
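
And {skimr} needs just one call:

library(skimr)

# Richer replacement for summary(): counts, quantiles, and inline histograms
skim(penguins_messy)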

These tools build on the concepts you’ve learned in this tutorial. Start with the base R functions you now know, and explore these packages when your validation needs grow more complex.

Conclusion

Congratulations! You’ve completed this tutorial on data documentation and validation in R. You now know how to:

  • Create data dictionaries manually and automatically with {datawizard}
  • Identify data quality issues using summary statistics
  • Integrate documentation and validation into reproducible Quarto documents
  • Build a complete data quality workflow for your research

These skills form the foundation for reproducible, transparent, and trustworthy research. By documenting and validating your data systematically, you’re contributing to better science and making your work more accessible to others.

Your next steps: Apply these techniques to your own research data. Start small with one dataset, create its documentation, validate it, and build from there. Good data practices become easier with practice!


References

Fischetti, Tony. 2023. Assertr: Assertive Programming for R Analysis Pipelines. https://CRAN.R-project.org/package=assertr.
Iannone, Richard, Mauricio Vargas, and June Choe. 2024. Pointblank: Data Validation and Organization of Metadata for Local and Remote Tables. https://CRAN.R-project.org/package=pointblank.
van der Loo, Mark P. J., and Edwin de Jonge. 2021. “Data Validation Infrastructure for R.” Journal of Statistical Software 97 (10): 1–31. https://doi.org/10.18637/jss.v097.i10.
Waring, Elin, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2025. Skimr: Compact and Flexible Summaries of Data. https://CRAN.R-project.org/package=skimr.