Summary Statistics

Summary statistics are your first line of defense in data validation. They provide a systematic way to spot quality issues, understand your data’s characteristics, and verify that reality matches your documented expectations. Think of them as a health check-up for your dataset.

Summary Statistics as Data Detectives

Beyond Basic Means and Medians

While means and medians are useful, data validation requires a broader perspective. You need statistics that reveal:

  • Range issues: Are values where they should be?
  • Missing patterns: How much data is absent?
  • Distribution problems: Are there unexpected spikes or gaps?
  • Consistency issues: Do related variables align logically?

Building on Your Documentation

Remember the data dictionary you created? Summary statistics help verify whether your actual data matches what you documented:

  • Are the ranges you specified actually present in the data?
  • Do the data types match what you expected?
  • Are missing-value patterns consistent with what you documented?

Detection Exercise

Consider the Palmer Penguins data. If you calculated summary statistics and found:

  • Bill length minimum: -5.2mm
  • Body mass maximum: 95,000g
  • Species count: 5 different values

What would each finding suggest about data quality?

  • Negative bill length indicates data entry errors or unit confusion.
  • An extremely high body mass suggests impossible values or typos.
  • More species than expected points to typos or inconsistent naming.
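
Once the data is loaded (which we do below), each of these checks is a one-liner. A minimal sketch, assuming a data frame named penguins with the standard Palmer Penguins columns:

# Minimal sketch of the checks above (assumes a data frame named `penguins`
# with the standard Palmer Penguins columns; loading code follows below)
min(penguins$bill_length_mm, na.rm = TRUE)  # negative values are impossible
max(penguins$body_mass_g, na.rm = TRUE)     # real penguins weigh roughly 2,700-6,300g
length(unique(penguins$species))            # should be exactly 3 for this dataset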

Getting Started with Base R

R comes with built-in functions for calculating summary statistics. Let’s start by loading some data and exploring it with these base R tools. This will help you understand both what summary statistics reveal and why specialized packages can make the process easier.

Loading the Data

library(readr)

# Load clean penguins data from CSV
penguins_clean <- read_csv("data/penguins_clean.csv", show_col_types = FALSE)

# Load messy penguins data (we'll compare them later)
penguins_messy <- read_csv("data/penguins_messy.csv", show_col_types = FALSE)
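
It’s worth a quick structural sanity check immediately after loading; a brief sketch using base R:

# Quick sanity check: dimensions and column types
dim(penguins_clean)   # number of rows and columns
str(penguins_clean)   # column names, types, and a preview of values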

Using summary() to Spot Issues

The summary() function provides a quick overview of your entire dataset. Let’s use it to compare clean and messy data and see what issues we can spot:

# Compare clean and messy data
cat("=== CLEAN DATA ===\n")
summary(penguins_clean)

cat("\n\n=== MESSY DATA ===\n")
summary(penguins_messy)

What summary() Shows

For numeric variables, summary() displays:

  • Minimum and maximum values
  • Quartiles (25th, 50th/median, 75th percentiles)
  • Mean
  • Count of missing values (NA’s)

For character variables (as read_csv() imports them), it shows only:

  • Length, class, and mode information
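
That terse output for character columns is easy to improve: wrapping the column in factor() makes summary() report frequency counts per category. A small sketch:

# Character columns give only Length/Class/Mode; factors get counts
summary(penguins_messy$species)           # Length, Class, Mode only
summary(factor(penguins_messy$species))   # counts per category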

Exercise: Spot the Differences

Compare the clean and messy outputs. Look for:

  • Negative values where they shouldn’t exist (e.g., negative bill lengths)
  • Extremely large or small numbers (outliers or data entry errors)
  • Unexpected NA counts
  • Minimum and maximum values that don’t make biological sense

What issues can you spot in the messy data?

Digging Deeper: Individual Variable Functions

While summary() is great for a quick overview, sometimes you need more detailed information about specific variables. Let’s inspect individual variables using base R functions:

# Detailed examination of a numeric variable
cat("=== Bill Length (detailed) ===\n")
cat("Range:", min(penguins_messy$bill_length_mm, na.rm = TRUE), "to",
    max(penguins_messy$bill_length_mm, na.rm = TRUE), "\n")
cat("Mean:", round(mean(penguins_messy$bill_length_mm, na.rm = TRUE), 2), "\n")
cat("Median:", round(median(penguins_messy$bill_length_mm, na.rm = TRUE), 2), "\n")
cat("Standard deviation:", round(sd(penguins_messy$bill_length_mm, na.rm = TRUE), 2), "\n")
cat("Missing values:", sum(is.na(penguins_messy$bill_length_mm)), "\n")

# Detailed examination of a categorical variable
cat("\n=== Species (detailed) ===\n")
table(penguins_messy$species, useNA = "always")
cat("\nUnique values:", unique(penguins_messy$species), "\n")
cat("Number of unique values:", length(unique(penguins_messy$species)), "\n")

For numeric variables:

  • min() / max(): Find range boundaries
  • mean(): Average (affected by outliers)
  • median(): Middle value (robust to outliers)
  • sd(): Standard deviation (variability)
  • sum(is.na()): Count missing values

For categorical variables:

  • table(): Frequency counts for each category
  • unique(): List all distinct values
  • length(unique()): Count how many distinct values

Important: Always use na.rm = TRUE with numeric functions to handle missing values!
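
Here’s a minimal illustration of why this matters:

# A single NA propagates through numeric functions unless removed
x <- c(39.1, 39.5, NA, 36.7)
mean(x)                # NA -- the missing value poisons the result
mean(x, na.rm = TRUE)  # 38.43 -- computed on the non-missing values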

Building Custom Validation Functions

Now that you understand the individual functions, let’s combine them into reusable validation functions that you can adapt for your own datasets.

Detailed Numeric Variable Checks

Create a function to examine numeric variables systematically:

# Function to get comprehensive numeric summary
examine_numeric <- function(variable, var_name) {
  cat("\n=== ", var_name, " ===\n")
  cat("Range:", min(variable, na.rm = TRUE), "to", max(variable, na.rm = TRUE), "\n")
  cat("Mean:", round(mean(variable, na.rm = TRUE), 2), "\n")
  cat("Median:", round(median(variable, na.rm = TRUE), 2), "\n")
  cat("Missing values:", sum(is.na(variable)), "\n")
  
  # Check for potential outliers
  q1 <- quantile(variable, 0.25, na.rm = TRUE)
  q3 <- quantile(variable, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  lower_bound <- q1 - 1.5 * iqr
  upper_bound <- q3 + 1.5 * iqr
  
  outliers <- sum(variable < lower_bound | variable > upper_bound, na.rm = TRUE)
  cat("Potential outliers:", outliers, "\n")
}

# Apply to messy data to find issues
examine_numeric(penguins_messy$bill_length_mm, "Bill Length (mm)")
examine_numeric(penguins_messy$bill_depth_mm, "Bill Depth (mm)")
examine_numeric(penguins_messy$flipper_length_mm, "Flipper Length (mm)")
examine_numeric(penguins_messy$body_mass_g, "Body Mass (g)")
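
Rather than repeating the call for each variable, you can loop over the column names; a compact sketch:

# Same checks, applied to all numeric columns of interest in one pass
num_vars <- c("bill_length_mm", "bill_depth_mm",
              "flipper_length_mm", "body_mass_g")
invisible(lapply(num_vars, function(v) examine_numeric(penguins_messy[[v]], v)))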

Categorical Variable Validation

Check categorical variables for unexpected values:

# Examine categorical variables in messy data
cat("=== Species ===\n")
table(penguins_messy$species, useNA = "always")

cat("\n=== Island ===\n")
table(penguins_messy$island, useNA = "always")

cat("\n=== Sex ===\n")
table(penguins_messy$sex, useNA = "always")

cat("\n=== Year ===\n")
table(penguins_messy$year, useNA = "always")
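
These repeated table() calls could also be wrapped in a reusable helper, mirroring examine_numeric(). A sketch (examine_categorical is our own illustrative name, not a base R function):

# Sketch of a categorical counterpart to examine_numeric()
examine_categorical <- function(variable, var_name) {
  cat("\n=== ", var_name, " ===\n")
  print(table(variable, useNA = "always"))  # print() is needed inside functions
  cat("Distinct values:", length(unique(variable)), "\n")
}

examine_categorical(penguins_messy$species, "Species")
examine_categorical(penguins_messy$sex, "Sex")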

Tip: Validation Questions

For each categorical variable, ask:

  • Are all the values ones you expected?
  • Are there any misspellings or variant spellings?
  • Do the frequencies seem reasonable?
  • Are there unexpected missing values?

For the Palmer Penguins data, you should see exactly 3 species, 3 islands, 2 sexes (plus missing values), and 3 years.

Checking Missing Values

Missing values are normal in real data, but you need to verify they match your expectations from the data dictionary:

# Quick check: How many missing values in each variable?
colSums(is.na(penguins_messy))

# Total missing values in the entire dataset
sum(is.na(penguins_messy))
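
Counts are easier to judge as proportions. Since is.na() returns a logical matrix, colMeans() gives the fraction missing per column:

# Percentage of missing values per variable
round(colMeans(is.na(penguins_messy)) * 100, 1)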

Tip: Validation Questions for Missing Data

When checking missing values, ask:

  • Are there MORE NAs than expected? Compare to your data dictionary
  • Are there NAs in variables that should be complete? (e.g., ID columns, required measurements)
  • Are the NAs in expected variables? (e.g., sex might have NAs if not recorded, but species should always be known)

For the penguins data, sex can have missing values (the animal couldn’t be sexed in the field), but measurements like bill_length_mm should rarely be missing if the penguin was measured.
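
When a variable that should be complete does contain NAs, which() pinpoints the rows to investigate:

# Locate rows with an unexpectedly missing measurement
which(is.na(penguins_messy$bill_length_mm))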

Identifying Quality Issues

Red Flags in Summary Statistics

For Numeric Variables:

  • Negative values where impossible (e.g., negative body mass)
  • Extremely large or small values (e.g., 50,000g penguins)
  • All values the same (suggests data entry error)
  • Unexpected missing value patterns

For Categorical Variables:

  • Unexpected categories (e.g., “Adelei” instead of “Adelie”)
  • Too many or too few unique values
  • Categories that should be mutually exclusive appearing together

Tip: Using Your Data Dictionary

Always compare your summary statistics to what you documented in your data dictionary:

  • Do the min/max values match the ranges you specified?
  • Are there more unique categories than you documented?
  • Are there more missing values than expected?
  • Do the data types match?

If you find discrepancies, investigate! Either the data has quality issues, or your documentation needs updating.
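
One way to make this comparison systematic is to encode documented ranges in a list and check them in a loop. A sketch, where the expected_ranges values are illustrative placeholders (take the real ones from your data dictionary):

# Compare actual ranges against documented expectations (illustrative values)
expected_ranges <- list(
  bill_length_mm = c(30, 60),
  body_mass_g    = c(2500, 6500)
)

for (var in names(expected_ranges)) {
  actual <- range(penguins_messy[[var]], na.rm = TRUE)
  ok <- actual[1] >= expected_ranges[[var]][1] &&
    actual[2] <= expected_ranges[[var]][2]
  cat(var, ": ", actual[1], " to ", actual[2],
      if (ok) " (within documented range)\n" else " (OUTSIDE documented range)\n",
      sep = "")
}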

Building Your Summary Statistics Workflow

  1. Start broad: Use summary() to get an overall picture
  2. Go specific: Examine each variable type systematically
  3. Check expectations: Compare findings to your data dictionary
  4. Investigate anomalies: Follow up on anything unexpected
  5. Document findings: Keep track of quality issues discovered

Tip

Create a standard summary statistics script that you can adapt for different datasets in your research area. Save the functions you’ve learned (examine_numeric, table(), etc.) in a script you can reuse.
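
A minimal skeleton for such a script, reusing examine_numeric() from above (the validate_dataset name and the variable lists are ours to adapt, not part of the tutorial’s codebase):

# validation_script.R -- reusable summary-statistics workflow (sketch)
validate_dataset <- function(data, numeric_vars, categorical_vars) {
  cat("Dimensions:", nrow(data), "rows x", ncol(data), "columns\n")
  cat("Missing values per column:\n")
  print(colSums(is.na(data)))
  for (v in numeric_vars) examine_numeric(data[[v]], v)
  for (v in categorical_vars) print(table(data[[v]], useNA = "always"))
}

validate_dataset(penguins_messy,
                 numeric_vars = c("bill_length_mm", "body_mass_g"),
                 categorical_vars = c("species", "island", "sex"))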

Limitations of Manual Checking

While base R summary statistics are powerful, manual checking has limitations:

Time-Consuming:

  • Checking every variable manually takes significant time
  • Repetitive work for each new dataset or data update

Error-Prone:

  • Easy to forget to check a variable
  • Inconsistent standards across different checking sessions
  • No systematic record of what was checked

Hard to Share:

  • Difficult to communicate your validation process to collaborators
  • No formal documentation of quality checks performed
  • Can’t easily reproduce the same checks on updated data

The Solution: Automated validation with {pointblank} (next section) addresses all these issues while building on the concepts you’ve learned here.

Quick Reference

Here’s a handy reference of the base R functions we used in this tutorial:

For Numeric Variables

Range Statistics

  • min(x, na.rm = TRUE): Minimum value
  • max(x, na.rm = TRUE): Maximum value
  • range(x, na.rm = TRUE): Both min and max
  • quantile(x, na.rm = TRUE): Quartiles and percentiles

Central Tendency

  • mean(x, na.rm = TRUE): Average value
  • median(x, na.rm = TRUE): Middle value

Variability

  • sd(x, na.rm = TRUE): Standard deviation
  • IQR(x, na.rm = TRUE): Interquartile range

Missing Data

  • sum(is.na(x)): Count of missing values
  • complete.cases(data): Logical vector flagging rows with no missing values

For Categorical Variables

Frequency Counts

  • table(x, useNA = "always"): Frequency table including NAs
  • prop.table(table(x)): Proportions instead of counts

Unique Values

  • unique(x): All distinct values
  • length(unique(x)): Count of unique values

For Entire Datasets

  • summary(data): Quick overview of all variables
  • nrow(data), ncol(data): Dataset dimensions
  • sapply(data, function): Apply a function to each column
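
For example, sapply() answers the “do the data types match?” question from your data dictionary in one line:

# Class of every column, for comparison against your data dictionary
sapply(penguins_clean, class)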

Note: Remember na.rm = TRUE

Most numeric functions require na.rm = TRUE to handle missing values. Without it, any NA in your data will cause the function to return NA.

Key Takeaways

Summary statistics with base R are essential for data validation:

  • Base R functions (summary(), table(), min(), max(), etc.) provide powerful tools for examining data quality
  • Compare clean vs. messy data to understand what issues look like
  • Build on your data dictionary by verifying actual data matches documented expectations
  • Create reusable functions for consistent quality checks across projects

Note: Reflection

You’ve now seen how summary statistics reveal data quality issues. In the messy dataset, you should have spotted impossible values, typos, and inconsistencies. How confident would you be analyzing data without these checks?

The Challenge: While base R summary statistics help you identify issues, manually checking every variable in every dataset is time-consuming and error-prone. What if you need to validate data every time it’s updated? What if you have dozens of datasets with similar structures?

What’s Next?

In the next section, you’ll learn to use {pointblank} to automate this entire validation process:

  • Define validation rules once, apply them automatically
  • Generate professional HTML validation reports
  • Set thresholds for acceptable data quality
  • Integrate validation into your analysis workflow

The manual checking skills you learned here will help you understand what {pointblank} is checking and why certain validations matter.
