Validation with R Packages

You’ve learned to identify data quality issues through summary statistics. Now let’s automate this process using {pointblank}, a package that systematically checks your data against rules you define and generates comprehensive validation reports.

From Manual Checks to Automated Validation

Building on Summary Statistics

In the previous section, you manually examined the messy Palmer Penguins data and spotted issues. But what if:

You need to check data every time it’s updated?
You have dozens of datasets with similar structures?
You want to document your quality control process?
You need to share validation procedures with collaborators?

The {pointblank} package automates validation while creating professional reports.

The {pointblank} Package

{pointblank} provides two workflows:

Data Quality Reporting (agent-based): Creates detailed HTML validation reports
Information Management (informant-based): Creates data dictionaries with metadata

We’ll focus on validation (you already learned dictionary creation in Section 2).

Installing pointblank

# Install if not already done
install.packages("pointblank")

library(pointblank)
library(readr)
library(dplyr)

Creating a Validation Agent

Think of an agent as a quality inspector that checks your data against rules:

# Load the messy data we created
penguins_messy <- read_csv("data/penguins_messy.csv", show_col_types = FALSE)

# Create an agent to validate the messy data
agent <-
  create_agent(
    tbl = penguins_messy,
    tbl_name = "penguins_messy",
    label = "Palmer Penguins Validation"
  )

Defining Validation Rules

Add validation rules based on your data dictionary. Each rule is a function that checks a specific quality criterion:

Step 1: Basic Structure Checks

agent <- agent %>%
  # Every row should be unique (flags duplicate rows)
  rows_distinct() %>%
  # All expected columns should be present
  col_exists(columns = c("species", "island", "bill_length_mm",
                        "bill_depth_mm", "flipper_length_mm",
                        "body_mass_g", "sex", "year"))
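To see what these two structure checks report, here is a minimal sketch on a made-up table. The names toy_tbl and toy_agent are hypothetical and not part of the tutorial data. It assumes that col_exists() creates one validation step per column, so the agent ends up with three steps.

```r
library(pointblank)

# Hypothetical toy table: rows 1 and 2 are exact duplicates,
# and there is no `island` column
toy_tbl <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo"),
  year    = c(2007, 2007, 2008)
)

toy_agent <- create_agent(tbl = toy_tbl, tbl_name = "toy_tbl") %>%
  rows_distinct() %>%                                 # step 1: duplicate rows
  col_exists(columns = c("species", "island")) %>%    # steps 2 and 3: one per column
  interrogate()

# Number of failing test units for each validation step
n_failed <- get_agent_x_list(toy_agent)$n_failed
n_failed
```

Step 1 should report failing rows because of the duplicate, the species step should pass, and the island step should fail because the column is missing.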

Building Rules Progressively

Start with basic checks (does the column exist?) before moving to content checks (are the values valid?). This helps you quickly identify fundamental issues.

Step 2: Data Type Validation

agent <- agent %>%
  # Numeric columns should be numeric
  col_is_numeric(columns = c("bill_length_mm", "bill_depth_mm",
                            "flipper_length_mm", "body_mass_g", "year"))
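A type check matters because a single text placeholder turns an entire column into character data. The sketch below illustrates this on a made-up table (typed_tbl is hypothetical), where one "unknown" entry makes body_mass_g non-numeric.

```r
library(pointblank)

# Hypothetical table: the "unknown" placeholder forces the whole
# body_mass_g column to be stored as character
typed_tbl <- data.frame(
  bill_length_mm = c(39.1, 46.5, 50.0),
  body_mass_g    = c("3750", "unknown", "5000")
)

type_agent <- create_agent(tbl = typed_tbl, tbl_name = "typed_tbl") %>%
  col_is_numeric(columns = c("bill_length_mm", "body_mass_g")) %>%
  interrogate()

# One entry per column checked: 0 means the type check passed
type_failures <- get_agent_x_list(type_agent)$n_failed
type_failures
```

The first step (bill_length_mm) passes and the second (body_mass_g) fails, which signals that the column needs cleaning before any range check can be meaningful.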

Step 3: Range Validation

Check if values fall within expected ranges from your data dictionary:

agent <- agent %>%
  # Bill length should be between 25-70mm
  col_vals_between(
    columns = bill_length_mm,
    left = 25, right = 70,
    na_pass = TRUE,  # Allow NA values
    label = "Bill length in valid range (25-70mm)"
  ) %>%
  # Bill depth should be between 10-25mm
  col_vals_between(
    columns = bill_depth_mm,
    left = 10, right = 25,
    na_pass = TRUE,
    label = "Bill depth in valid range (10-25mm)"
  ) %>%
  # Flipper length should be between 150-250mm
  col_vals_between(
    columns = flipper_length_mm,
    left = 150, right = 250,
    na_pass = TRUE,
    label = "Flipper length in valid range (150-250mm)"
  ) %>%
  # Body mass should be between 2000-7000g
  col_vals_between(
    columns = body_mass_g,
    left = 2000, right = 7000,
    na_pass = TRUE,
    label = "Body mass in valid range (2000-7000g)"
  ) %>%
  # Year should be 2007, 2008, or 2009
  col_vals_in_set(
    columns = year,
    set = c(2007, 2008, 2009),
    label = "Year is within study period"
  )
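One detail worth knowing about col_vals_between() is how it treats values that sit exactly on a bound. The following sketch, on a made-up table (edge_tbl is hypothetical), compares the default inclusive bounds with exclusive bounds set through the inclusive argument.

```r
library(pointblank)

# Hypothetical values sitting exactly on the 25 mm and 70 mm bounds
edge_tbl <- data.frame(bill_length_mm = c(25, 47.5, 70))

edge_agent <- create_agent(tbl = edge_tbl, tbl_name = "edge_tbl") %>%
  # Default: both bounds are inclusive, so 25 and 70 pass
  col_vals_between(columns = bill_length_mm, left = 25, right = 70) %>%
  # Exclusive bounds: 25 and 70 now fail
  col_vals_between(columns = bill_length_mm, left = 25, right = 70,
                   inclusive = c(FALSE, FALSE)) %>%
  interrogate()

edge_failures <- get_agent_x_list(edge_agent)$n_failed
edge_failures
```

If your data dictionary states a range as "greater than" rather than "at least", the exclusive form is the closer match.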

Step 4: Categorical Value Checks

Ensure categorical variables contain only expected values:

agent <- agent %>%
  # Species should be one of three types
  col_vals_in_set(
    columns = species,
    set = c("Adelie", "Chinstrap", "Gentoo"),
    label = "Species is valid (Adelie, Chinstrap, or Gentoo)"
  ) %>%
  # Island should be one of three islands
  col_vals_in_set(
    columns = island,
    set = c("Torgersen", "Biscoe", "Dream"),
    label = "Island is valid (Torgersen, Biscoe, or Dream)"
  ) %>%
  # Sex should be male or female
  # Note: NA values fail this check; add NA to `set` if missing values are acceptable
  col_vals_in_set(
    columns = sex,
    set = c("male", "female"),
    label = "Sex is male or female"
  )
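Unlike the range checks, col_vals_in_set() has no na_pass argument, so missing values count as failures unless NA is part of the allowed set. Here is a small sketch on a made-up table (sex_tbl is hypothetical) that shows both variants side by side.

```r
library(pointblank)

# Hypothetical table: one missing value and one inconsistent code ("M")
sex_tbl <- data.frame(
  sex = c("male", "female", NA, "M"),
  stringsAsFactors = FALSE
)

sex_agent <- create_agent(tbl = sex_tbl, tbl_name = "sex_tbl") %>%
  # Strict set: both NA and "M" fail
  col_vals_in_set(columns = sex, set = c("male", "female")) %>%
  # NA added to the set: only "M" fails
  col_vals_in_set(columns = sex, set = c("male", "female", NA)) %>%
  interrogate()

sex_failures <- get_agent_x_list(sex_agent)$n_failed
sex_failures
```

Which variant to use depends on whether missing values are an expected part of the variable, which is worth recording in the data dictionary.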

Step 5: Impossible Value Checks

Check for biologically impossible values:

agent <- agent %>%
  # No negative measurements
  col_vals_gt(
    columns = bill_length_mm,
    value = 0,
    na_pass = TRUE,
    label = "Bill length is positive"
  ) %>%
  col_vals_gt(
    columns = body_mass_g,
    value = 0,
    na_pass = TRUE,
    label = "Body mass is positive"
  )

Running the Validation

Execute all validation rules and generate a report:

# Interrogate the data (run all validations)
agent <- interrogate(agent)

# View the report
agent

The agent creates an interactive HTML report showing:

Overview: How many tests passed/failed
Each validation rule: With pass/fail status and failure counts
Severity levels: Which issues are critical vs. warnings
Sample failures: Examples of rows that failed each test
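The same information can also be pulled out of the agent as ordinary R objects, which is handy in scripts. The sketch below uses get_agent_x_list() for the counts and get_data_extracts() for the failing rows. The table report_tbl is made up for illustration.

```r
library(pointblank)

# Hypothetical table: 500 and 15000 fall outside the 2000-7000 g range
report_tbl <- data.frame(body_mass_g = c(3750, 500, 4200, 15000, 5100))

report_agent <- create_agent(tbl = report_tbl, tbl_name = "report_tbl") %>%
  col_vals_between(columns = body_mass_g, left = 2000, right = 7000) %>%
  interrogate()

x <- get_agent_x_list(report_agent)
x$n          # test units (rows) checked in each step
x$n_failed   # failing test units in each step
x$f_failed   # fraction of failing test units in each step

# Rows that failed validation step 1
failed_rows <- get_data_extracts(report_agent, i = 1)
failed_rows
```

Working with the extracted rows directly makes it easier to trace each failure back to its source record.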

Interpreting Results

When you run validation on penguins_messy, you should see failures for:

Species: 4 issues (typos like “Adelei”, case errors like “adelie”, extra text like “Gentoo penguin”)
Island: 2 issues (typo “Torgerson”, case error “biscoe”)
Bill length: 3 issues (negative -5.2, impossibly large 250.5, placeholder 999)
Bill depth: 2 issues (negative -2.1, placeholder 99.9)
Flipper length: 1 issue (zero value)
Body mass: 3 issues (extreme outliers: 15000g, 500g, 10000g)
Sex: 22 failures total (11 NAs that are acceptable + 11 actual issues with inconsistent coding: M/F/male/Male/MALE/m/Female)
Year: 4 issues (2020, 2006, 207, 20009 - outside 2007-2009 range or digit errors)

Total: 30 data quality issues we intentionally added for practice! The sex check also flags the 11 acceptable NA values, which brings the failures reported by these checks to 41. Because one value can break more than one rule, the negative bill length is also reported by the Step 5 positivity check.
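The totals can be double-checked with a few lines of base R. The counts are copied from the list above, and the vector names are only labels chosen for this sketch.

```r
# Intentional issues per variable, as listed above
issues <- c(
  species = 4, island = 2, bill_length = 3, bill_depth = 2,
  flipper_length = 1, body_mass = 3, sex = 11, year = 4
)

# NA values in sex that the set check flags but that are acceptable
acceptable_na_sex <- 11

total_issues   <- sum(issues)                       # intentional issues
total_reported <- total_issues + acceptable_na_sex  # issues plus flagged NAs

total_issues
total_reported
```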

Comparing Clean vs. Messy Data

Let’s validate both versions to see the difference:

# Load and validate clean data
penguins_clean <- read_csv("data/penguins_clean.csv", show_col_types = FALSE)

agent_clean <-
  create_agent(
    tbl = penguins_clean,
    tbl_name = "penguins_clean",
    label = "Palmer Penguins (Clean) Validation"
  ) %>%
  # Apply a representative subset of the validation rules above
  col_vals_between(columns = bill_length_mm, left = 25, right = 70, na_pass = TRUE) %>%
  col_vals_between(columns = body_mass_g, left = 2000, right = 7000, na_pass = TRUE) %>%
  col_vals_in_set(columns = species, set = c("Adelie", "Chinstrap", "Gentoo")) %>%
  col_vals_in_set(columns = year, set = c(2007, 2008, 2009)) %>%
  interrogate()

# View clean data report
agent_clean

# Compare: clean should pass all checks, messy should fail several

Setting Failure Thresholds

Define what level of failures is acceptable:

agent <-
  create_agent(
    tbl = penguins_messy,
    tbl_name = "penguins_messy"
  ) %>%
  col_vals_between(
    columns = bill_length_mm,
    left = 25, right = 70,
    na_pass = TRUE,
    actions = action_levels(warn_at = 0.01, stop_at = 0.05)
    # Warn once 1% of rows fail; flag the stop condition once 5% fail
  ) %>%
  interrogate()

Threshold Strategy

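Set thresholds based on your research needs. Values below 1 are read as a fraction of rows; values of 1 or more are read as an absolute number of failing rows.

stop_at = 1: Zero tolerance (the step is flagged with the stop condition as soon as one row fails)
warn_at = 0.01: Flag for investigation once 1% of rows fail
notify_at = 0.001: Notification once 0.1% of rows fail (minor issues)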

For published datasets, use strict thresholds. For preliminary data collection, more lenient thresholds may be appropriate.
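To check which thresholds a step actually crossed, the warn and stop flags can be read from the agent after interrogation. This is a minimal sketch on a made-up table (thr_tbl is hypothetical) in which 2 of 10 values fail. The thresholds of 10% and 50% are chosen only to make the outcome easy to see.

```r
library(pointblank)

# Hypothetical table: 2 of 10 values (20%) fall outside 25-70 mm
thr_tbl <- data.frame(
  bill_length_mm = c(39.1, 40.3, -5.2, 46.5, 999, 50.0, 38.9, 45.2, 49.1, 41.1)
)

thr_agent <- create_agent(tbl = thr_tbl, tbl_name = "thr_tbl") %>%
  col_vals_between(
    columns = bill_length_mm, left = 25, right = 70,
    actions = action_levels(warn_at = 0.1, stop_at = 0.5)
  ) %>%
  interrogate()

x_thr <- get_agent_x_list(thr_agent)
x_thr$f_failed  # fraction of rows that failed
x_thr$warn      # TRUE when the warn threshold was reached
x_thr$stop      # TRUE when the stop threshold was reached
```

With an agent, these conditions are recorded in the report rather than halting the script, so a quality gate still has to act on them explicitly.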

Creating Reusable Validation Functions

Build a validation workflow you can apply to any similar dataset:

validate_penguin_data <- function(data, data_name = "Penguin Data") {

  agent <-
    create_agent(
      tbl = data,
      tbl_name = data_name,
      label = paste(data_name, "Validation Report")
    ) %>%
    # Structure checks
    col_exists(columns = c("species", "island", "bill_length_mm",
                          "bill_depth_mm", "flipper_length_mm",
                          "body_mass_g", "sex", "year")) %>%
    col_is_numeric(columns = c("bill_length_mm", "bill_depth_mm",
                              "flipper_length_mm", "body_mass_g", "year")) %>%
    # Range checks
    col_vals_between(columns = bill_length_mm, left = 25, right = 70,
                    na_pass = TRUE, label = "Bill length 25-70mm") %>%
    col_vals_between(columns = bill_depth_mm, left = 10, right = 25,
                    na_pass = TRUE, label = "Bill depth 10-25mm") %>%
    col_vals_between(columns = flipper_length_mm, left = 150, right = 250,
                    na_pass = TRUE, label = "Flipper length 150-250mm") %>%
    col_vals_between(columns = body_mass_g, left = 2000, right = 7000,
                    na_pass = TRUE, label = "Body mass 2000-7000g") %>%
    # Categorical checks
    col_vals_in_set(columns = species,
                   set = c("Adelie", "Chinstrap", "Gentoo"),
                   label = "Valid species") %>%
    col_vals_in_set(columns = island,
                   set = c("Torgersen", "Biscoe", "Dream"),
                   label = "Valid island") %>%
    # NA is included in the set because missing sex is acceptable here
    col_vals_in_set(columns = sex, set = c("male", "female", NA),
                   label = "Valid sex (NA allowed)") %>%
    col_vals_in_set(columns = year, set = c(2007, 2008, 2009),
                   label = "Valid year") %>%
    # Impossible value checks
    col_vals_gt(columns = bill_length_mm, value = 0, na_pass = TRUE,
               label = "Positive bill length") %>%
    col_vals_gt(columns = body_mass_g, value = 0, na_pass = TRUE,
               label = "Positive body mass") %>%
    # Run validation
    interrogate()

  return(agent)
}

# Use the function
messy_validation <- validate_penguin_data(penguins_messy, "Messy Penguins")
clean_validation <- validate_penguin_data(penguins_clean, "Clean Penguins")

# View reports
messy_validation
clean_validation

Exporting Validation Reports

Share validation results with collaborators:

# Export as HTML file
export_report(agent, filename = "penguin_validation_report.html")

# The HTML file can be:
# - Shared via email
# - Posted on project websites
# - Included in data repositories
# - Attached to manuscripts as supplementary material
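For archiving, it can help to put the date in the file name and write to a dedicated folder. The sketch below does this with the path argument of export_report(). The table, the file name pattern, and the use of a temporary directory are all assumptions made for illustration.

```r
library(pointblank)

# Hypothetical table with one year outside the study period
export_tbl <- data.frame(year = c(2007, 2008, 2020))

export_agent <- create_agent(tbl = export_tbl, tbl_name = "export_tbl") %>%
  col_vals_in_set(columns = year, set = c(2007, 2008, 2009)) %>%
  interrogate()

# Date-stamped file name so older reports are not overwritten
report_name <- paste0("toy_validation_", format(Sys.Date(), "%Y-%m-%d"), ".html")
out_dir <- tempdir()  # replace with a project folder such as "reports"

export_report(export_agent, filename = report_name, path = out_dir, quiet = TRUE)

report_file <- file.path(out_dir, report_name)
file.exists(report_file)
```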

Integration with Your Workflow

Validation as Quality Gate

Make validation a required step before analysis:

# Workflow that runs the analysis only when every validation step passes
safe_analysis_workflow <- function(data) {

  # Step 1: Validate
  cat("Step 1: Validating data...\n")
  agent <- validate_penguin_data(data, "Analysis Data")

  # Step 2: Check if validation passed
  if (all_passed(agent)) {
    cat("Validation passed! Proceeding with analysis.\n")

    # Step 3: Your analysis code here
    # analysis_results <- run_analysis(data)

  } else {
    cat("Validation failed! Review issues before analyzing.\n")
    cat("See validation report for details.\n")
    return(agent)  # Return agent to review failures
  }
}

# Use in your workflow
safe_analysis_workflow(penguins_messy)  # Skips the analysis and returns the agent
safe_analysis_workflow(penguins_clean)  # Proceeds if every validation step passes
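A stricter gate is also possible without an agent. When a validation function is applied directly to a data frame, {pointblank} raises an error on failure by default, which halts a script. This sketch wraps the call in tryCatch() only so that the outcome can be shown. The helper check_mass() and both tables are hypothetical.

```r
library(pointblank)

# Hypothetical helper: returns "passed" or "stopped" for a table
check_mass <- function(tbl) {
  tryCatch(
    {
      # Applied directly to data, a failing check raises an error
      tbl %>% col_vals_gt(columns = body_mass_g, value = 0)
      "passed"
    },
    error = function(e) "stopped"
  )
}

bad_tbl  <- data.frame(body_mass_g = c(3750, -10, 4200))
good_tbl <- data.frame(body_mass_g = c(3750, 4100, 4200))

bad_result  <- check_mass(bad_tbl)
good_result <- check_mass(good_tbl)
bad_result
good_result
```

In a real pipeline you would usually drop the tryCatch() and let the error stop the run, so that no analysis is produced from data that failed validation.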

Best Practices

Validate early: Check data quality before investing time in analysis
Validate often: Re-run validation each time data is updated
Document thresholds: Explain why you set specific failure thresholds
Share validation code: Include in analysis scripts or data repositories
Archive reports: Keep validation reports with analysis outputs

Key Takeaways

Systematic validation with {pointblank} provides:

Automated checking: No manual inspection of every value
Comprehensive reports: Professional HTML documentation
Reproducibility: Same validation rules applied consistently
Collaboration: Share validation procedures with colleagues
Confidence: Know your data meets quality standards

Reflection

Compare your experience with manual checking (Section 3.2) versus automated validation. Which issues were easier to catch with each approach? How might combining both methods strengthen your data quality workflow?

Congratulations!

You’ve completed this tutorial on data documentation and validation! You now have the skills to:

Create comprehensive data dictionaries (manual and automated)
Use summary statistics to identify data quality issues
Implement systematic validation workflows with R packages
Generate professional validation reports

These skills will help you maintain high data quality standards, collaborate more effectively, and contribute to reproducible research practices in your field.
