```r
# Install if not already done
install.packages("pointblank")

library(pointblank)
library(readr)
library(dplyr)
```
# Validation with R Packages

You've learned to identify data quality issues through summary statistics. Now let's automate this process using {pointblank}, a package that systematically checks your data against rules you define and generates comprehensive validation reports.
## From Manual Checks to Automated Validation

### Building on Summary Statistics

In the previous section, you manually examined the messy Palmer Penguins data and spotted issues. But what if:

- You need to check data every time it's updated?
- You have dozens of datasets with similar structures?
- You want to document your quality control process?
- You need to share validation procedures with collaborators?
The {pointblank} package automates validation while creating professional reports.

## The {pointblank} Package

{pointblank} provides two workflows:

- Data Quality Reporting (agent-based): creates detailed HTML validation reports
- Information Management (informant-based): creates data dictionaries with metadata

We'll focus on validation (you already learned dictionary creation in Section 2).
## Installing pointblank

If you haven't installed {pointblank} yet, run the `install.packages()` call from the setup block at the top of this section once, then load the package with `library(pointblank)`.

## Creating a Validation Agent

Think of an agent as a quality inspector that checks your data against rules:

```r
# Load the messy data we created
penguins_messy <- read_csv("data/penguins_messy.csv", show_col_types = FALSE)

# Create an agent to validate the messy data
agent <- create_agent(
  tbl = penguins_messy,
  tbl_name = "penguins_messy",
  label = "Palmer Penguins Validation"
)
```
## Defining Validation Rules

Add validation rules based on your data dictionary. Each rule is a function that checks a specific quality criterion.

### Step 1: Basic Structure Checks

```r
agent <- agent %>%
  # Check that no rows are exact duplicates
  rows_distinct() %>%
  # Check that all expected columns exist
  col_exists(columns = c("species", "island", "bill_length_mm",
                         "bill_depth_mm", "flipper_length_mm",
                         "body_mass_g", "sex", "year"))
```

Start with basic checks (does the column exist?) before moving to content checks (are the values valid?). This helps you quickly identify fundamental issues.
### Step 2: Data Type Validation

```r
agent <- agent %>%
  # Numeric columns should be numeric
  col_is_numeric(columns = c("bill_length_mm", "bill_depth_mm",
                             "flipper_length_mm", "body_mass_g", "year"))
```
### Step 3: Range Validation

Check that values fall within the expected ranges from your data dictionary:

```r
agent <- agent %>%
  # Bill length should be between 25-70mm
  col_vals_between(
    columns = bill_length_mm,
    left = 25, right = 70,
    na_pass = TRUE, # Allow NA values
    label = "Bill length in valid range (25-70mm)"
  ) %>%
  # Bill depth should be between 10-25mm
  col_vals_between(
    columns = bill_depth_mm,
    left = 10, right = 25,
    na_pass = TRUE,
    label = "Bill depth in valid range (10-25mm)"
  ) %>%
  # Flipper length should be between 150-250mm
  col_vals_between(
    columns = flipper_length_mm,
    left = 150, right = 250,
    na_pass = TRUE,
    label = "Flipper length in valid range (150-250mm)"
  ) %>%
  # Body mass should be between 2000-7000g
  col_vals_between(
    columns = body_mass_g,
    left = 2000, right = 7000,
    na_pass = TRUE,
    label = "Body mass in valid range (2000-7000g)"
  ) %>%
  # Year should be 2007, 2008, or 2009
  col_vals_in_set(
    columns = year,
    set = c(2007, 2008, 2009),
    label = "Year is within study period"
  )
```
### Step 4: Categorical Value Checks

Ensure categorical variables contain only expected values:

```r
agent <- agent %>%
  # Species should be one of three types
  col_vals_in_set(
    columns = species,
    set = c("Adelie", "Chinstrap", "Gentoo"),
    label = "Species is valid (Adelie, Chinstrap, or Gentoo)"
  ) %>%
  # Island should be one of three islands
  col_vals_in_set(
    columns = island,
    set = c("Torgersen", "Biscoe", "Dream"),
    label = "Island is valid (Torgersen, Biscoe, or Dream)"
  ) %>%
  # Sex should be male or female
  # Note: NA values will be flagged but may be acceptable for this variable
  col_vals_in_set(
    columns = sex,
    set = c("male", "female"),
    label = "Sex is male or female"
  )
```
### Step 5: Impossible Value Checks

Check for biologically impossible values:

```r
agent <- agent %>%
  # No negative measurements
  col_vals_gt(
    columns = bill_length_mm,
    value = 0,
    na_pass = TRUE,
    label = "Bill length is positive"
  ) %>%
  col_vals_gt(
    columns = body_mass_g,
    value = 0,
    na_pass = TRUE,
    label = "Body mass is positive"
  )
```
## Running the Validation

Execute all validation rules and generate a report:

```r
# Interrogate the data (run all validations)
agent <- interrogate(agent)

# View the report
agent
```

The agent creates an interactive HTML report showing:

- Overview: how many tests passed and failed
- Each validation rule: pass/fail status and failure counts
- Severity levels: which issues are critical versus warnings
- Sample failures: examples of rows that failed each test
When you run validation on `penguins_messy`, you should see failures for:

- Species: 4 issues (typos like "Adelei", case errors like "adelie", extra text like "Gentoo penguin")
- Island: 2 issues (typo "Torgerson", case error "biscoe")
- Bill length: 3 issues (negative -5.2, impossibly large 250.5, placeholder 999)
- Bill depth: 2 issues (negative -2.1, placeholder 99.9)
- Flipper length: 1 issue (zero value)
- Body mass: 3 issues (extreme outliers: 15000g, 500g, 10000g)
- Sex: 22 failures total (11 NAs that are acceptable plus 11 actual issues with inconsistent coding: M/F/male/Male/MALE/m/Female)
- Year: 4 issues (2020, 2006, 207, 20009: outside the 2007-2009 range or digit errors)

Total: 30 data quality issues we intentionally added for practice! Note that the sex validation also flags the 11 NA values (which are acceptable in this dataset), bringing the total reported failures to 41.
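Beyond reading the HTML report, you can pull the failing rows out programmatically with {pointblank}'s `get_data_extracts()` function. Here is a minimal sketch on a small made-up table (the `demo` tibble and its values are hypothetical, standing in for `penguins_messy`):

```r
library(pointblank)
library(dplyr)

# Hypothetical mini-table with one out-of-range bill length (999)
demo <- tibble(bill_length_mm = c(39.1, 999, 46.5))

demo_agent <- create_agent(tbl = demo) %>%
  col_vals_between(columns = bill_length_mm, left = 25, right = 70) %>%
  interrogate()

# Rows that failed validation step 1 (here, the row containing 999)
failed_rows <- get_data_extracts(demo_agent, i = 1)
failed_rows
```

Step indices follow the order in which rules were added to the agent, so `i = 1` here refers to the single `col_vals_between()` rule.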
## Comparing Clean vs. Messy Data

Let's validate both versions to see the difference:

```r
# Load and validate clean data
penguins_clean <- read_csv("data/penguins_clean.csv", show_col_types = FALSE)

agent_clean <- create_agent(
  tbl = penguins_clean,
  tbl_name = "penguins_clean",
  label = "Palmer Penguins (Clean) Validation"
) %>%
  # Add all the same validation rules as above
  col_vals_between(columns = bill_length_mm, left = 25, right = 70, na_pass = TRUE) %>%
  col_vals_between(columns = body_mass_g, left = 2000, right = 7000, na_pass = TRUE) %>%
  col_vals_in_set(columns = species, set = c("Adelie", "Chinstrap", "Gentoo")) %>%
  col_vals_in_set(columns = year, set = c(2007, 2008, 2009)) %>%
  interrogate()

# View clean data report
agent_clean

# Compare: clean should pass all checks, messy should fail several
```
## Setting Failure Thresholds

Define what level of failures is acceptable:

```r
agent <- create_agent(
  tbl = penguins_messy,
  tbl_name = "penguins_messy"
) %>%
  col_vals_between(
    columns = bill_length_mm,
    left = 25, right = 70,
    na_pass = TRUE,
    # Warn if >1% of rows fail, stop if >5% of rows fail
    actions = action_levels(warn_at = 0.01, stop_at = 0.05)
  ) %>%
  interrogate()
```

Set thresholds based on your research needs:

- `stop_at = 1`: zero tolerance (a single failing row marks the step as a critical failure)
- `warn_at = 0.01`: flag if more than 1% of rows fail (for investigation)
- `notify_at = 0.001`: notify if more than 0.1% of rows fail (minor issues)

Fractional values are interpreted as proportions of rows; whole numbers are interpreted as absolute counts. For published datasets, use strict thresholds. For preliminary data collection, more lenient thresholds may be appropriate.
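You can also define one `action_levels()` policy and pass it to `create_agent()`, so that every subsequent rule inherits the same thresholds instead of repeating `actions =` in each call. A sketch on a small made-up table (the `demo` tibble is hypothetical; with real data you would pass `penguins_messy`):

```r
library(pointblank)
library(dplyr)

# Shared severity policy: warn if >1% of rows fail a step, stop-level if >5%
shared_policy <- action_levels(warn_at = 0.01, stop_at = 0.05)

# Hypothetical mini-table with one bad measurement and one bad label
demo <- tibble(
  bill_length_mm = c(39.1, 999, 46.5, 41.0),
  species = c("Adelie", "Adelie", "Gentoo", "Adelei")
)

demo_agent <- create_agent(tbl = demo, actions = shared_policy) %>%
  col_vals_between(columns = bill_length_mm, left = 25, right = 70, na_pass = TRUE) %>%
  col_vals_in_set(columns = species, set = c("Adelie", "Chinstrap", "Gentoo")) %>%
  interrogate()

# Both steps have failing rows, so the agent does not fully pass
all_passed(demo_agent)
```

Note that in the agent workflow, reaching `stop_at` marks the step's severity in the report rather than halting your R session; if you want a validation to raise an actual error, apply the `col_vals_*()` functions directly to the data frame instead of to an agent.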
## Creating Reusable Validation Functions

Build a validation workflow you can apply to any similar dataset:

```r
validate_penguin_data <- function(data, data_name = "Penguin Data") {

  agent <- create_agent(
    tbl = data,
    tbl_name = data_name,
    label = paste(data_name, "Validation Report")
  ) %>%
    # Structure checks
    col_exists(columns = c("species", "island", "bill_length_mm",
                           "bill_depth_mm", "flipper_length_mm",
                           "body_mass_g", "sex", "year")) %>%
    col_is_numeric(columns = c("bill_length_mm", "bill_depth_mm",
                               "flipper_length_mm", "body_mass_g", "year")) %>%
    # Range checks
    col_vals_between(columns = bill_length_mm, left = 25, right = 70,
                     na_pass = TRUE, label = "Bill length 25-70mm") %>%
    col_vals_between(columns = bill_depth_mm, left = 10, right = 25,
                     na_pass = TRUE, label = "Bill depth 10-25mm") %>%
    col_vals_between(columns = flipper_length_mm, left = 150, right = 250,
                     na_pass = TRUE, label = "Flipper length 150-250mm") %>%
    col_vals_between(columns = body_mass_g, left = 2000, right = 7000,
                     na_pass = TRUE, label = "Body mass 2000-7000g") %>%
    # Categorical checks
    col_vals_in_set(columns = species,
                    set = c("Adelie", "Chinstrap", "Gentoo"),
                    label = "Valid species") %>%
    col_vals_in_set(columns = island,
                    set = c("Torgersen", "Biscoe", "Dream"),
                    label = "Valid island") %>%
    col_vals_in_set(columns = sex, set = c("male", "female"),
                    label = "Valid sex") %>%
    col_vals_in_set(columns = year, set = c(2007, 2008, 2009),
                    label = "Valid year") %>%
    # Impossible value checks
    col_vals_gt(columns = bill_length_mm, value = 0, na_pass = TRUE,
                label = "Positive bill length") %>%
    col_vals_gt(columns = body_mass_g, value = 0, na_pass = TRUE,
                label = "Positive body mass") %>%
    # Run validation
    interrogate()

  return(agent)
}

# Use the function
messy_validation <- validate_penguin_data(penguins_messy, "Messy Penguins")
clean_validation <- validate_penguin_data(penguins_clean, "Clean Penguins")

# View reports
messy_validation
clean_validation
```
## Exporting Validation Reports

Share validation results with collaborators:

```r
# Export as HTML file
export_report(agent, filename = "penguin_validation_report.html")

# The HTML file can be:
# - Shared via email
# - Posted on project websites
# - Included in data repositories
# - Attached to manuscripts as supplementary material
```
## Integration with Your Workflow

### Validation as Quality Gate

Make validation a required step before analysis:

```r
# Validation function that stops if critical failures occur
safe_analysis_workflow <- function(data) {

  # Step 1: Validate
  cat("Step 1: Validating data...\n")
  agent <- validate_penguin_data(data, "Analysis Data")

  # Step 2: Check if validation passed
  if (all_passed(agent)) {
    cat("Validation passed! Proceeding with analysis.\n")
    # Step 3: Your analysis code here
    # analysis_results <- run_analysis(data)
  } else {
    cat("Validation failed! Review issues before analyzing.\n")
    cat("See validation report for details.\n")
    return(agent) # Return agent to review failures
  }
}

# Use in your workflow
safe_analysis_workflow(penguins_messy) # Should stop
safe_analysis_workflow(penguins_clean) # Should proceed
```
Keep these practices in mind:

- Validate early: check data quality before investing time in analysis
- Validate often: re-run validation each time data is updated
- Document thresholds: explain why you set specific failure thresholds
- Share validation code: include it in analysis scripts or data repositories
- Archive reports: keep validation reports with analysis outputs
## Key Takeaways

Systematic validation with {pointblank} provides:

- Automated checking: no manual inspection of every value
- Comprehensive reports: professional HTML documentation
- Reproducibility: the same validation rules applied consistently
- Collaboration: share validation procedures with colleagues
- Confidence: know your data meets quality standards

Compare your experience with manual checking (Section 3.2) versus automated validation. Which issues were easier to catch with each approach? How might combining both methods strengthen your data quality workflow?
## Congratulations!

You've completed this tutorial on data documentation and validation! You now have the skills to:

- Create comprehensive data dictionaries (manual and automated)
- Use summary statistics to identify data quality issues
- Implement systematic validation workflows with R packages
- Generate professional validation reports

These skills will help you maintain high data quality standards, collaborate more effectively, and contribute to reproducible research practices in your field.