Automated Creation with R

In the previous section, you created a data dictionary manually, carefully considering each variable’s meaning, valid values, and context. This thoughtful approach is essential, but for larger datasets or when documentation needs regular updates, R can automate much of this work while preserving your expertise.

From Manual to Automated

You can create data dictionaries in three ways:

  1. Manually (previous section): Full control, good for small datasets
  2. Automated with R (this section): Efficient for larger datasets, easy to update
  3. Hybrid approach: Start with automated generation, then refine with your domain knowledge

Most research projects benefit from automation, especially when data evolves during collection or analysis.

The {datawizard} Package

The {datawizard} package provides a simple, reliable way to create data dictionaries automatically. The main function you’ll use is data_codebook(), which generates structured tables that include:

  • Variable names and types
  • Missing value counts
  • Value ranges for numeric variables
  • Frequency counts for categorical variables
  • Summary statistics

Creating Your First Automated Dictionary

Let’s create a codebook for the Palmer Penguins data.

Step 1: Load and Prepare Data

library(readr)
library(dplyr)
library(datawizard)

# Load the clean penguins data from CSV
penguins <- read_csv("data/penguins_clean.csv", show_col_types = FALSE)

Step 2: Generate the Codebook

# Create the data codebook using datawizard's data_codebook() function
codebook_penguins <- data_codebook(penguins)

# View the codebook
print(codebook_penguins)

This single command generates a structured table with:

  • Variable ID and Name
  • Data Type (character, numeric, etc.)
  • Missing Values (count and percentage)
  • Values/Ranges (categories or numeric ranges)
  • Frequencies (N and percentages for each category)

What Does the Output Look Like?

The data_codebook() function creates a formatted table. For each variable, you’ll see:

  • Categorical variables: List of unique values with counts and percentages
  • Numeric variables: Range (min, max) and count of non-missing values
  • Missing data: Explicit count and percentage for transparency
  • Clean formatting: Easy to read in console, save to file, or include in reports

The output is a data frame that can be exported to CSV, Excel, or included in R Markdown/Quarto documents.
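To make these summaries concrete, here is a minimal base-R sketch that computes the same kinds of information by hand on a toy data frame. It is not how data_codebook() is implemented; the `toy` data frame and its values are hypothetical and stand in for a few penguin columns.

```r
# Toy data standing in for a few penguin columns (values are hypothetical)
toy <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", NA),
  body_mass_g = c(3750, 3800, NA, 5000),
  stringsAsFactors = FALSE
)

# Missing values: count and percentage per column
n_missing <- colSums(is.na(toy))
pct_missing <- round(100 * n_missing / nrow(toy), 1)

# Categorical variable: frequency of each observed value (NA is excluded by default)
species_counts <- table(toy$species)

# Numeric variable: range of the non-missing values
mass_range <- range(toy$body_mass_g, na.rm = TRUE)

print(data.frame(variable = names(toy), n_missing, pct_missing, row.names = NULL))
print(species_counts)
print(mass_range)
```

Writing this by hand for every column quickly becomes tedious, which is the repetitive work the codebook function takes over.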

Step 3: Add Variable Labels (Optional)

You can add descriptive labels to make the codebook more informative. These labels are stored as metadata (attributes) attached to each column, similar to how pandas allows you to add .attrs to Series or DataFrames.

What this code does:

  • structure(column, label = "description") attaches a label attribute to a column without changing its data
  • mutate() replaces each column with a labeled version of itself (same data, but with metadata attached)
  • The %>% pipe operator passes the data from one step to the next (like method chaining in Python)

Important: The number of columns stays the same, and the data values don’t change. You’re just adding descriptive metadata that data_codebook() can extract and display.
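To see the mechanism in isolation before applying it to every column, here is a minimal base-R sketch on a single toy vector. The vector `bill_length` and its values are hypothetical; only structure(), attr(), and as.vector() from base R are used.

```r
# Toy numeric vector (hypothetical values)
bill_length <- c(39.1, 39.5, NA)

# Attach a label attribute; the values themselves are untouched
bill_labeled <- structure(bill_length, label = "Bill length from tip to base (mm)")

# Read the label back from the attribute
print(attr(bill_labeled, "label"))

# Same values once the attributes are dropped
print(identical(as.vector(bill_labeled), bill_length))

# Calculations still work on the labelled vector
print(mean(bill_labeled, na.rm = TRUE))
```

The same pattern, applied column by column inside mutate(), is what the full example does.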

# Add variable labels using attributes
penguins_labeled <- penguins %>%
  mutate(
    species = structure(species, label = "Penguin species observed in Palmer Archipelago"),
    island = structure(island, label = "Island where penguin was observed"),
    bill_length_mm = structure(bill_length_mm, label = "Bill length from tip to base (mm)"),
    bill_depth_mm = structure(bill_depth_mm, label = "Bill depth at base (mm)"),
    flipper_length_mm = structure(flipper_length_mm, label = "Flipper length from body to tip (mm)"),
    body_mass_g = structure(body_mass_g, label = "Body mass measured in field (grams)"),
    sex = structure(sex, label = "Biological sex (male, female, or NA)"),
    year = structure(year, label = "Year of observation (2007-2009)")
  )

# Generate codebook with labels
codebook_labeled <- data_codebook(penguins_labeled)
print(codebook_labeled)
Tip: How Labels Are Preserved

The labels you added with structure() are stored as attributes (metadata) attached to each column in memory. The data_codebook() function extracts these labels and includes them as regular data in the codebook table itself. This means:

  • While working in R: Labels are stored as metadata on penguins_labeled
  • In the codebook: Labels become regular text in a “Label” column of codebook_labeled
  • When you export the codebook (next section): The labels are saved as normal data, so they’ll be preserved in CSV files

The original penguins_labeled data would lose its label attributes if saved as CSV, but the codebook preserves them because they’re converted from metadata into regular table content.
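The difference can be checked with a small base-R sketch. It uses a hypothetical one-column data frame and a hand-built label table rather than data_codebook(), so it illustrates the idea (metadata copied into regular table content) and not the package's actual implementation. The helper `get_label()` is an illustrative name, not part of any package.

```r
# Toy data frame with one labelled column (values and label are hypothetical)
toy <- data.frame(body_mass_g = c(3750, 3800, 5000))
attr(toy$body_mass_g, "label") <- "Body mass measured in field (grams)"

# Illustrative helper: return a column's label, or NA if it has none
get_label <- function(col) {
  lab <- attr(col, "label")
  if (is.null(lab)) NA_character_ else lab
}

# Copy the labels into an ordinary table, one row per variable
label_table <- data.frame(
  variable = names(toy),
  label = vapply(toy, get_label, character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Round-trip both objects through CSV files in a temporary location
data_file <- tempfile(fileext = ".csv")
label_file <- tempfile(fileext = ".csv")
write.csv(toy, data_file, row.names = FALSE)
write.csv(label_table, label_file, row.names = FALSE)

toy_reloaded <- read.csv(data_file)
labels_reloaded <- read.csv(label_file, stringsAsFactors = FALSE)

# The attribute is gone after the round trip
print(attr(toy_reloaded$body_mass_g, "label"))

# The label text survives because it was stored as ordinary data
print(labels_reloaded$label)
```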

Note: Thinking Point

Compare these labels to your manual dictionary. Are they equally informative? What context might be missing? This is why the “automated” approach still requires your careful thought: you provide the content, and R handles the presentation.

Exporting Your Codebook

Save the codebook for sharing:

# Save as CSV file
write.csv(codebook_labeled, "data/penguins_codebook.csv", row.names = FALSE)

# Or save as a formatted text file
writeLines(capture.output(print(codebook_labeled)), "data/penguins_codebook.txt")

You can also include the codebook directly in your Quarto/R Markdown reports:

# In a Quarto document, the codebook will display as a formatted table
data_codebook(penguins)
Tip: Best Practice

Include codebook generation in your analysis scripts. This ensures documentation updates automatically when your data changes:

# At the end of your data cleaning script
codebook <- data_codebook(my_clean_data)
write.csv(codebook, "outputs/data_codebook.csv", row.names = FALSE)

When to Use Automated Dictionaries

Use the automated approach when:

  • You have more than ~20 variables
  • Data structure changes during collection/analysis
  • You need quick summaries of value distributions
  • Multiple team members need consistent documentation
  • You’re sharing data in repositories

Consider the manual approach when:

  • You have fewer than 10 variables with unique contexts
  • You need custom formatting for publication
  • Collaborators prefer traditional document formats
  • Complex measurement protocols require extensive explanation

Hybrid approach (recommended):

  1. Generate automated codebook first
  2. Review output for completeness
  3. Add manual annotations for complex variables
  4. Share both automated report and any custom notes
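One possible way to carry out step 3 is to keep manual annotations in their own small table and join them to the automated output by variable name. The sketch below assumes this layout; `auto_summary` is a hypothetical stand-in for an automated codebook (not the actual structure returned by data_codebook()), and the note text is a placeholder.

```r
# Stand-in for an automated summary (hypothetical; in practice this would come
# from a tool such as data_codebook())
auto_summary <- data.frame(
  variable = c("species", "bill_length_mm", "sex"),
  type = c("character", "numeric", "character"),
  stringsAsFactors = FALSE
)

# Manual annotations, written only for the variables that need them
manual_notes <- data.frame(
  variable = "bill_length_mm",
  note = "Placeholder: describe the measurement protocol here",
  stringsAsFactors = FALSE
)

# Left join: keep every automated row and attach notes where they exist
hybrid <- merge(auto_summary, manual_notes, by = "variable", all.x = TRUE)
print(hybrid)
```

Keeping the notes in a separate table means the automated part can be regenerated whenever the data changes without overwriting what you wrote by hand.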

Advantages for Open Science

Automated codebooks support reproducible research:

  • Version control: Include in Git repositories alongside data
  • Reproducibility: Others can regenerate with same inputs
  • Transparency: Shows exactly what’s documented and how
  • Efficiency: Update documentation as data evolves
  • Standards: Consistent format across projects
  • Export flexibility: Save as CSV, text, or include in reports
Important: Your Expertise Still Required

Automation doesn’t replace thinking! You still need to:

  • Add informative variable labels that explain what was measured
  • Provide context about measurement methods and instruments
  • Explain unusual coding or missing value patterns
  • Document units, valid ranges, and any data transformations
  • Consider what collaborators need to know to use your data correctly

R handles the repetitive formatting and summary statistics; you provide the scientific knowledge and context.

Next Steps

You now know how to create data dictionaries both manually and automatically. In the next section, you’ll use these documentation skills as the foundation for data validation, checking whether your actual data matches what your dictionary says it should be.
