Automated Creation with R

In the previous section, you created a data dictionary manually, carefully considering each variable’s meaning, valid values, and context. This thoughtful approach is essential, but for larger datasets or when documentation needs regular updates, R can automate much of this work while preserving your expertise.

What You’ll Learn

By the end of this section, you will:

Understand when to use automated vs. manual data dictionary creation
Use the {datawizard} package to generate codebooks automatically
Add descriptive labels to your data using R
Export codebooks as shareable CSV files
Know how to integrate automated documentation into your research workflow

From Manual to Automated

You can create data dictionaries in three ways:

Manually (previous section): Full control, good for small datasets
Automated with R (this section): Efficient for larger datasets, easy to update
Hybrid approach: Start with automated generation, then refine with your domain knowledge

Most research projects benefit from automation, especially when data evolves during collection or analysis.

The `{datawizard}` Package

The {datawizard} package provides a simple, reliable way to create data dictionaries automatically. The main function you’ll use is data_codebook(), which generates structured tables that include:

Variable names and types
Missing value counts
Value ranges for numeric variables
Frequency counts for categorical variables
Summary statistics

Creating Your First Automated Dictionary

Let’s create a codebook for the Palmer Penguins data.

Step 1: Load and Prepare Data

We’ll start by loading the clean penguins dataset. Make sure you’ve installed the {datawizard} package as described in the tutorial setup.

library(readr)
library(dplyr)
library(datawizard)

# Load the clean penguins data from CSV
penguins <- read_csv("data/penguins_clean.csv", show_col_types = FALSE)

Step 2: Generate the Codebook

The simplest way to create a codebook is to call data_codebook() on your data frame:

# Create the data codebook using datawizard's data_codebook() function
codebook_penguins <- data_codebook(penguins)

# View the codebook
print(codebook_penguins)

This single command generates a structured table with:

Variable ID and Name
Data Type (character, numeric, etc.)
Missing Values (count and percentage)
Values/Ranges (categories or numeric ranges)
Frequencies (N and percentages for each category)

The output is a data frame that can be exported to CSV, Excel, or included in R Markdown/Quarto documents.

Step 3: Add Variable Labels

You can add descriptive labels to make the codebook more informative. These labels are stored as attributes (extra information attached to each column).

What this code does:

structure(column, label = "description") attaches a label to a column without changing its data
mutate() replaces each column with a labeled version of itself (same data, just with a description attached)
The %>% pipe operator passes the data from one step to the next

Important: The number of columns stays the same, and the data values don’t change. You’re just adding descriptive information that data_codebook() can extract and display.
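To see what "attaching a label" means in isolation, here is a minimal base R sketch on a single vector. The vector name and its values are hypothetical stand-ins for one column of your data, not taken from the penguins file.

```r
# Toy numeric vector standing in for one data column (hypothetical values)
bill_length <- c(39.1, 39.5, 40.3)

# Attach a label; the values themselves are untouched
bill_length_labeled <- structure(bill_length, label = "Bill length (mm)")

# The label is stored as an attribute and can be read back with attr()
stored_label <- attr(bill_length_labeled, "label")
print(stored_label)

# Stripping the attributes shows the underlying data is identical
same_values <- identical(as.vector(bill_length_labeled), bill_length)
print(same_values)
```

The same idea applies column by column when structure() is used inside mutate().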

For Python Users

This is similar to storing metadata in the .attrs dictionary of a pandas Series or DataFrame. The %>% pipe operator works like method chaining in Python.

# Add variable labels using attributes
penguins_labeled <- penguins %>%
  mutate(
    species = structure(species, label = "Penguin species observed in Palmer Archipelago"),
    island = structure(island, label = "Island where penguin was observed"),
    bill_length_mm = structure(bill_length_mm, label = "Bill length from tip to base (mm)"),
    bill_depth_mm = structure(bill_depth_mm, label = "Bill depth at base (mm)"),
    flipper_length_mm = structure(flipper_length_mm, label = "Flipper length from body to tip (mm)"),
    body_mass_g = structure(body_mass_g, label = "Body mass measured in field (grams)"),
    sex = structure(sex, label = "Biological sex (male, female, or NA)"),
    year = structure(year, label = "Year of observation (2007-2009)")
  )

# Generate codebook with labels
codebook_labeled <- data_codebook(penguins_labeled)
print(codebook_labeled)
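The mutate() call above repeats structure() once per column. As one possible alternative, the sketch below keeps the labels in a named character vector and attaches them in a loop using base R only. The toy data frame and the name var_labels are hypothetical; swap in your own data and labels.

```r
# Toy data frame standing in for the penguins data (hypothetical values)
toy <- data.frame(
  species = c("Adelie", "Gentoo"),
  body_mass_g = c(3750, 5000)
)

# Named character vector: names are column names, values are labels
var_labels <- c(
  species = "Penguin species observed in Palmer Archipelago",
  body_mass_g = "Body mass measured in field (grams)"
)

# Attach each label to its matching column; data values are unchanged
for (col in names(var_labels)) {
  attr(toy[[col]], "label") <- var_labels[[col]]
}

print(attr(toy$species, "label"))
```

Keeping labels in one vector makes them easier to review and reuse, at the cost of being slightly less explicit than the mutate() version.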

How Labels Work

When you add labels with structure(), they’re attached to columns in R’s memory. When you create a codebook with data_codebook(), these labels get extracted and displayed in the output table.

Important note: If you save penguins_labeled as a CSV, the labels will be lost (CSV can only store data, not attributes). But the codebook itself preserves the labels as regular text in a “Label” column, so when you export the codebook, everything is saved properly.
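The loss of labels in a CSV round trip can be checked directly. This is a small sketch using a hypothetical two-row data frame and a temporary file, so it does not touch your project files.

```r
# Toy labeled data frame (hypothetical values)
toy <- data.frame(
  species = c("Adelie", "Gentoo"),
  body_mass_g = c(3750, 5000)
)
attr(toy$body_mass_g, "label") <- "Body mass measured in field (grams)"

# Write to a temporary CSV and read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)
toy_reloaded <- read.csv(csv_path)

# The label exists before the round trip but not after it
label_before <- attr(toy$body_mass_g, "label")
label_after <- attr(toy_reloaded$body_mass_g, "label")
print(label_before)
print(is.null(label_after))

# The data values themselves survive the round trip
values_match <- isTRUE(all.equal(
  toy$body_mass_g, toy_reloaded$body_mass_g,
  check.attributes = FALSE
))
print(values_match)
```

If you need the labels to travel with the data itself, an R-native format written with saveRDS() keeps attributes, though the file can then only be read from R.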

Thinking Point

Compare these labels to your manual dictionary from the previous section. Are they equally informative? What context might be missing? This is why the “automated” approach still requires your careful thought: you provide the content, and R handles the presentation.

Checkpoint: Verify Your Codebook

After running the code above, you should have:

A codebook_labeled object in R containing the formatted codebook
The codebook displayed in your console with all 8 variables
Labels visible in the output for each variable

If you don’t see the labels in the output, check that you ran the penguins_labeled code before generating the codebook.
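One way to check for missing labels before generating the codebook is to list the columns that have no label attribute. The sketch below uses a hypothetical toy data frame in which one column is deliberately left unlabeled; with your own data you would pass penguins_labeled instead.

```r
# Toy data frame standing in for penguins_labeled (hypothetical values)
toy_labeled <- data.frame(
  species = c("Adelie", "Gentoo"),
  year = c(2007, 2008)
)
attr(toy_labeled$species, "label") <- "Penguin species"
# 'year' is deliberately left without a label

# Collect the label attribute of every column (NULL where none is set)
labels <- lapply(toy_labeled, attr, which = "label")

# Names of the columns that still need a label
unlabeled <- names(labels)[vapply(labels, is.null, logical(1))]
print(unlabeled)
```

An empty result means every column carries a label.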

Saving Your Codebook

Now save the codebook to a CSV file so you can share it with collaborators:

# Save as CSV file
write.csv(codebook_labeled, "data/penguins_codebook.csv", row.names = FALSE)

Check that penguins_codebook.csv was created in your data folder. Open it in a spreadsheet program to see how the codebook looks as a shareable file.

Integrating Codebooks into Quarto Documents

One powerful approach is to include your codebook directly in a Quarto document. This creates living documentation that updates automatically when your data changes.

Exercise: Create a Living Data Dictionary

Create a Quarto document that combines your manual dictionary with the automated codebook.

Steps:

Create a new file called penguins_documentation.qmd in your project folder (the same location as your _quarto.yml and .Rproj files). Note that the data is in a subfolder called data.
Copy this template into the file:

---
title: "Palmer Penguins Data Dictionary"
format: html
---

## About This Dataset

This data dictionary documents the Palmer Penguins dataset used for [describe your project purpose].

## Manual Data Dictionary

Here's the manual dictionary I created for key variables:

[Paste your manual dictionary table from the previous section here]

## Automated Codebook

Below is the automatically generated codebook with all variables:

```{r}
#| echo: false
#| message: false

library(readr)
library(dplyr)
library(datawizard)

# Load data
penguins <- read_csv("data/penguins_clean.csv", show_col_types = FALSE)

# Add labels
penguins_labeled <- penguins %>%
  mutate(
    species = structure(species, label = "Penguin species observed in Palmer Archipelago"),
    island = structure(island, label = "Island where penguin was observed"),
    bill_length_mm = structure(bill_length_mm, label = "Bill length from tip to base (mm)"),
    bill_depth_mm = structure(bill_depth_mm, label = "Bill depth at base (mm)"),
    flipper_length_mm = structure(flipper_length_mm, label = "Flipper length from body to tip (mm)"),
    body_mass_g = structure(body_mass_g, label = "Body mass measured in field (grams)"),
    sex = structure(sex, label = "Biological sex (male, female, or NA)"),
    year = structure(year, label = "Year of observation (2007-2009)")
  )

# Generate and display codebook with narrower labels for better display
data_codebook(penguins_labeled, variable_label_width = 10, value_label_width = 10)
```

Render the document (click the “Render” button or run quarto render penguins_documentation.qmd)
View the HTML output. You now have a complete, shareable data dictionary!

Why this is powerful:

If your data changes, just re-render the document
The codebook automatically reflects the current data
You can share the HTML file with collaborators
Manual context + automated statistics in one place

Choosing Your Approach

Use automated codebooks when:

You have many variables
Data changes during collection/analysis
You need consistent documentation across a team
You’re sharing data in repositories

Use manual dictionaries when:

You have few variables with unique context
You need custom formatting for publication
Complex measurements require detailed explanations

Hybrid approach (recommended): Generate automated codebook first, then add manual annotations for complex variables.
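As a rough sketch of the hybrid approach, manual annotations can be kept in their own small table and joined to the automated codebook by variable name. Both data frames below are hypothetical stand-ins: the column names Name, Type, and Notes are assumptions for illustration, so check the actual column names of your codebook before adapting this.

```r
# Stand-in for an automated codebook (hypothetical columns and values)
auto_codebook <- data.frame(
  Name = c("species", "bill_length_mm", "body_mass_g"),
  Type = c("character", "numeric", "numeric")
)

# Manual annotations, written only for variables that need extra context
manual_notes <- data.frame(
  Name = "bill_length_mm",
  Notes = "Hypothetical note: describe the measurement protocol here"
)

# Look up each codebook variable in the notes table; the row order of the
# codebook is preserved, and variables without a note get NA
hybrid <- auto_codebook
hybrid$Notes <- manual_notes$Notes[match(hybrid$Name, manual_notes$Name)]

print(hybrid)
```

Using match() rather than merge() keeps the variables in their original order, which tends to be easier to read alongside the data.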

Remember

Automation doesn’t replace your expertise! You still provide the scientific knowledge (labels, context, measurement methods). R just handles the repetitive formatting and summary statistics.

Next Steps

You now know how to create data dictionaries both manually and automatically. In the next section, you’ll use these documentation skills as the foundation for data validation, checking whether your actual data matches what your dictionary says it should be.