Automated Creation with R

In the previous section, you created a data dictionary manually, carefully considering each variable’s meaning, valid values, and context. This thoughtful approach is essential, but for larger datasets or when documentation needs regular updates, R can automate much of this work while preserving your expertise.

Note: What You’ll Learn

By the end of this section, you will:

  • Understand when to use automated vs. manual data dictionary creation
  • Use the {datawizard} package to generate codebooks automatically
  • Add descriptive labels to your data using R
  • Export codebooks as shareable CSV files
  • Know how to integrate automated documentation into your research workflow

From Manual to Automated

You can create data dictionaries in three ways:

  1. Manually (previous section): Full control, good for small datasets
  2. Automated with R (this section): Efficient for larger datasets, easy to update
  3. Hybrid approach: Start with automated generation, then refine with your domain knowledge

Most research projects benefit from automation, especially when data evolves during collection or analysis.

The {datawizard} Package

The {datawizard} package provides a simple, reliable way to create data dictionaries automatically. The main function you’ll use is data_codebook(), which generates structured tables that include:

  • Variable names and types
  • Missing value counts
  • Value ranges for numeric variables
  • Frequency counts for categorical variables
  • Summary statistics

Creating Your First Automated Dictionary

Let’s create a codebook for the Palmer Penguins data.

Step 1: Load and Prepare Data

We’ll start by loading the clean penguins dataset. Make sure you’ve installed the {datawizard} package as described in the tutorial setup.

library(readr)
library(dplyr)
library(datawizard)

# Load the clean penguins data from CSV
penguins <- read_csv("data/penguins_clean.csv", show_col_types = FALSE)

Step 2: Generate the Codebook

The simplest way to create a codebook is to call data_codebook() on your data frame:

# Create the data codebook using datawizard's data_codebook() function
codebook_penguins <- data_codebook(penguins)

# View the codebook
print(codebook_penguins)

This single command generates a structured table with:

  • Variable ID and Name
  • Data Type (character, numeric, etc.)
  • Missing Values (count and percentage)
  • Values/Ranges (categories or numeric ranges)
  • Frequencies (N and percentages for each category)

The output is a data frame that can be exported to CSV, Excel, or included in R Markdown/Quarto documents.
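As a quick check that the result can be handled like any other data frame, here is a minimal sketch. It assumes {datawizard} is installed and uses R's built-in iris data (not the tutorial's penguins file) so that it runs without any CSV on disk; the object name cb is illustrative only.

```r
library(datawizard)

# Built-in dataset used as a stand-in so the example needs no external file
cb <- data_codebook(iris)

# The codebook should behave like a regular data frame
is.data.frame(cb)

# Inspect which columns the codebook contains
colnames(cb)
```

Because the codebook is a data frame, the usual tools for writing tables to disk can be applied to it.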

Step 3: Add Variable Labels

You can add descriptive labels to make the codebook more informative. These labels are stored as attributes (extra information attached to each column).

What this code does:

  • structure(column, label = "description") attaches a label to a column without changing its data
  • mutate() replaces each column with a labeled version of itself (same data, just with a description attached)
  • The %>% pipe operator passes the data from one step to the next

Important: The number of columns stays the same, and the data values don’t change. You’re just adding descriptive information that data_codebook() can extract and display.

This is similar to how pandas allows you to add .attrs to Series or DataFrames. The %>% pipe operator works like method chaining in Python.
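To see what structure() does in isolation, here is a minimal sketch on a standalone vector. The names x and x_labeled and the values are made up for illustration and are not part of the tutorial data.

```r
# Illustrative values only
x <- c(39.1, 39.5, 40.3)

# Attach a "label" attribute; the numbers themselves are untouched
x_labeled <- structure(x, label = "Bill length (mm)")

# Retrieve the label that was attached
attr(x_labeled, "label")

# The underlying values are still equal to the original vector
all(x_labeled == x)

# The original vector has no label
is.null(attr(x, "label"))
```

The same idea applies column by column inside mutate(): each column keeps its values and gains one extra attribute.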

# Add variable labels using attributes
penguins_labeled <- penguins %>%
  mutate(
    species = structure(species, label = "Penguin species observed in Palmer Archipelago"),
    island = structure(island, label = "Island where penguin was observed"),
    bill_length_mm = structure(bill_length_mm, label = "Bill length from tip to base (mm)"),
    bill_depth_mm = structure(bill_depth_mm, label = "Bill depth at base (mm)"),
    flipper_length_mm = structure(flipper_length_mm, label = "Flipper length from body to tip (mm)"),
    body_mass_g = structure(body_mass_g, label = "Body mass measured in field (grams)"),
    sex = structure(sex, label = "Biological sex (male, female, or NA)"),
    year = structure(year, label = "Year of observation (2007-2009)")
  )

# Generate codebook with labels
codebook_labeled <- data_codebook(penguins_labeled)
print(codebook_labeled)

Tip: How Labels Work

When you add labels with structure(), they’re attached to columns in R’s memory. When you create a codebook with data_codebook(), these labels get extracted and displayed in the output table.

Important note: If you save penguins_labeled as a CSV, the labels will be lost (CSV can only store data, not attributes). But the codebook itself preserves the labels as regular text in a “Label” column, so when you export the codebook, everything is saved properly.
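The loss of labels in a CSV round trip can be demonstrated with a small sketch. It uses a tiny made-up data frame and a temporary file rather than the tutorial's data; the names demo, tmp, and demo_reloaded are illustrative.

```r
# Small stand-in data frame with illustrative values
demo <- data.frame(bill_length_mm = c(39.1, 39.5))

# Attach a label (equivalent in effect to using structure() on the column)
attr(demo$bill_length_mm, "label") <- "Bill length from tip to base (mm)"

# Write to a temporary CSV and read it back
tmp <- tempfile(fileext = ".csv")
write.csv(demo, tmp, row.names = FALSE)
demo_reloaded <- read.csv(tmp)

# The label exists on the in-memory column...
attr(demo$bill_length_mm, "label")

# ...but is gone after the round trip through CSV
is.null(attr(demo_reloaded$bill_length_mm, "label"))

# Clean up the temporary file
unlink(tmp)
```

If you need labels to survive between sessions, one option is to keep the labeling code in your script and re-run it after loading the CSV.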

Note: Thinking Point

Compare these labels to your manual dictionary from the previous section. Are they equally informative? What context might be missing? This is why the “automated” approach still requires your careful thought: you provide the content, and R just handles the presentation.

Tip: Checkpoint: Verify Your Codebook

After running the code above, you should have:

  1. A codebook_labeled object in R containing the formatted codebook
  2. The codebook displayed in your console with all 8 variables
  3. Labels visible in the output for each variable

If you don’t see the labels in the output, check that you ran the penguins_labeled code before generating the codebook.
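One way to check which columns actually carry a label is sketched below. It runs on a small made-up data frame so it is self-contained; in your session you would apply the same vapply() call to penguins_labeled. The names demo and has_label are illustrative.

```r
# Stand-in data frame with illustrative values; only one column gets a label
demo <- data.frame(
  species = c("Adelie", "Gentoo"),
  body_mass_g = c(3750, 5000)
)
attr(demo$species, "label") <- "Penguin species"

# For each column, test whether a "label" attribute is present
has_label <- vapply(
  demo,
  function(col) !is.null(attr(col, "label")),
  logical(1)
)

# Named logical vector: TRUE where a label exists
has_label

# Names of columns that still need a label
names(has_label)[!has_label]
```

Any column listed by the last line would appear in the codebook without a description.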

Saving Your Codebook

Now save the codebook to a CSV file so you can share it with collaborators:

# Save as CSV file
write.csv(codebook_labeled, "data/penguins_codebook.csv", row.names = FALSE)

Check that penguins_codebook.csv was created in your data folder. Open it in a spreadsheet program to see how the codebook looks as a shareable file.
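You can also run this check from R instead of your file browser. The sketch below assumes the file path used in the write.csv() call above; the names codebook_path and codebook_check are illustrative.

```r
# Path used when saving the codebook
codebook_path <- "data/penguins_codebook.csv"

if (file.exists(codebook_path)) {
  # Read the exported codebook back in and show its structure
  codebook_check <- read.csv(codebook_path)
  str(codebook_check)
} else {
  # Point to the most likely cause if the file is missing
  message("File not found: ", codebook_path,
          " (check your working directory with getwd())")
}
```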

Integrating Codebooks into Quarto Documents

One powerful approach is to include your codebook directly in a Quarto document. This creates living documentation that updates automatically when your data changes.

Tip: Exercise: Create a Living Data Dictionary

Create a Quarto document that combines your manual dictionary with the automated codebook.

Steps:

  1. Create a new file called penguins_documentation.qmd in your project folder (the same location as your _quarto.yml and .Rproj files). Note that the data is in a subfolder called data.

  2. Copy this template into the file:

---
title: "Palmer Penguins Data Dictionary"
format: html
---

## About This Dataset

This data dictionary documents the Palmer Penguins dataset used for [describe your project purpose].

## Manual Data Dictionary

Here's the manual dictionary I created for key variables:

[Paste your manual dictionary table from the previous section here]

## Automated Codebook

Below is the automatically generated codebook with all variables:

```{r}
#| echo: false
#| message: false

library(readr)
library(dplyr)
library(datawizard)

# Load data
penguins <- read_csv("data/penguins_clean.csv", show_col_types = FALSE)

# Add labels
penguins_labeled <- penguins %>%
  mutate(
    species = structure(species, label = "Penguin species observed in Palmer Archipelago"),
    island = structure(island, label = "Island where penguin was observed"),
    bill_length_mm = structure(bill_length_mm, label = "Bill length from tip to base (mm)"),
    bill_depth_mm = structure(bill_depth_mm, label = "Bill depth at base (mm)"),
    flipper_length_mm = structure(flipper_length_mm, label = "Flipper length from body to tip (mm)"),
    body_mass_g = structure(body_mass_g, label = "Body mass measured in field (grams)"),
    sex = structure(sex, label = "Biological sex (male, female, or NA)"),
    year = structure(year, label = "Year of observation (2007-2009)")
  )

# Generate and display codebook with narrower labels for better display
data_codebook(penguins_labeled, variable_label_width = 10, value_label_width = 10)
```

  3. Render the document (click the “Render” button or run quarto render penguins_documentation.qmd)

  4. View the HTML output. You now have a complete, shareable data dictionary!

Why this is powerful:

  • If your data changes, just re-render the document
  • The codebook automatically reflects the current data
  • You can share the HTML file with collaborators
  • Manual context + automated statistics in one place

Choosing Your Approach

Use automated codebooks when:

  • You have many variables
  • Data changes during collection/analysis
  • You need consistent documentation across a team
  • You’re sharing data in repositories

Use manual dictionaries when:

  • You have few variables with unique context
  • You need custom formatting for publication
  • Complex measurements require detailed explanations

Hybrid approach (recommended): Generate an automated codebook first, then add manual annotations for complex variables.
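A minimal sketch of the hybrid idea, using base R only: a small stand-in table plays the role of the automated codebook, and a second table holds your manual notes, joined by variable name. Both tables, the Notes column, and the placeholder text are hypothetical; a real codebook from data_codebook() has more columns.

```r
# Stand-in for an automated codebook (hypothetical, simplified)
auto_codebook <- data.frame(
  Name = c("species", "bill_length_mm", "body_mass_g"),
  Type = c("character", "numeric", "numeric")
)

# Manual annotations for the variables that need extra context
manual_notes <- data.frame(
  Name = "bill_length_mm",
  Notes = "Placeholder: describe the measurement protocol here"
)

# Left join: keep every codebook row, add notes where they exist
hybrid_codebook <- merge(auto_codebook, manual_notes,
                         by = "Name", all.x = TRUE)

hybrid_codebook
```

Variables without a manual note get NA in the Notes column, which makes it easy to see what still needs documenting.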

Important: Remember

Automation doesn’t replace your expertise! You still provide the scientific knowledge (labels, context, measurement methods). R just handles the repetitive formatting and summary statistics.

Next Steps

You now know how to create data dictionaries both manually and automatically. In the next section, you’ll use these documentation skills as the foundation for data validation, checking whether your actual data matches what your dictionary says it should be.
