library(readr)
library(dplyr)
library(datawizard)
# Load the clean penguins data from CSV
penguins <- read_csv("data/penguins_clean.csv", show_col_types = FALSE)
Automated Creation with R
In the previous section, you created a data dictionary manually, carefully considering each variable’s meaning, valid values, and context. This thoughtful approach is essential, but for larger datasets or when documentation needs regular updates, R can automate much of this work while preserving your expertise.
By the end of this section, you will:
- Understand when to use automated vs. manual data dictionary creation
- Use the {datawizard} package to generate codebooks automatically
- Add descriptive labels to your data using R
- Export codebooks as shareable CSV files
- Know how to integrate automated documentation into your research workflow
From Manual to Automated
You can create data dictionaries in three ways:
- Manually (previous section): Full control, good for small datasets
- Automated with R (this section): Efficient for larger datasets, easy to update
- Hybrid approach: Start with automated generation, then refine with your domain knowledge
Most research projects benefit from automation, especially when data evolves during collection or analysis.
The {datawizard} Package
The {datawizard} package provides a simple, reliable way to create data dictionaries automatically. The main function you’ll use is data_codebook(), which generates structured tables that include:
- Variable names and types
- Missing value counts
- Value ranges for numeric variables
- Frequency counts for categorical variables
- Summary statistics
Creating Your First Automated Dictionary
Let’s create a codebook for the Palmer Penguins data.
Step 1: Load and Prepare Data
We’ll start by loading the clean penguins dataset. Make sure you’ve installed the {datawizard} package as described in the tutorial setup.
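Before loading the data, it can help to confirm that the packages are actually available. The following is a minimal sketch using only base R. The names `needed`, `installed`, and `missing_pkgs` are illustrative and not part of the tutorial setup.

```r
# Packages this section relies on
needed <- c("readr", "dplyr", "datawizard")

# requireNamespace() reports TRUE/FALSE without attaching the package
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages for this section are installed.")
}
```

If anything is reported as missing, install it with install.packages() before continuing.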
Step 2: Generate the Codebook
The simplest way to create a codebook is to call data_codebook() on your data frame:
# Create the data codebook using datawizard's data_codebook() function
codebook_penguins <- data_codebook(penguins)
# View the codebook
print(codebook_penguins)
This single command generates a structured table with:
- Variable ID and Name
- Data Type (character, numeric, etc.)
- Missing Values (count and percentage)
- Values/Ranges (categories or numeric ranges)
- Frequencies (N and percentages for each category)
The output is a data frame that can be exported to CSV, Excel, or included in R Markdown/Quarto documents.
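Because the codebook is a data frame, you can inspect it with the usual tools. The sketch below assumes {datawizard} is installed and uses a tiny hand-made data frame (`toy`, with illustrative values) in place of the penguins data so that it runs on its own.

```r
library(datawizard)

# Tiny stand-in for the penguins data (illustrative values only)
toy <- data.frame(
  species = c("Adelie", "Gentoo", "Adelie", "Chinstrap"),
  body_mass_g = c(3750, 5000, NA, 3800)
)

toy_codebook <- data_codebook(toy)

# The result inherits from data.frame, so base R inspection works
is.data.frame(toy_codebook)
names(toy_codebook)
nrow(toy_codebook)
```

Looking at names() first is a quick way to see which columns will end up in an exported file.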
Step 3: Add Variable Labels
You can add descriptive labels to make the codebook more informative. These labels are stored as attributes (extra information attached to each column).
What this code does:
- structure(column, label = "description") attaches a label to a column without changing its data
- mutate() replaces each column with a labeled version of itself (same data, just with a description attached)
- The %>% pipe operator passes the data from one step to the next
Important: The number of columns stays the same, and the data values don’t change. You’re just adding descriptive information that data_codebook() can extract and display.
This is similar to how pandas allows you to add .attrs to Series or DataFrames. The %>% pipe operator works like method chaining in Python.
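To see what a label attribute is in isolation, here is a minimal base-R sketch on a single vector. The values are illustrative and `labeled` is a name chosen for this example.

```r
# A plain numeric vector (illustrative values)
bill_length_mm <- c(39.1, 39.5, 40.3)

# structure() returns the same values with a "label" attribute attached
labeled <- structure(bill_length_mm, label = "Bill length from tip to base (mm)")

# Read the label back
attr(labeled, "label")

# The underlying values are unchanged once attributes are stripped
identical(as.vector(labeled), bill_length_mm)
```

The same idea applies column by column inside mutate(): the data stays the same and only the attached description is new.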
# Add variable labels using attributes
penguins_labeled <- penguins %>%
  mutate(
    species = structure(species, label = "Penguin species observed in Palmer Archipelago"),
    island = structure(island, label = "Island where penguin was observed"),
    bill_length_mm = structure(bill_length_mm, label = "Bill length from tip to base (mm)"),
    bill_depth_mm = structure(bill_depth_mm, label = "Bill depth at base (mm)"),
    flipper_length_mm = structure(flipper_length_mm, label = "Flipper length from body to tip (mm)"),
    body_mass_g = structure(body_mass_g, label = "Body mass measured in field (grams)"),
    sex = structure(sex, label = "Biological sex (male, female, or NA)"),
    year = structure(year, label = "Year of observation (2007-2009)")
  )
# Generate codebook with labels
codebook_labeled <- data_codebook(penguins_labeled)
print(codebook_labeled)
When you add labels with structure(), they’re attached to columns in R’s memory. When you create a codebook with data_codebook(), these labels get extracted and displayed in the output table.
Important note: If you save penguins_labeled as a CSV, the labels will be lost (CSV can only store data, not attributes). But the codebook itself preserves the labels as regular text in a “Label” column, so when you export the codebook, everything is saved properly.
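The loss of labels in a CSV round trip can be checked directly. This is a small base-R sketch with a one-column data frame and a temporary file. The values and the names `df`, `tmp`, and `df_back` are illustrative.

```r
# One labeled column (illustrative values)
df <- data.frame(body_mass_g = c(3750, 3800, 3250))
attr(df$body_mass_g, "label") <- "Body mass measured in field (grams)"

# Write to a temporary CSV and read it back
tmp <- tempfile(fileext = ".csv")
write.csv(df, tmp, row.names = FALSE)
df_back <- read.csv(tmp)

# The label is present before the round trip and gone afterwards
attr(df$body_mass_g, "label")
is.null(attr(df_back$body_mass_g, "label"))
```

This is why the labels belong in the codebook (or in a format that stores attributes) rather than only on the in-memory data frame.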
Compare these labels to your manual dictionary from the previous section. Are they equally informative? What context might be missing? This is why the “automated” approach still requires your careful thought: you provide the content, and R handles the presentation.
Saving Your Codebook
Now save the codebook to a CSV file so you can share it with collaborators:
# Save as CSV file
write.csv(codebook_labeled, "data/penguins_codebook.csv", row.names = FALSE)
Check that penguins_codebook.csv was created in your data folder. Open it in a spreadsheet program to see how the codebook looks as a shareable file.
Integrating Codebooks into Quarto Documents
One powerful approach is to include your codebook directly in a Quarto document. This creates living documentation that updates automatically when your data changes.
Create a Quarto document that combines your manual dictionary with the automated codebook.
Steps:
Create a new file called penguins_documentation.qmd in your project folder (the same location as your _quarto.yml and .Rproj file). Note that the data is in a subfolder called data.
Copy this template into the file:
---
title: "Palmer Penguins Data Dictionary"
format: html
---
## About This Dataset
This data dictionary documents the Palmer Penguins dataset used for [describe your project purpose].
## Manual Data Dictionary
Here's the manual dictionary I created for key variables:
[Paste your manual dictionary table from the previous section here]
## Automated Codebook
Below is the automatically generated codebook with all variables:
```{r}
#| echo: false
#| message: false
library(readr)
library(dplyr)
library(datawizard)
# Load data
penguins <- read_csv("data/penguins_clean.csv", show_col_types = FALSE)
# Add labels
penguins_labeled <- penguins %>%
  mutate(
    species = structure(species, label = "Penguin species observed in Palmer Archipelago"),
    island = structure(island, label = "Island where penguin was observed"),
    bill_length_mm = structure(bill_length_mm, label = "Bill length from tip to base (mm)"),
    bill_depth_mm = structure(bill_depth_mm, label = "Bill depth at base (mm)"),
    flipper_length_mm = structure(flipper_length_mm, label = "Flipper length from body to tip (mm)"),
    body_mass_g = structure(body_mass_g, label = "Body mass measured in field (grams)"),
    sex = structure(sex, label = "Biological sex (male, female, or NA)"),
    year = structure(year, label = "Year of observation (2007-2009)")
  )
# Generate and display codebook with narrower labels for better display
data_codebook(penguins_labeled, variable_label_width = 10, value_label_width = 10)
```
Render the document (click the “Render” button or run quarto render penguins_documentation.qmd).
View the HTML output. You now have a complete, shareable data dictionary!
Why this is powerful:
- If your data changes, just re-render the document
- The codebook automatically reflects the current data
- You can share the HTML file with collaborators
- Manual context + automated statistics in one place
Choosing Your Approach
Use automated codebooks when:
- You have many variables
- Data changes during collection/analysis
- You need consistent documentation across a team
- You’re sharing data in repositories
Use manual dictionaries when:
- You have few variables with unique context
- You need custom formatting for publication
- Complex measurements require detailed explanations
Hybrid approach (recommended): Generate automated codebook first, then add manual annotations for complex variables.
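One way the hybrid approach might look in practice is to keep manual notes in a small table and attach them to the automated codebook by variable name. The sketch below uses base R only. `auto_codebook` is a hand-built stand-in (its columns and counts are illustrative, not real data_codebook() output), and `manual_notes` and `Notes` are hypothetical names.

```r
# Hand-built stand-in for an automated codebook (illustrative values)
auto_codebook <- data.frame(
  Name = c("species", "bill_length_mm", "body_mass_g"),
  Type = c("character", "numeric", "numeric"),
  Missings = c(0, 2, 2)
)

# Manual annotations, written only for the variables that need them
manual_notes <- data.frame(
  Name = c("bill_length_mm", "body_mass_g"),
  Notes = c("Describe the measurement protocol here",
            "Describe the weighing conditions here")
)

# Attach notes by variable name; variables without a note get NA
hybrid <- auto_codebook
hybrid$Notes <- manual_notes$Notes[match(hybrid$Name, manual_notes$Name)]

print(hybrid)
```

Using match() keeps the codebook's row order intact, so the annotated table can be exported in the same way as the automated one.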
Automation doesn’t replace your expertise! You still provide the scientific knowledge (labels, context, measurement methods). R just handles the repetitive formatting and summary statistics.
Next Steps
You now know how to create data dictionaries both manually and automatically. In the next section, you’ll use these documentation skills as the foundation for data validation, checking whether your actual data matches what your dictionary says it should be.