```r
library(readr)
penguins_clean <- read_csv("data/penguins_clean.csv", show_col_types = FALSE)
head(penguins_clean)
```

Concepts & Manual Creation
Without proper documentation, datasets become nearly impossible for others (or future you) to understand and use. In this hands-on section, you’ll learn to create data dictionaries manually and practice with real research data.
By the end of this section, you will:
- Understand what a data dictionary is and why it matters for open science
- Know the essential components every data dictionary should include
- Be able to create a data dictionary manually using spreadsheets, Word, or Markdown
- Have created a partial data dictionary for the Palmer Penguins dataset
What is a Data Dictionary?
A data dictionary is a structured document that provides comprehensive information about each variable in your dataset. Think of it as the “instruction manual” for your data: it tells anyone (including future you) exactly what each column means, how it was measured, and how to interpret the values.
Before we continue, reflect on your most recent research project. How would a colleague understand your data if you shared it with them today? What questions would they need to ask you?
Essential Components
Every data dictionary should include:
- Variable Name: The exact column name in your data file (e.g., bill_length_mm)
- Variable Label: A human-readable name (e.g., “Bill Length”)
- Description: Clear definition of what was measured (e.g., “Length of the penguin’s bill measured in millimeters”)
- Data Type: The type of data stored (numeric, character, categorical, date)
- Units: For numeric variables, the unit of measurement (mm, grams, years)
- Valid Values: Expected range or categories (e.g., “0-100” or “male, female”)
- Missing Values: How missing data is coded (NA, -999, blank)
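These components map naturally onto a table with one row per variable. As a minimal sketch (the column names and values here are illustrative, not a required format), a single dictionary entry can be stored as a data frame row in R:

```r
# One data-dictionary entry as a data frame row; the seven columns mirror
# the essential components listed above. Values are illustrative.
dict_entry <- data.frame(
  variable_name = "bill_length_mm",
  label         = "Bill Length",
  description   = "Length of the penguin's bill measured in millimeters",
  type          = "numeric",
  units         = "mm",
  valid_values  = "32.1-59.6",
  missing_code  = "NA",
  stringsAsFactors = FALSE
)
dict_entry$variable_name  # "bill_length_mm"
```

Keeping the dictionary in this tabular shape makes it easy to export to CSV or render as a table later.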
Why Data Dictionaries Matter
Preventing Critical Errors
Without documentation, researchers make dangerous assumptions. Is score a percentage (0-100) or raw points (0-50)? Is treatment coded as 1/0 or A/B? These ambiguities lead to analysis errors that can invalidate entire studies.
A research team spent months analyzing survey data where satisfaction was coded 1-5. They treated 5 as “very satisfied.” Only later did they discover that 1 was actually the highest satisfaction score. Their conclusions were completely backwards! This type of error with reverse-coded items is well-documented in research and can significantly impact study results (Hughes 2009).
Enabling Collaboration
When a new team member joins your project, a data dictionary gets them productive immediately. They can understand and use your data without hours of explanation or guesswork.
Supporting FAIR Principles
FAIR data must be Findable, Accessible, Interoperable, and Reusable (Wilkinson et al. 2016). A quality data dictionary directly supports all four principles by making your data understandable and usable by others.
For more on FAIR principles and research data management concepts, check out our FAIR Data Management Tutorial.
A Quick Example
Here’s undocumented data on study habits and academic performance:
| student_id | study_hrs | test_score | major | satisfaction | attend |
|---|---|---|---|---|---|
| 001 | 15 | 85 | PSYC | 4 | Y |
| 002 | 8 | 72 | BIOL | 3 | N |
| 003 | 22 | 94 | PSYC | 5 | Y |
| 004 | NA | 68 | MATH | 2 | Y |
To make sense of this data and avoid misinterpretation, we need answers to several questions:
- What time period does study_hrs cover? Per week? Per month?
- Is test_score out of 100? What test was this?
- What does the satisfaction scale mean? Is 1 low or high?
- What do Y/N in attend represent?
- Why is there an NA in study_hrs for student 004?
With a data dictionary, these questions disappear. Here’s what documentation looks like:
| Variable Name | Label | Description | Type | Units | Valid Values | Missing |
|---|---|---|---|---|---|---|
| student_id | Student ID | Unique identifier for each participant | character | none | 001-999 | none |
| study_hrs | Weekly Study Hours | Self-reported hours spent studying per week | numeric | hours | 0-168 | NA |
| test_score | Midterm Score | Score on standardized midterm exam | numeric | points | 0-100 | none |
| major | Academic Major | Student’s declared major field | categorical | none | PSYC, BIOL, MATH, CHEM | none |
| satisfaction | Course Satisfaction | Rating of course satisfaction (5-point scale) | numeric | 1-5 scale | 1=very unsatisfied, 5=very satisfied | none |
| attend | Lecture Attendance | Regular attendance at lectures (>80%) | categorical | none | Y=yes, N=no | none |
This documentation clarifies everything at a glance, preventing misinterpretation and errors. Now we know that study_hrs are weekly hours, test_score is out of 100, and satisfaction is on a 1-5 scale. This makes analysis straightforward and reliable.
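A practical payoff is that the documented rules can be turned into automated checks. A minimal sketch (using a few rows of the example data above; the rules come straight from the dictionary table):

```r
# Re-create a few rows of the example study-habits data.
students <- data.frame(
  student_id   = c("001", "002", "003", "004"),
  study_hrs    = c(15, 8, 22, NA),
  test_score   = c(85, 72, 94, 68),
  satisfaction = c(4, 3, 5, 2),
  stringsAsFactors = FALSE
)

# Dictionary says test_score is 0-100: flag out-of-range values.
out_of_range <- students$test_score < 0 | students$test_score > 100
sum(out_of_range)  # 0: all scores fall within the documented range

# Dictionary says satisfaction is a 1-5 scale.
all(students$satisfaction %in% 1:5)  # TRUE
```

Checks like these catch coding errors (such as the reverse-coded satisfaction scale mentioned earlier) before they contaminate an analysis.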
Choosing Your Documentation Tool
Before creating your data dictionary, select the tool that best fits your workflow. Here are three common approaches:
Spreadsheet Software (Excel, Google Sheets, LibreOffice)
Best for: Most research teams, easy sharing and collaboration
Create your dictionary table directly in a spreadsheet with columns for each essential component. Save as CSV for better compatibility with R and other analysis software.
Pros: Familiar interface, collaborative editing, easy to share
Cons: Can be tricky to track changes over time
Markdown Tables (in .md or .qmd files)
Best for: Projects using Quarto or reproducible workflows
Use plain text tables that render beautifully in Quarto documents. These are great for keeping your documentation in the same file as your analysis code.
Template:
| Variable Name | Label | Description | Type | Units | Valid Values | Missing |
|---------------|-------|-------------|------|-------|--------------|---------|
| variable1 | | | | | | |
| variable2 | | | | | | |
Pros: Works seamlessly with Quarto, readable as plain text, keeps documentation with code
Cons: Less intuitive for non-technical collaborators
Word Processors (Word, Google Docs)
Best for: Teams preferring familiar document formats
Create a simple table with the essential columns, similar to spreadsheets but with more formatting options.
| Variable Name | Label | Description | Type | Units | Valid Values | Missing |
|---------------|-------|-------------|------|-------|--------------|---------|
| species | | | | | | |
| bill_length_mm| | | | | | |
| sex | | | | | | |
| year | | | | | | |
Pros: Rich formatting, familiar to most researchers
Cons: Harder to integrate with R analysis code
Since you’re learning with Quarto, try creating your data dictionary as a markdown table in a .qmd file. This will help you practice Quarto syntax and see how documentation integrates with your analysis code.
Exercise: Document the Palmer Penguins Data
Now apply what you’ve learned to the Palmer Penguins dataset from the previous section.
Step 1: Load and Examine the Data
Examine the data to see what you’re working with: either open the CSV in a text editor, or inspect it in R.
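For inspection in R, `str()` shows each column’s name and type, and `colSums(is.na(...))` counts missing values per column. The sketch below runs these commands on a small stand-in data frame with the same four columns (so it works without the CSV); the identical commands apply to penguins_clean once it is loaded, as in the setup chunk at the top of this page:

```r
# Stand-in data frame mirroring the four penguins_clean columns we will
# document. Run the same two commands on penguins_clean itself.
example_df <- data.frame(
  species        = c("Adelie", "Gentoo"),
  bill_length_mm = c(39.1, 47.5),
  sex            = c("male", NA),
  year           = c(2007L, 2009L),
  stringsAsFactors = FALSE
)
str(example_df)             # column names and types
colSums(is.na(example_df))  # missing values per column
```

The output of these two commands gives you most of what you need for the Type, Valid Values, and Missing columns of your dictionary.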
Step 2: Choose Your Documentation Tool
Pick one:
- Spreadsheet (Excel, Google Sheets): Open a new spreadsheet file
- Markdown (recommended for this tutorial): Create a new file data/penguins_clean_dictionary.qmd in your project folder alongside your data
- Word processor: Open a new Word/Google Docs document
Step 3: Create Your Data Dictionary
Create a data dictionary for these 4 variables: species, bill_length_mm, sex, year
For each variable, you’ll document:
- Label: Human-readable name
- Description: What does this measure or represent?
- Type: categorical or numeric?
- Units: What are the units (if numeric)?
- Valid Values: What values are valid/expected?
- Missing: How are missing data coded?
If you chose a spreadsheet:
- Open a new spreadsheet
- In row 1, create these 7 column headers:
- Variable Name | Label | Description | Type | Units | Valid Values | Missing
- In rows 2-5, add these variable names in the first column:
- species
- bill_length_mm
- sex
- year
- Fill in the remaining cells for each variable
If you chose a word processor:
- Open a new document
- Insert a table with 7 columns and 5 rows (Table → Insert Table → 7x5)
- In row 1, add the column headers:
- Variable Name | Label | Description | Type | Units | Valid Values | Missing
- In rows 2-5, add these variable names in the first column:
- species
- bill_length_mm
- sex
- year
- Fill in the remaining cells for each variable
If you chose Markdown (recommended):
- Create a new Quarto file named penguins_clean_dictionary.qmd and save it in the data folder of your project.
- Copy and paste this markdown table into the file:
| Variable Name | Label | Description | Type | Units | Valid Values | Missing |
|----------------|-------|-------------|------|-------|--------------|---------|
| species | | | | | | |
| bill_length_mm | | | | | | |
| sex | | | | | | |
| year | | | | | | |

- Fill in the empty cells for each variable
Complete this exercise before looking at the solution below!
| Variable Name | Label | Description | Type | Units | Valid Values | Missing |
|---|---|---|---|---|---|---|
| species | Penguin Species | Species of penguin observed | categorical | none | Adelie, Chinstrap, Gentoo | none |
| bill_length_mm | Bill Length | Length of the penguin’s bill (culmen) from tip to base | numeric | millimeters | 32.1-59.6 | NA |
| sex | Penguin Sex | Biological sex of the penguin | categorical | none | male, female | NA |
| year | Observation Year | Year the observation was recorded | numeric | year | 2007, 2008, 2009 | none |
Additional context:
- Bill length represents the culmen length (the dorsal ridge of the bill)
- Missing values (NA) in bill_length_mm and sex occur when measurements couldn’t be obtained
- Year is numeric but only 2007-2009 are valid values
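The Valid Values column of the solution can also be expressed as executable checks. A minimal sketch (the observation rows below are illustrative, not the full dataset; the rules come from the dictionary table above):

```r
# A few illustrative observations with the documented columns.
obs <- data.frame(
  species        = c("Adelie", "Chinstrap", "Gentoo"),
  bill_length_mm = c(39.1, 48.7, 47.5),
  year           = c(2007L, 2008L, 2009L),
  stringsAsFactors = FALSE
)

# Each check encodes one Valid Values rule from the dictionary.
all(obs$species %in% c("Adelie", "Chinstrap", "Gentoo"))        # TRUE
all(obs$bill_length_mm >= 32.1 & obs$bill_length_mm <= 59.6)    # TRUE
all(obs$year %in% 2007:2009)                                    # TRUE
```

Running such checks whenever the data is updated keeps the dictionary and the dataset from drifting apart.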
When Manual Creation is Most Effective
Manual creation works optimally when:
- You have few variables
- Your data structure remains stable over time
- You need detailed, contextual descriptions
- You’re collaborating with non-technical team members
- You require precise control over documentation details
For larger datasets, manual creation can become tedious. The next section covers automated approaches using R packages, which can generate data dictionaries instantly and update them automatically when your data changes.