```r
library(readr)
penguins_clean <- read_csv("data/penguins_clean.csv", show_col_types = FALSE)
head(penguins_clean)
```

Concepts & Manual Creation
Without proper documentation, datasets become nearly impossible for others (or future you) to understand and use. In this hands-on section, you’ll learn to create data dictionaries manually and practice with real research data.
By the end of this section, you will:
- Understand what a data dictionary is and why it matters for open science
- Know the essential components every data dictionary should include
- Be able to create a data dictionary manually using spreadsheets, Word, or Markdown
- Have created a partial data dictionary for the Palmer Penguins dataset
What is a Data Dictionary?
A data dictionary is a structured document that provides comprehensive information about each variable in your dataset. Think of it as the “instruction manual” for your data: it tells anyone (including future you) exactly what each column means, how it was measured, and how to interpret the values.
Before we continue, reflect on your most recent research project. How would a colleague understand your data if you shared it with them today? What questions would they need to ask you?
Essential Components
Every data dictionary should include:
- Variable Name: The exact column name in your data file (e.g., bill_length_mm)
- Variable Label: A human-readable name (e.g., “Bill Length”)
- Description: Clear definition of what was measured (e.g., “Length of the penguin’s bill measured in millimeters”)
- Data Type: The type of data stored (numeric, character, categorical, date)
- Units: For numeric variables, the unit of measurement (mm, grams, years)
- Valid Values: Expected range or categories (e.g., “0-100” or “male, female”)
- Missing Values: How missing data is coded (NA, -999, blank)
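These components map naturally onto a table with one row per variable. As a minimal sketch (the column names and values here are illustrative, not a required format), a single dictionary entry can be stored as a data frame row in R:

```r
# One data-dictionary entry as a data frame row; the seven columns mirror
# the essential components listed above. Values are illustrative.
dict_entry <- data.frame(
  variable_name = "bill_length_mm",
  label         = "Bill Length",
  description   = "Length of the penguin's bill measured in millimeters",
  type          = "numeric",
  units         = "mm",
  valid_values  = "32.1-59.6",
  missing_code  = "NA",
  stringsAsFactors = FALSE
)
dict_entry$variable_name  # "bill_length_mm"
```

Keeping the dictionary in this tabular shape makes it easy to export to CSV or render as a table later.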
Why Data Dictionaries Matter
Preventing Critical Errors
Without documentation, researchers make dangerous assumptions. Is score a percentage (0-100) or raw points (0-50)? Is treatment coded as 1/0 or A/B? These ambiguities lead to analysis errors that can invalidate entire studies.
A research team spent months analyzing survey data where satisfaction was coded 1-5. They treated 5 as “very satisfied.” Only later did they discover that 1 was actually the highest satisfaction score. Their conclusions were completely backwards! This type of error with reverse-coded items is well-documented in research and can significantly impact study results (Hughes 2009).
Enabling Collaboration
When a new team member joins your project, a data dictionary gets them productive immediately. They can understand and use your data without hours of explanation or guesswork.
Supporting FAIR Principles
FAIR data must be Findable, Accessible, Interoperable, and Reusable (Wilkinson et al. 2016). A quality data dictionary directly supports all four principles by making your data understandable and usable by others.
For more on FAIR principles and research data management concepts, check out our FAIR Data Management Tutorial.
A Quick Example
Here’s undocumented data on study habits and academic performance:
| student_id | study_hrs | test_score | major | satisfaction | attend |
|---|---|---|---|---|---|
| 001 | 15 | 85 | PSYC | 4 | Y |
| 002 | 8 | 72 | BIOL | 3 | N |
| 003 | 22 | 94 | PSYC | 5 | Y |
| 004 | NA | 68 | MATH | 2 | Y |
To make sense of this data and avoid misinterpretation, we need answers to several questions:
- What time period does study_hrs cover? Per week? Per month?
- Is test_score out of 100? What test was this?
- What does the satisfaction scale mean? Is 1 low or high?
- What do Y/N in attend represent?
- Why is there an NA in study_hrs for student 004?
With a data dictionary, these questions disappear. Here’s what documentation looks like:
| Variable Name | Label | Description | Type | Units | Valid Values | Missing |
|---|---|---|---|---|---|---|
| student_id | Student ID | Unique identifier for each participant | character | none | 001-999 | none |
| study_hrs | Weekly Study Hours | Self-reported hours spent studying per week | numeric | hours | 0-168 | NA |
| test_score | Midterm Score | Score on standardized midterm exam | numeric | points | 0-100 | none |
| major | Academic Major | Student’s declared major field | categorical | none | PSYC, BIOL, MATH, CHEM | none |
| satisfaction | Course Satisfaction | Rating of course satisfaction (5-point scale) | numeric | 1-5 scale | 1=very unsatisfied, 5=very satisfied | none |
| attend | Lecture Attendance | Regular attendance at lectures (>80%) | categorical | none | Y=yes, N=no | none |
This documentation clarifies everything at a glance, preventing misinterpretation and errors. Now we know that study_hrs are weekly hours, test_score is out of 100, and satisfaction is on a 1-5 scale. This makes analysis straightforward and reliable.
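A practical payoff is that the documented rules can be turned into automated checks. A minimal sketch (using a few rows of the example data above; the rules come straight from the dictionary table):

```r
# Re-create a few rows of the example study-habits data.
students <- data.frame(
  student_id   = c("001", "002", "003", "004"),
  study_hrs    = c(15, 8, 22, NA),
  test_score   = c(85, 72, 94, 68),
  satisfaction = c(4, 3, 5, 2),
  stringsAsFactors = FALSE
)

# Dictionary says test_score is 0-100: flag out-of-range values.
out_of_range <- students$test_score < 0 | students$test_score > 100
sum(out_of_range)  # 0: all scores fall within the documented range

# Dictionary says satisfaction is a 1-5 scale.
all(students$satisfaction %in% 1:5)  # TRUE
```

Checks like these catch coding errors (such as the reverse-coded satisfaction scale mentioned earlier) before they contaminate an analysis.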
Choosing Your Documentation Tool
Before creating your data dictionary, select the tool that best fits your workflow. Here are three common approaches:
Spreadsheet Software (Excel, Google Sheets, LibreOffice)
Best for: Most research teams, easy sharing and collaboration
Create your dictionary table directly in a spreadsheet with columns for each essential component. Save as CSV for better compatibility with R and other analysis software.
Pros: Familiar interface, collaborative editing, easy to share
Cons: Can be tricky to track changes over time
Markdown Tables (in .md or .qmd files)
Best for: Projects using Quarto or reproducible workflows
Use plain text tables that render beautifully in Quarto documents. These are great for keeping your documentation in the same file as your analysis code.
Template:
| Variable Name | Label | Description | Type | Units | Valid Values | Missing |
|---------------|-------|-------------|------|-------|--------------|---------|
| variable1 | | | | | | |
| variable2 | | | | | | |
Pros: Works seamlessly with Quarto, readable as plain text, keeps documentation with code
Cons: Less intuitive for non-technical collaborators
Word Processors (Word, Google Docs)
Best for: Teams preferring familiar document formats
Create a simple table with the essential columns, similar to spreadsheets but with more formatting options.
| Variable Name | Label | Description | Type | Units | Valid Values | Missing |
|---------------|-------|-------------|------|-------|--------------|---------|
| species | | | | | | |
| bill_length_mm| | | | | | |
| sex | | | | | | |
| year | | | | | | |
Pros: Rich formatting, familiar to most researchers
Cons: Harder to integrate with R analysis code
Since you’re learning with Quarto, try creating your data dictionary as a markdown table in a .qmd file. This will help you practice Quarto syntax and see how documentation integrates with your analysis code.
Exercise: Document the Palmer Penguins Data
Now apply what you’ve learned to the Palmer Penguins dataset from the previous section.
Step 1: Load and Examine the Data
Examine the data to see what you’re working with: either open the CSV in a text editor, or inspect it in R.
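For inspection in R, `str()` shows each column’s name and type, and `colSums(is.na(...))` counts missing values per column. The sketch below runs these commands on a small stand-in data frame with the same four columns (so it works without the CSV); the identical commands apply to penguins_clean once it is loaded, as in the setup chunk at the top of this page:

```r
# Stand-in data frame mirroring the four penguins_clean columns we will
# document. Run the same two commands on penguins_clean itself.
example_df <- data.frame(
  species        = c("Adelie", "Gentoo"),
  bill_length_mm = c(39.1, 47.5),
  sex            = c("male", NA),
  year           = c(2007L, 2009L),
  stringsAsFactors = FALSE
)
str(example_df)             # column names and types
colSums(is.na(example_df))  # missing values per column
```

The output of these two commands gives you most of what you need for the Type, Valid Values, and Missing columns of your dictionary.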
Step 2: Choose Your Documentation Tool
Pick one:
- Spreadsheet (Excel, Google Sheets): Open a new spreadsheet file
- Markdown (recommended for this tutorial): Create a new file data/penguins_clean_dictionary.qmd in your project folder alongside your data
- Word processor: Open a new Word/Google Docs document
Step 3: Create Your Data Dictionary
Create a data dictionary for these 4 variables: species, bill_length_mm, sex, year
For each variable, you’ll document:
- Label: Human-readable name
- Description: What does this measure or represent?
- Type: categorical or numeric?
- Units: What are the units (if numeric)?
- Valid Values: What values are valid/expected?
- Missing: How are missing data coded?
If you chose a spreadsheet:
- Open a new spreadsheet
- In row 1, create these 7 column headers:
- Variable Name | Label | Description | Type | Units | Valid Values | Missing
- In rows 2-5, add these variable names in the first column:
- species
- bill_length_mm
- sex
- year
- Fill in the remaining cells for each variable
If you chose a word processor:
- Open a new document
- Insert a table with 7 columns and 5 rows (Table → Insert Table → 7x5)
- In row 1, add the column headers:
- Variable Name | Label | Description | Type | Units | Valid Values | Missing
- In rows 2-5, add these variable names in the first column:
- species
- bill_length_mm
- sex
- year
- Fill in the remaining cells for each variable
If you chose Markdown (recommended):
- Create a new Quarto file named penguins_clean_dictionary.qmd and save it in the data folder of your project.
- Copy and paste this markdown table into the file:
| Variable Name | Label | Description | Type | Units | Valid Values | Missing |
|----------------|-------|-------------|------|-------|--------------|---------|
| species | | | | | | |
| bill_length_mm | | | | | | |
| sex | | | | | | |
| year | | | | | | |

- Fill in the empty cells for each variable
Complete this exercise before looking at the solution below!
| Variable Name | Label | Description | Type | Units | Valid Values | Missing |
|---|---|---|---|---|---|---|
| species | Penguin Species | Species of penguin observed | categorical | none | Adelie, Chinstrap, Gentoo | none |
| bill_length_mm | Bill Length | Length of the penguin’s bill (culmen) from tip to base | numeric | millimeters | 32.1-59.6 | NA |
| sex | Penguin Sex | Biological sex of the penguin | categorical | none | male, female | NA |
| year | Observation Year | Year the observation was recorded | numeric | year | 2007, 2008, 2009 | none |
Additional context:
- Bill length represents the culmen length (the dorsal ridge of the bill)
- Missing values (NA) in bill_length_mm and sex occur when measurements couldn’t be obtained
- Year is numeric but only 2007-2009 are valid values
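The Valid Values column of the solution can also be expressed as executable checks. A minimal sketch (the observation rows below are illustrative, not the full dataset; the rules come from the dictionary table above):

```r
# A few illustrative observations with the documented columns.
obs <- data.frame(
  species        = c("Adelie", "Chinstrap", "Gentoo"),
  bill_length_mm = c(39.1, 48.7, 47.5),
  year           = c(2007L, 2008L, 2009L),
  stringsAsFactors = FALSE
)

# Each check encodes one Valid Values rule from the dictionary.
all(obs$species %in% c("Adelie", "Chinstrap", "Gentoo"))        # TRUE
all(obs$bill_length_mm >= 32.1 & obs$bill_length_mm <= 59.6)    # TRUE
all(obs$year %in% 2007:2009)                                    # TRUE
```

Running such checks whenever the data is updated keeps the dictionary and the dataset from drifting apart.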
When Manual Creation is Most Effective
Manual creation works optimally when:
- You have few variables
- Your data structure remains stable over time
- You need detailed, contextual descriptions
- You’re collaborating with non-technical team members
- You require precise control over documentation details
For larger datasets, manual creation can become tedious. The next section covers automated approaches using R packages, which can generate data dictionaries instantly and update them automatically when your data changes.