Concepts & Manual Creation

Without proper documentation, datasets become nearly impossible for others (or future you) to understand and use. In this hands-on section, you’ll learn to create data dictionaries manually and practice with real research data.

Note: What You’ll Learn

By the end of this section, you will:

  • Understand what a data dictionary is and why it matters for open science
  • Know the essential components every data dictionary should include
  • Be able to create a data dictionary manually using spreadsheets, Word, or Markdown
  • Have created a partial data dictionary for the Palmer Penguins dataset

What is a Data Dictionary?

A data dictionary is a structured document that provides comprehensive information about each variable in your dataset. Think of it as the “instruction manual” for your data: it tells anyone (including future you) exactly what each column means, how it was measured, and how to interpret the values.

Tip: Consider This

Before we continue, reflect on your most recent research project. How would a colleague understand your data if you shared it with them today? What questions would they need to ask you?

Essential Components

Every data dictionary should include:

  • Variable Name: The exact column name in your data file (e.g., bill_length_mm)
  • Variable Label: A human-readable name (e.g., “Bill Length”)
  • Description: Clear definition of what was measured (e.g., “Length of the penguin’s bill measured in millimeters”)
  • Data Type: The type of data stored (numeric, character, categorical, date)
  • Units: For numeric variables, the unit of measurement (mm, grams, years)
  • Valid Values: Expected range or categories (e.g., “0-100” or “male, female”)
  • Missing Values: How missing data is coded (NA, -999, blank)
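
These seven components map naturally onto a table with one row per variable and one column per component. The sketch below shows one possible way to hold such a table in R as a data frame. The column names (`variable_name`, `valid_values`, and so on) and the "positive numbers" entry are illustrative placeholders, not a required standard.

```r
# Illustrative sketch: one row per variable, one column per component.
# Column names and the "positive numbers" entry are placeholders.
dictionary <- data.frame(
  variable_name = c("bill_length_mm", "sex"),
  label         = c("Bill Length", "Penguin Sex"),
  description   = c("Length of the penguin's bill measured in millimeters",
                    "Biological sex of the penguin"),
  type          = c("numeric", "categorical"),
  units         = c("mm", NA),            # units apply to numeric variables only
  valid_values  = c("positive numbers", "male, female"),
  missing       = c("NA", "NA"),          # how missing data is coded in the file
  stringsAsFactors = FALSE
)

print(dictionary)
```

Keeping the dictionary in this shape means it can be saved, shared, and checked like any other dataset.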

Why Data Dictionaries Matter

Preventing Critical Errors

Without documentation, researchers make dangerous assumptions. Is score a percentage (0-100) or raw points (0-50)? Is treatment coded as 1/0 or A/B? These ambiguities lead to analysis errors that can invalidate entire studies.

Note: Real-World Example

A research team spent months analyzing survey data where satisfaction was coded 1-5. They treated 5 as “very satisfied.” Only later did they discover that 1 was actually the highest satisfaction score. Their conclusions were completely backwards! This type of error with reverse-coded items is well-documented in research and can significantly impact study results (Hughes 2009).
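
Once the direction of a scale is written down, correcting a reverse-coded item is a small transformation. The following is a minimal sketch in R with made-up responses; the variable names and values are hypothetical and are not taken from the study described above.

```r
# Hypothetical responses on a 1-5 scale where 1 = most satisfied (reverse-coded).
satisfaction_raw <- c(1, 2, 5, 3, NA)

# Reverse the scale so that higher numbers mean higher satisfaction:
# new = (min + max) - old, which maps 1 -> 5, 2 -> 4, ..., 5 -> 1.
scale_min <- 1
scale_max <- 5
satisfaction <- (scale_min + scale_max) - satisfaction_raw

satisfaction
```

Missing responses stay missing after the transformation, so the recoded variable can be documented with the same missing-value code as the original.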

Enabling Collaboration

When a new team member joins your project, a data dictionary gets them productive immediately. They can understand and use your data without hours of explanation or guesswork.

Supporting FAIR Principles

FAIR data must be Findable, Accessible, Interoperable, and Reusable (Wilkinson et al. 2016). A quality data dictionary directly supports all four principles by making your data understandable and usable by others.

Tip: FAIR Data Management Tutorial

For more on FAIR principles and research data management concepts, check out our FAIR Data Management Tutorial.

A Quick Example

Here’s undocumented data on study habits and academic performance:

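| student_id | study_hrs | test_score | major | satisfaction | attend |
|------------|-----------|------------|-------|--------------|--------|
| 001        | 15        | 85         | PSYC  | 4            | Y      |
| 002        | 8         | 72         | BIOL  | 3            | N      |
| 003        | 22        | 94         | PSYC  | 5            | Y      |
| 004        | NA        | 68         | MATH  | 2            | Y      |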

To make sense of this data, and avoid misinterpretation, we have many questions:

  • What time period do study_hrs cover? Per week? Per month?
  • Is test_score out of 100? What test was this?
  • What does the satisfaction scale mean? Is 1 low or high?
  • What do Y/N in attend represent?
  • Why is there an NA in study_hrs for student 004?

With a data dictionary, these questions disappear. Here’s what documentation looks like:

| Variable Name | Label | Description | Type | Units | Valid Values | Missing |
|---------------|-------|-------------|------|-------|--------------|---------|
| student_id | Student ID | Unique identifier for each participant | character | none | 001-999 | none |
| study_hrs | Weekly Study Hours | Self-reported hours spent studying per week | numeric | hours | 0-168 | NA |
| test_score | Midterm Score | Score on standardized midterm exam | numeric | points | 0-100 | none |
| major | Academic Major | Student’s declared major field | categorical | none | PSYC, BIOL, MATH, CHEM | none |
| satisfaction | Course Satisfaction | Rating of course satisfaction (5-point scale) | numeric | 1-5 scale | 1=very unsatisfied, 5=very satisfied | none |
| attend | Lecture Attendance | Regular attendance at lectures (>80%) | categorical | none | Y=yes, N=no | none |

This documentation answers each of those questions at a glance. Now we know that study_hrs are weekly hours, test_score is out of 100, satisfaction runs from 1 (very unsatisfied) to 5 (very satisfied), and attend records whether a student attended more than 80% of lectures. This makes analysis straightforward and reliable.
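
A dictionary also makes it possible to check the data against its own documentation. The sketch below retypes the four example rows by hand and tests each column against the documented valid values. The helper `in_range()` is a hypothetical convenience function written for this illustration, not part of any package.

```r
# The four example rows from the undocumented table, typed in by hand.
study <- data.frame(
  student_id   = c("001", "002", "003", "004"),
  study_hrs    = c(15, 8, 22, NA),
  test_score   = c(85, 72, 94, 68),
  major        = c("PSYC", "BIOL", "PSYC", "MATH"),
  satisfaction = c(4, 3, 5, 2),
  attend       = c("Y", "N", "Y", "Y"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when every non-missing value lies in [lower, upper].
in_range <- function(x, lower, upper) {
  all(x >= lower & x <= upper, na.rm = TRUE)
}

# One check per documented rule in the Valid Values column.
checks <- c(
  study_hrs    = in_range(study$study_hrs, 0, 168),
  test_score   = in_range(study$test_score, 0, 100),
  satisfaction = in_range(study$satisfaction, 1, 5),
  major        = all(study$major %in% c("PSYC", "BIOL", "MATH", "CHEM")),
  attend       = all(study$attend %in% c("Y", "N"))
)

checks
```

Any `FALSE` in `checks` would point to a value that contradicts the dictionary, which is a prompt to fix either the data or the documentation.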

Choosing Your Documentation Tool

Before creating your data dictionary, select the tool that best fits your workflow. Here are three common approaches:

Spreadsheet Software (Excel, Google Sheets, LibreOffice)

Best for: Most research teams, easy sharing and collaboration

Create your dictionary table directly in a spreadsheet with columns for each essential component. Save as CSV for better compatibility with R and other analysis software.
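
To illustrate the CSV route, the sketch below writes a small dictionary to a CSV file and reads it back with base R. The file name `study_dictionary.csv`, the temporary folder, and the lower-case column names are assumptions made for this example. One detail worth noting: the Missing column often contains the literal text "NA", so the example reads every column as text and switches off NA conversion.

```r
# Two rows from the study-habits dictionary, using illustrative column names.
dictionary <- data.frame(
  variable_name = c("study_hrs", "test_score"),
  label         = c("Weekly Study Hours", "Midterm Score"),
  description   = c("Self-reported hours spent studying per week",
                    "Score on standardized midterm exam"),
  type          = c("numeric", "numeric"),
  units         = c("hours", "points"),
  valid_values  = c("0-168", "0-100"),
  missing       = c("NA", "none"),
  stringsAsFactors = FALSE
)

# Write to a temporary location (hypothetical file name).
path <- file.path(tempdir(), "study_dictionary.csv")
write.csv(dictionary, path, row.names = FALSE)

# Read every column as text. na.strings = character(0) means no string is
# treated as missing, so the code "NA" in the Missing column is kept as text.
dictionary_back <- read.csv(
  path,
  colClasses = "character",
  na.strings = character(0),
  stringsAsFactors = FALSE
)

dictionary_back$missing
```

In a real project the file would typically sit next to the data it describes rather than in a temporary folder.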

Pros: Familiar interface, collaborative editing, easy to share

Cons: Can be tricky to track changes over time

Markdown Tables (in .md or .qmd files)

Best for: Projects using Quarto or reproducible workflows

Use plain text tables that render beautifully in Quarto documents. These are great for keeping your documentation in the same file as your analysis code.

Template:

| Variable Name | Label | Description | Type | Units | Valid Values | Missing |
|---------------|-------|-------------|------|-------|--------------|---------|
| variable1     |       |             |      |       |              |         |
| variable2     |       |             |      |       |              |         |

Pros: Works seamlessly with Quarto, readable as plain text, keeps documentation with code

Cons: Less intuitive for non-technical collaborators
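
If the dictionary already exists as a data frame, a table in the template's format can be generated instead of typed by hand. The following is a minimal base-R sketch. The function `to_markdown_table()` is a hypothetical helper written for this example, and it assumes the cell contents contain no `|` characters.

```r
# Hypothetical helper: turn a data frame into the lines of a markdown pipe table.
to_markdown_table <- function(df) {
  header    <- paste0("| ", paste(names(df), collapse = " | "), " |")
  separator <- paste0("|", paste(rep("---", ncol(df)), collapse = "|"), "|")
  rows <- vapply(seq_len(nrow(df)), function(i) {
    # Collect row i across all columns as text.
    cells <- vapply(df, function(col) as.character(col[i]), character(1))
    paste0("| ", paste(cells, collapse = " | "), " |")
  }, character(1))
  c(header, separator, rows)
}

# A shortened dictionary with three of the seven columns, for illustration.
dictionary <- data.frame(
  `Variable Name` = c("study_hrs", "test_score"),
  Type            = c("numeric", "numeric"),
  Units           = c("hours", "points"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

md <- to_markdown_table(dictionary)
cat(md, sep = "\n")
```

The printed lines can be pasted into a .md or .qmd file. If the knitr package is installed, `knitr::kable(dictionary, format = "pipe")` should produce a comparable table with less code.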

Word Processors (Word, Google Docs)

Best for: Teams preferring familiar document formats

Create a simple table with the essential columns, similar to spreadsheets but with more formatting options.

| Variable Name | Label | Description | Type | Units | Valid Values | Missing |
|---------------|-------|-------------|------|-------|--------------|---------|
| species       |       |             |      |       |              |         |
| bill_length_mm|       |             |      |       |              |         |
| sex           |       |             |      |       |              |         |
| year          |       |             |      |       |              |         |

Pros: Rich formatting, familiar to most researchers

Cons: Harder to integrate with R analysis code

Tip: Recommendation for This Tutorial

Since you’re learning with Quarto, try creating your data dictionary as a markdown table in a .qmd file. This will help you practice Quarto syntax and see how documentation integrates with your analysis code.

Exercise: Document the Palmer Penguins Data

Now apply what you’ve learned to the Palmer Penguins dataset from the previous section.

Step 1: Load and Examine the Data

Examine the data by either opening the CSV with a text editor or running this code in R to see what you’re working with:

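# Load readr, which provides read_csv()
library(readr)
# Read the cleaned data; show_col_types = FALSE hides the column-type message
penguins_clean <- read_csv("data/penguins_clean.csv", show_col_types = FALSE)
# Preview the first six rows
head(penguins_clean)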
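
Beyond `head()`, a per-column summary can suggest what to write in the Type, Valid Values, and Missing columns. The sketch below uses a small stand-in data frame with illustrative values so that it runs without the tutorial's CSV file; in your project you would apply the same idea to `penguins_clean`. The helper `describe_column()` is hypothetical and written only for this example.

```r
# Stand-in rows with the four exercise columns (illustrative values only).
penguins_demo <- data.frame(
  species        = c("Adelie", "Gentoo", "Chinstrap", "Adelie"),
  bill_length_mm = c(39.1, 46.1, NA, 40.3),
  sex            = c("male", "female", NA, "female"),
  year           = c(2007, 2008, 2009, 2007),
  stringsAsFactors = FALSE
)

# Hypothetical helper: summarise one column for the dictionary.
describe_column <- function(x) {
  values <- x[!is.na(x)]
  list(
    type         = class(x)[1],
    n_missing    = sum(is.na(x)),
    valid_values = if (is.numeric(x)) {
      paste(range(values), collapse = "-")          # observed min-max
    } else {
      paste(sort(unique(values)), collapse = ", ")  # observed categories
    }
  )
}

column_summaries <- lapply(penguins_demo, describe_column)
str(column_summaries)
```

The summary reports R storage types such as "character", and only the values that happen to be observed, so it is a starting point. Deciding that a text column is categorical, or that a wider range is valid, still requires your knowledge of how the data was collected.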

Step 2: Choose Your Documentation Tool

Pick one:

  • Spreadsheet (Excel, Google Sheets): Open a new spreadsheet file
  • Markdown (recommended for this tutorial): Create a new file data/penguins_clean_dictionary.qmd in your project folder alongside your data
  • Word processor: Open a new Word/Google Docs document

Step 3: Create Your Data Dictionary

Create a data dictionary for these 4 variables: species, bill_length_mm, sex, year

For each variable, you’ll document:

  • Label: Human-readable name
  • Description: What does this measure or represent?
  • Type: categorical or numeric?
  • Units: What are the units (if numeric)?
  • Valid Values: What values are valid/expected?
  • Missing: How are missing data coded?

Tip: For Spreadsheet Users (Excel, Google Sheets)
  1. Open a new spreadsheet
  2. In row 1, create these 7 column headers:
    • Variable Name | Label | Description | Type | Units | Valid Values | Missing
  3. In rows 2-5, add these variable names in the first column:
    • species
    • bill_length_mm
    • sex
    • year
  4. Fill in the remaining cells for each variable

Tip: For Word Processor Users (Word, Google Docs)
  1. Open a new document
  2. Insert a table with 7 columns and 5 rows (Table → Insert Table → 7x5)
  3. In row 1, add the column headers:
    • Variable Name | Label | Description | Type | Units | Valid Values | Missing
  4. In rows 2-5, add these variable names in the first column:
    • species
    • bill_length_mm
    • sex
    • year
  5. Fill in the remaining cells for each variable

Tip: For Quarto/Markdown Users (.qmd file)
  1. Create a new Quarto file named penguins_clean_dictionary.qmd and save it in the data folder of your project.
  2. Copy and paste this markdown table into the file:
| Variable Name  | Label | Description | Type | Units | Valid Values | Missing |
|----------------|-------|-------------|------|-------|--------------|---------|
| species        |       |             |      |       |              |         |
| bill_length_mm |       |             |      |       |              |         |
| sex            |       |             |      |       |              |         |
| year           |       |             |      |       |              |         |
  3. Fill in the empty cells for each variable

Complete this exercise before looking at the solution below!

| Variable Name | Label | Description | Type | Units | Valid Values | Missing |
|---------------|-------|-------------|------|-------|--------------|---------|
| species | Penguin Species | Species of penguin observed | categorical | none | Adelie, Chinstrap, Gentoo | none |
| bill_length_mm | Bill Length | Length of the penguin’s bill (culmen) from tip to base | numeric | millimeters | 32.1-59.6 | NA |
| sex | Penguin Sex | Biological sex of the penguin | categorical | none | male, female | NA |
| year | Observation Year | Year the observation was recorded | numeric | year | 2007, 2008, 2009 | none |

Additional context:

  • Bill length represents the culmen length (the dorsal ridge of the bill)
  • Missing values (NA) in bill_length_mm and sex occur when measurements couldn’t be obtained
  • Year is numeric but only 2007-2009 are valid values

When Manual Creation is Most Effective

Manual creation works best when:

  • You have few variables
  • Your data structure remains stable over time
  • You need detailed, contextual descriptions
  • You’re collaborating with non-technical team members
  • You require precise control over documentation details

For larger datasets, manual creation can become tedious. The next section covers automated approaches using R packages, which can generate data dictionaries quickly and regenerate them when your data changes.


References

Hughes, Gail D. 2009. “The Impact of Incorrect Responses to Reverse-Coded Survey Items.” Research in the Schools 16 (2): 76–88.
Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, et al. 2016. “The FAIR Guiding Principles for Scientific Data Management and Stewardship.” Scientific Data 3 (1): 160018. https://doi.org/10.1038/sdata.2016.18.