Welcome to the synthetic data tutorial!

Tutorial Overview

This self-paced tutorial introduces you to the generation and evaluation of synthetic data. Synthetic data is generated data that can be used as an alternative to privacy-sensitive data, for example to enhance open science practices. The advantages of open (synthetic) data are numerous: other researchers can re-run analyses with data that is close to the actual data, which allows them to verify the main results. Additionally, open (synthetic) data allows researchers to perform exploratory analyses that may lead to novel hypotheses, and in many instances such analyses on synthetic data yield reasonably accurate results. Moreover, realistic synthetic data can be used in teaching, or to start model building while access to the real data is still prohibited. All in all, synthetic data makes open science practices easier and might spark collaborations with potential data users.

The tutorial is intended to take approximately 2-3 hours to complete, and is split into the following sections:

  1. Statistical Disclosure Control provides a very brief introduction to statistical disclosure control.
  2. Synthetic data: The general idea conceptually introduces the idea of synthetic data and contains an optional section on coding your own simple synthesizer.
  3. Generating synthetic data outlines how synthetic data can be generated in R.
  4. Evaluating synthetic data quality addresses the privacy-utility trade-off, and discusses how the quality of synthetic data can be evaluated from both sides of this trade-off.

At the end of this tutorial, you will know what synthetic data is and why it is useful, have experience with generating synthetic data, and know how to think about whether the data is fit for release.

Recommended Software

This tutorial assumes you have the following software installed:

Also, the tutorial requires the following R packages:

Before starting, please install the required packages as follows.

install.packages("synthpop")
install.packages("densityratio")
install.packages("mvtnorm")
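
After installation, it can help to confirm that all three packages are actually available before starting. The following is a minimal sketch, not part of the original setup instructions; the variable names `required`, `is_installed`, and `not_installed` are illustrative assumptions.

```r
# Packages named in the install commands above.
required <- c("synthpop", "densityratio", "mvtnorm")

# requireNamespace() returns FALSE (rather than raising an error)
# when a package cannot be found.
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)

# Report which packages, if any, still need to be installed.
not_installed <- required[!is_installed]
if (length(not_installed) > 0) {
  message("Still missing: ", paste(not_installed, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

If any package is reported as missing, re-running the corresponding `install.packages()` call is usually the first thing to try.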