Evaluating privacy and utility of synthetic data
This section builds on the previous section and assumes you have completed all exercises there. You will need the following setup code and its output.
data <- readRDS(
  url("https://github.com/lmu-osc/synthetic-data-tutorial/raw/refs/heads/main/data/boys.RDS")
)
library(synthpop)
library(densityratio)
library(ggplot2)
syn_param <- syn(
  data = data,
  method = "parametric",
  default.method = c("norm", "logreg", "polyreg", "polr"),
  seed = 123
)
method <- syn_param$method
method["bmi"] <- "~I(wgt/(hgt/100)^2)"
visit <- c("age", "hgt", "wgt", "hc", "gen", "phb", "tv", "reg", "bmi")
syn_passive <- syn(
  data = data,
  method = method,
  visit.sequence = visit,
  seed = 123,
  print.flag = FALSE
)
Synthetic data utility
The quality of synthetic data can be evaluated on multiple levels and in different ways. Often, three types of utility measures are distinguished (Drechsler and Haensch 2024):
- Fit-for-purpose measures are typically the first step in evaluating synthetic data utility. They evaluate, for example, whether variable types are consistent with the observed data and whether constraints in the observed data are reproduced in the synthetic data (e.g., non-negative quantities, or deterministic relationships such as the one between hgt, wgt and bmi).
- Global utility measures compare the observed and synthetic data on the level of the entire (multivariate) distribution. In theory, if the observed and synthetic data are drawn from the same distribution, any analysis performed on the synthetic data should yield results close to those obtained from the observed data. In practice, however, such global measures can be too broad, and high global utility does not guarantee that the results of specific analyses are similar between observed and synthetic data.
- Outcome-specific utility measures evaluate utility for a specific analysis. In the context of synthetic data for open science, it would be helpful if results from the observed data can be approximated closely in the synthetic data. Note that outcome-specific utility really concerns a particular set of analyses and need not transfer to different analyses.
In isolation, none of these measures typically provides an all-encompassing assessment of utility, also because utility is use-case dependent. In many instances, the synthetic data modeller does not know what the data will be used for. In such settings, outcome-specific utility measures are only of limited help, and global utility measures might provide the broadest picture of the utility of the synthetic data. For open science purposes, however, one could argue that fit-for-purpose and outcome-specific utility measures are most important, because in many instances a data user will want to reproduce the original analyses. Note that even if outcome-specific utility is low, the synthetic data can still be very useful: it at least allows one to run the original analysis code and evaluate whether it contains any errors. In what follows, we briefly discuss all three types of measures.
Fit-for-purpose measures
We already evaluated whether the synthetic data looked plausible, and noticed that our synthetic data had some impossible negative values and fixed the relationship between hgt and wgt for the bmi variable in the synthetic data. We now go one step further, and compare the entire marginal distributions of the variables in the synthetic data with the corresponding variables in the observed data.
1. Compare descriptive statistics from syn_passive with descriptive statistics we calculated in the observed data. What do you see?
summary(data)
      age              hgt              wgt              bmi
Min. : 0.035 Min. : 50.0 Min. : 3.14 Min. :11.77
1st Qu.: 1.581 1st Qu.: 84.0 1st Qu.: 11.70 1st Qu.:15.89
Median :10.505 Median :145.8 Median : 34.65 Median :17.44
Mean : 9.159 Mean :131.1 Mean : 37.18 Mean :18.07
3rd Qu.:15.267 3rd Qu.:174.8 3rd Qu.: 59.58 3rd Qu.:19.50
Max. :21.177 Max. :198.0 Max. :117.40 Max. :38.64
hc gen phb tv reg
Min. :33.70 G1:281 P1:319 Min. : 1.000 north: 81
1st Qu.:48.27 G2:170 P2:119 1st Qu.: 2.000 east :161
Median :53.20 G3: 52 P3: 56 Median : 3.000 west :240
Mean :51.63 G4: 85 P4: 62 Mean : 8.471 south:193
3rd Qu.:56.00 G5:160 P5:103 3rd Qu.:15.000 city : 73
Max. :65.00 P6: 89 Max. :25.000
summary(syn_passive$syn)
      age              hgt               wgt               bmi
Min. : 0.038 Min. : 45.28 Min. :-14.53 Min. :-32.13
1st Qu.: 1.595 1st Qu.: 82.89 1st Qu.: 12.06 1st Qu.: 14.73
Median :10.499 Median :138.53 Median : 40.66 Median : 19.19
Mean : 9.063 Mean :130.15 Mean : 36.59 Mean : 17.59
3rd Qu.:15.249 3rd Qu.:171.64 3rd Qu.: 58.77 3rd Qu.: 22.76
Max. :20.813 Max. :222.36 Max. : 93.29 Max. : 62.18
hc gen phb tv reg
Min. :37.27 G1:248 P1:267 Min. :-7.000 north: 90
1st Qu.:46.55 G2:168 P2:145 1st Qu.: 2.000 east :161
Median :51.43 G3: 73 P3: 60 Median : 7.000 west :225
Mean :51.46 G4:102 P4: 74 Mean : 8.591 south:197
3rd Qu.:56.49 G5:157 P5:129 3rd Qu.:15.000 city : 75
Max. :68.14 P6: 73 Max. :27.000
Most measures of central tendency are quite close, but the minima and maxima are often substantially off. That is, we observe negative values for wgt, bmi and tv. Moreover, some exceptionally large values occur for hgt, bmi and hc. Also, the categorical variables have somewhat different counts per category in the synthetic data. This leaves quite some room for further fine-tuning of our synthesis model.
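To make this check quick to repeat, a small base R sketch counts the number of negative values per numeric variable in the synthetic data:
# count impossible negative values in each numeric synthetic variable
sapply(
  Filter(is.numeric, syn_passive$syn),
  function(x) sum(x < 0, na.rm = TRUE)
)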
2. Use compare() from the synthpop package to compare the distributions of the observed data with the syn_passive data, set the parameters utility.stats = NULL and utility.for.plot = NULL. What do you see?
Note: We set utility.stats = NULL and utility.for.plot = NULL because here we just focus on the visualizations. We will use different utility statistics later on.
compare(
  syn_passive,
  data,
  utility.stats = NULL,
  utility.for.plot = NULL,
  print.flag = FALSE
)
Comparing percentages observed with synthetic

[Comparison plots of the observed and synthetic percentages per variable are displayed here.]
Here, we see that we really did not do a great job in preserving the marginal distributions of the variables. While for age, hgt and wgt the distributions are close to reasonable (although some problems arise in the tails), the marginal distributions of bmi, hc and tv are really off, with tails that are way too long and certain areas of the distribution that are poorly captured. All in all, this suggests that we can do substantially better still when our goal is to model the entire distribution of the observed data.
We further explore the utility of the synthetic data on a multivariate level. Since visual inspection is typically most insightful, we use the plot() function.
3. Plot the variables age, hgt, wgt, bmi, hc and tv against each other, by calling plot() on the subset of the data containing these variables. Do the same for the synthetic data.
plot(
  data[, c("age", "hgt", "wgt", "bmi", "hc", "tv")],
  main = "Observed data"
)
plot(
  syn_passive$syn[, c("age", "hgt", "wgt", "bmi", "hc", "tv")],
  main = "Synthetic data"
)
We see that the distributions of the synthetic data are much noisier than those of the observed data. Moreover, non-linear relationships are not captured, and the relationships of all variables with hc and tv seem quite far off. So, there are still improvements we can make.
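To zoom in on one problematic relationship, a simple overlay can be helpful. The following sketch uses ggplot2 (loaded in the setup code) to plot age against hc for the observed and synthetic data in a single figure:
# overlay observed (black) and synthetic (red) points for age versus hc
ggplot() +
  geom_point(data = data, aes(x = age, y = hc), colour = "black", alpha = 0.4) +
  geom_point(data = syn_passive$syn, aes(x = age, y = hc), colour = "red", alpha = 0.4) +
  labs(title = "Observed (black) versus synthetic (red)", x = "age", y = "hc")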
Now that we have investigated fit-for-purpose utility, we must decide whether the synthetic data is of sufficient quality. By now, you probably suspect that this depends on the use case at hand. You have seen that some measures of utility indicate substantial problems with the synthetic data, in the sense that certain aspects of the distribution of the data are poorly modelled. At the same time, this is not to say that the synthetic data is useless: it already contains quite some information, and there is quite a lot we can do with it. We will come back to this in the section on outcome-specific utility measures. For now, we attempt to further improve the utility of the synthetic data.
Further synthesis improvements
To improve the utility of the synthetic data, we use classification and regression trees for the continuous variables (the "cart" method in synthpop), a non-parametric method that is better able to capture non-linear relationships between variables. In short, cart recursively splits the predictor space according to the predictor that best predicts the outcome, yielding so-called leaves in which subsets of the outcome with relatively similar values are collected (see, e.g., Section 8.1.1 in James et al. 2023). Synthetic data is then obtained by passing the synthetic cases' predictor values down the learned splits, after which outcome values are sampled from the corresponding leaves. Because this approach recycles observed data, we add some smoothing, so that observed values are not exactly reproduced in the synthetic data. This smoothing is often essential from a privacy perspective, because it ensures that synthetic values are not exactly equal to observed data values.
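To make the procedure concrete, the following minimal sketch mimics the idea for a single variable using the rpart package (which synthpop's cart method builds on, and which ships with most R installations); syn() performs all of this for us, sequentially over variables and with smoothing:
library(rpart)

# grow a regression tree for hgt given age
tree <- rpart(hgt ~ age, data = data, method = "anova")

# tree$where records the leaf each observed case falls into; a crude
# synthetic hgt is obtained by resampling observed hgt values within leaves
syn_hgt <- ave(
  data$hgt, tree$where,
  FUN = function(x) sample(x, size = length(x), replace = TRUE)
)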
4. Adjust the previously created method vector by replacing every instance of "norm" with "cart", and call syn() with this new method vector and smoothing = "spline".
You may again use seed = 123 to replicate our results.
method[method == "norm"] <- "cart"
syn_cart <- synthpop::syn(
  data,
  method = method,
  smoothing = "spline",
  seed = 123,
  print.flag = FALSE
)
Variable(s) bmi with passive synthesis: relationship does not hold in data.
Total of 723 case(s) where predictors do not give value present in data.
You might want to recompute the variable(s) in the original data.
We once more get the message that the relationship for bmi does not hold in the observed data, but we ignore this again.
5. Check whether the synthetic data looks okay using the compare() function that we used previously.
compare(
  syn_cart,
  data,
  utility.stats = NULL,
  utility.for.plot = NULL,
  print.flag = FALSE
)
Comparing percentages observed with synthetic

[Comparison plots of the observed and synthetic percentages per variable are displayed here.]
In terms of the marginal distributions, this synthesis looks a lot better than our previous one with norm. This looks promising.
6. Plot the variables age, hgt, wgt, bmi, hc and tv against each other, by calling plot() on syn_cart$syn.
plot(
  data[, c("age", "hgt", "wgt", "bmi", "hc", "tv")],
  main = "Observed data"
)
plot(
  syn_cart$syn[, c("age", "hgt", "wgt", "bmi", "hc", "tv")],
  main = "Synthetic data"
)
Using cart, the synthetic data looks substantially more like the observed data than the previously created synthetic data, because it is better able to reproduce non-linear relationships between variables.
We now have two synthetic data sets that seem to have different levels of utility. We will further explore this using global and outcome-specific utility measures.
Global utility measures
We evaluate global utility by comparing the distributions of the synthetic data sets with the distribution of the observed data. This is typically done in one of two ways.
- The first is to compare the two distributions using some discrepancy measure. The smaller the discrepancy, the more the synthetic data resembles the observed data.
- The second is to train a classification model to distinguish between the observed and synthetic data, an approach commonly summarized by the propensity score mean squared error (\(pMSE\); for more information about this method, see Snoke et al. 2018). The easier it is for the classifier to distinguish the observed from the synthetic data, the lower the global utility, because this implies that the two differ in some respects.
We discuss the former approach, as it performed better for evaluating synthetic data utility according to Volker, Wolf, and van Kesteren (2024). Still, the \(pMSE\) is also used relatively often and can likewise yield insight into the global utility of a synthetic data set.
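For completeness, here is a minimal sketch of the \(pMSE\) idea, using a simple logistic regression as the classifier (synthpop's utility.gen() provides a full implementation):
# stack observed and synthetic data with an indicator for synthetic records
stacked <- rbind(data, syn_passive$syn)
stacked$synthetic <- rep(c(0, 1), times = c(nrow(data), nrow(syn_passive$syn)))

# train a classifier to distinguish observed from synthetic records
prop_model <- glm(
  synthetic ~ age + hgt + wgt + hc,
  data = stacked, family = binomial
)

# pMSE: mean squared deviation of the predicted propensities from the
# constant share of synthetic records; 0 means the classifier cannot
# tell the two data sets apart
p <- predict(prop_model, type = "response")
mean((p - mean(stacked$synthetic))^2)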
Divergence estimation through density ratio estimation
One way to evaluate global utility is to compare the distributions of the observed and synthetic data directly. This can be done using density ratio estimation, which, as the name suggests, attempts to estimate the ratio of the densities of two groups of samples (we use the terms distribution and density somewhat interchangeably here, although some nuances exist). That is, we attempt to estimate \[
r(X) = \frac{f_\text{obs}(X)}{f_\text{syn}(X)}.
\] This could be done by estimating the densities \(f_\text{obs}(X)\) and \(f_\text{syn}(X)\) separately, but estimating \(r(X)\) directly typically yields better results. The R package densityratio has been developed for this purpose (Volker, Gonzalez Poses, and van Kesteren 2024). Based on the estimated density ratio, we can compute various divergence measures between the two data sets, which is built-in functionality.
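To see why direct estimation helps, consider the naive alternative for a single variable: estimate both densities separately with kernel density estimation and divide them. The sketch below is for intuition only; with more variables, the errors of the two separate density estimates compound in the ratio, which is why ulsif() below estimates \(r(X)\) directly:
# naive ratio of two kernel density estimates for age, on a common grid
dens_obs <- density(data$age, from = 0, to = 22)
dens_syn <- density(syn_passive$syn$age, from = 0, to = 22)
plot(
  dens_obs$x, dens_obs$y / dens_syn$y, type = "l",
  xlab = "age", ylab = "naive density ratio estimate"
)
abline(h = 1, lty = 2)  # a ratio of 1 indicates identical densities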
7. Compare the observed and synthetic data with the ulsif() function from the densityratio package. Do this for both synthetic data sets.
Note: Set numerator_data = data and denominator_data = syn_passive$syn first, and denominator_data = syn_cart$syn second.
r_passive <- ulsif(data, syn_passive$syn)
r_cart <- ulsif(data, syn_cart$syn)
The ulsif() function performs cross-validation to find the optimal model for estimating the density ratio, so that typically no model specification is required on the part of the user.
8. Call plot() on the estimated model objects. This displays the distributions of the estimated density ratio values for the observed and synthetic data.
Hint: The more these distributions overlap, the higher the similarity between the observed and synthetic data distributions.
plot(r_passive)
Warning: Negative estimated density ratios for 76 observation(s) converted to
0.01 before applying logarithmic transformation
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

plot(r_cart)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You will see that for the syn_passive data, the distributions of density ratio values in the observed and synthetic data overlap much less than for the syn_cart data, indicating higher utility of the latter.
9. Call summary() on the density ratio objects to obtain the estimated Pearson divergence between the observed and synthetic data sets.
summary(r_passive)
Call:
ulsif(df_numerator = data, df_denominator = syn_passive$syn)
Kernel Information:
Kernel type: Gaussian with L2 norm distances
Number of kernels: 200
Optimal sigma: 1.311197
Optimal lambda: 0.0379269
Optimal kernel weights: num [1:201] 0.1069 -0.0237 0.2505 -0.1082 0.2523 ...
Pearson divergence between P(nu) and P(de): 0.8534
For a two-sample homogeneity test, use 'summary(x, test = TRUE)'.
summary(r_cart)
Call:
ulsif(df_numerator = data, df_denominator = syn_cart$syn)
Kernel Information:
Kernel type: Gaussian with L2 norm distances
Number of kernels: 200
Optimal sigma: 4.589124
Optimal lambda: 0.3359818
Optimal kernel weights: num [1:201] 0.02799 0.00165 0.01181 0.01547 0.00911 ...
Pearson divergence between P(nu) and P(de): 0.005043
For a two-sample homogeneity test, use 'summary(x, test = TRUE)'.
The Pearson divergence of the syn_cart data is substantially smaller than the Pearson divergence of the syn_passive data, indicating higher global utility of the former.
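As the summary output suggests, a formal two-sample homogeneity test based on the estimated density ratio can be obtained as well (this involves resampling and may take a moment):
# test of the null hypothesis that the observed and synthetic data
# are drawn from the same distribution
summary(r_cart, test = TRUE)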
Finally, we can plot the estimated density ratios against the variables in the synthetic data, to see where the synthetic data does not fit the observed data.
10. Use plot_bivariate() on the fitted density ratio models, and set grid = TRUE and samples = "denominator" to only display the synthetic data points.
Hint: Very dark blue values indicate that synthetic data points in those regions are rather scarce in the observed data, whereas dark red values indicate that in some regions, the synthetic data is underrepresented.
plot_bivariate(r_passive, grid = TRUE, samples = "denominator")
Warning: Negative estimated density ratios for 76 observation(s) converted to
0.01 before applying logarithmic transformation

plot_bivariate(r_cart, grid = TRUE, samples = "denominator")
The visualizations again show that the syn_passive method produced synthetic data in regions where little observed data is present, and that the cart method produces synthetic data with higher utility.
Outcome-specific utility
For outcome-specific utility, we need at least one analysis that we expect data users will be interested in. For now, we assume that the data users will be interested in predicting head circumference (hc), based on age, height (hgt), weight (wgt) and region (reg). Additionally, we will fit a regression analysis where we add quadratic terms for age and hgt. High outcome-specific utility implies that the regression coefficients obtained from the synthetic data are close to the coefficients estimated from the observed data. One way to measure “closeness” in this context is via the confidence interval overlap. The confidence interval overlap quantifies to what extent the confidence intervals of the observed and synthetic coefficients overlap: \[
CIO = \frac{1}{2} \Big(
\frac{\min (U_\text{obs}, U_\text{syn}) -
\max (L_\text{obs}, L_\text{syn})}{
U_\text{obs} - L_\text{obs}
} + \frac{\min (U_\text{obs}, U_\text{syn}) -
\max (L_\text{obs}, L_\text{syn})}{
U_\text{syn} - L_\text{syn}
} \Big).
\] The terms \(U\) and \(L\) denote the upper and lower bounds of the \(95\%\) confidence intervals for the observed and synthetic data. Visually, the \(CIO\) can be depicted as follows:

The length between the dashed vertical lines is the quantity in the numerators, whereas the lengths of the two confidence intervals appear in the denominators. If the two confidence intervals are almost the same, the confidence interval overlap is large, and the difference between the coefficients is small compared to the uncertainty around them. Conversely, if the overlap is small (or even negative), the coefficients are far apart relative to the random variation one would expect.
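To make the formula concrete, here is a small helper function implementing it (a sketch; compare.fit.synds() below reports this measure for every coefficient automatically):
# confidence interval overlap of two 95% confidence intervals
ci_overlap <- function(l_obs, u_obs, l_syn, u_syn) {
  overlap <- min(u_obs, u_syn) - max(l_obs, l_syn)
  0.5 * (overlap / (u_obs - l_obs) + overlap / (u_syn - l_syn))
}

ci_overlap(l_obs = 0.10, u_obs = 0.30, l_syn = 0.15, u_syn = 0.35)  # partial overlap
ci_overlap(l_obs = 0.10, u_obs = 0.30, l_syn = 0.10, u_syn = 0.30)  # identical intervals: CIO = 1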
11. Fit a linear regression model with main effects only using lm.synds() for both synthetic data sets. Compare the results with the observed data using compare.fit.synds().
fit_passive <- lm.synds(
  hc ~ age + hgt + wgt + reg,
  data = syn_passive
)
fit_cart <- lm.synds(
  hc ~ age + hgt + wgt + reg,
  data = syn_cart
)
compare.fit.synds(fit_passive, data = data)
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the synthpop package.
  Please report the issue to the authors.
Call used to fit models to the data:
lm.synds(formula = hc ~ age + hgt + wgt + reg, data = syn_passive)
Differences between results based on synthetic and observed data:
Synthetic Observed Diff Std. coef diff CI overlap
(Intercept) 30.602506217 30.66936281 -0.066856593 -0.10059117 0.9743385
age -0.596425770 -0.55595068 -0.040475086 -0.65892237 0.8319045
hgt 0.204084456 0.20210832 0.001976133 0.23287705 0.9405915
wgt -0.007259125 -0.01267189 0.005412765 0.50896602 0.8701593
regeast 0.129893022 0.15071123 -0.020818211 -0.06542263 0.9833102
regwest -0.334970650 -0.14074237 -0.194228275 -0.64717776 0.8349006
regsouth 0.068988061 0.07491878 -0.005930718 -0.01911351 0.9951240
regcity 0.243792835 0.27100778 -0.027214945 -0.07238087 0.9815352
Measures for one synthesis and 8 coefficients
Mean confidence interval overlap: 0.926483
Mean absolute std. coef diff: 0.2881814
Mahalanobis distance ratio for lack-of-fit (target 1.0): 0.27
Lack-of-fit test: 2.167614; p-value 0.9754 for test that synthesis model is
compatible with a chi-squared test with 8 degrees of freedom.
Confidence interval plot:

compare.fit.synds(fit_cart, data = data)
Call used to fit models to the data:
lm.synds(formula = hc ~ age + hgt + wgt + reg, data = syn_cart)
Differences between results based on synthetic and observed data:
Synthetic Observed Diff Std. coef diff CI overlap
(Intercept) 30.99127753 30.66936281 0.321914719 0.4843468 0.8764399
age -0.63890199 -0.55595068 -0.082951302 -1.3504225 0.6554981
hgt 0.19921996 0.20210832 -0.002888364 -0.3403788 0.9131671
wgt 0.01477423 -0.01267189 0.027446117 2.5807774 0.3416263
regeast 0.08083825 0.15071123 -0.069872978 -0.2195806 0.9439835
regwest -0.50173267 -0.14074237 -0.360990296 -1.2028367 0.6931483
regsouth -0.18209655 0.07491878 -0.257015324 -0.8283085 0.7886929
regcity 0.10239185 0.27100778 -0.168615934 -0.4484510 0.8855971
Measures for one synthesis and 8 coefficients
Mean confidence interval overlap: 0.7622692
Mean absolute std. coef diff: 0.9318878
Mahalanobis distance ratio for lack-of-fit (target 1.0): 1.2
Lack-of-fit test: 9.596862; p-value 0.2945 for test that synthesis model is
compatible with a chi-squared test with 8 degrees of freedom.
Confidence interval plot:

The results show that the syn_passive data yields outcome-specific utility very similar to that of the syn_cart data, even though the previous utility measures indicated much higher utility for syn_cart. The output shows the regression coefficients in both the synthetic and observed data, along with the raw and standardized differences between these coefficients and the confidence interval overlap. Further, the results show the Mahalanobis distance ratio (the distance between the observed and synthetic coefficients relative to what is expected under a correct synthesis model) and the corresponding test statistic. The accompanying figure shows the confidence intervals for the observed and synthetic data coefficients. For both synthesis methods, the confidence intervals overlap substantially with those obtained in the real data, suggesting that the differences between the estimates are small compared to the uncertainty around the regression coefficients (i.e., the sampling error).
This illustration has important consequences! If you expect that synthetic data users are solely interested in analyses of main effects (i.e., regression coefficients in linear models, or mean differences in t-tests), modelling these main effects appropriately is typically sufficient to obtain reasonable synthetic data. Even if the marginal (univariate) distributions are poorly captured, many standard analyses can still be conducted. Hence, modelling every aspect of the data is often not required to obtain useful synthetic data. Additionally, this analysis shows that low global utility does not necessarily imply low outcome-specific utility. The reverse also holds, though not in this data set: high global utility does not automatically imply high outcome-specific utility. It is often helpful to be explicit about what the synthetic data can and cannot do, and this is easier with relatively simple synthesis models, such as the parametric models used in syn_passive [see also the section on publishing synthetic data]. More complex non-parametric models are inherently more of a black box, and what comes out can be surprising, even if you carefully evaluated their utility.
12. Fit a linear regression model with the above specification using lm.synds() for both synthetic data sets, but now add the quadratic effects for age and hgt. Compare the results with the observed data using compare.fit.synds().
fit_passive_qd <- lm.synds(
  hc ~ age + I(age^2) + hgt + I(hgt^2) + wgt + reg,
  data = syn_passive
)
fit_cart_qd <- lm.synds(
  hc ~ age + I(age^2) + hgt + I(hgt^2) + wgt + reg,
  data = syn_cart
)
compare.fit.synds(fit_passive_qd, data = data)
Call used to fit models to the data:
lm.synds(formula = hc ~ age + I(age^2) + hgt + I(hgt^2) + wgt +
reg, data = syn_passive)
Differences between results based on synthetic and observed data:
Synthetic Observed Diff Std. coef diff CI overlap
(Intercept) 3.125143e+01 7.86589945 23.385534509 20.1754637 -4.1468965
age -4.993335e-01 -1.98876093 1.489427467 12.0109188 -2.0640662
I(age^2) -5.519209e-03 0.06293745 -0.068456663 -15.8993032 -3.0560192
hgt 1.889536e-01 0.65653667 -0.467583051 -21.8138923 -4.5648707
I(hgt^2) 6.059895e-05 -0.00175106 0.001811659 24.7376841 -5.3107497
wgt -7.706577e-03 0.04320334 -0.050909918 -4.9226741 -0.2558073
regeast 1.418255e-01 -0.07123832 0.213063801 0.8895770 0.7730629
regwest -3.127436e-01 -0.19088259 -0.121861059 -0.5400926 0.8622187
regsouth 8.098933e-02 -0.08201962 0.163008950 0.6984527 0.8218200
regcity 2.194214e-01 0.16112124 0.058300129 0.2059689 0.9474559
Measures for one synthesis and 10 coefficients
Mean confidence interval overlap: -1.599385
Mean absolute std. coef diff: 10.1894
Mahalanobis distance ratio for lack-of-fit (target 1.0): 62.58
Lack-of-fit test: 625.8325; p-value 0 for test that synthesis model is
compatible with a chi-squared test with 10 degrees of freedom.
Confidence interval plot:

compare.fit.synds(fit_cart_qd, data = data)
Call used to fit models to the data:
lm.synds(formula = hc ~ age + I(age^2) + hgt + I(hgt^2) + wgt +
reg, data = syn_cart)
Differences between results based on synthetic and observed data:
Synthetic Observed Diff Std. coef diff CI overlap
(Intercept) 8.948799707 7.86589945 1.082900e+00 0.934253392 0.7616657
age -1.929329345 -1.98876093 5.943159e-02 0.479263338 0.8777367
I(age^2) 0.059193937 0.06293745 -3.743517e-03 -0.869445065 0.7781987
hgt 0.637150157 0.65653667 -1.938651e-02 -0.904428143 0.7692743
I(hgt^2) -0.001719818 -0.00175106 3.124157e-05 0.426594670 0.8911728
wgt 0.068116212 0.04320334 2.491287e-02 2.408920592 0.3854681
regeast -0.040789759 -0.07123832 3.044856e-02 0.127127829 0.9675688
regwest -0.627628551 -0.19088259 -4.367460e-01 -1.935673815 0.5061966
regsouth -0.120704903 -0.08201962 -3.868529e-02 -0.165756807 0.9577143
regcity 0.161698307 0.16112124 5.770689e-04 0.002038731 0.9994799
Measures for one synthesis and 10 coefficients
Mean confidence interval overlap: 0.7894476
Mean absolute std. coef diff: 0.8253502
Mahalanobis distance ratio for lack-of-fit (target 1.0): 1.87
Lack-of-fit test: 18.71478; p-value 0.044 for test that synthesis model is
compatible with a chi-squared test with 10 degrees of freedom.
Confidence interval plot:

Here, we see that the coefficients estimated from the syn_passive data are substantially off. This is no surprise, as we did not include these quadratic effects in our synthesis model. Since the quadratically transformed variables are typically correlated with the variables on the original scale, omitting them biases the regression coefficients. This is exactly how we modelled the synthetic data, so we knew what to expect. The coefficients from syn_cart look much better and have substantial confidence interval overlap with those obtained from the real data. So, in this scenario, the syn_cart data shows much higher utility in essentially every respect, compared to the parametric synthesis with passive imputation.
Statistical disclosure control
Synthetic data can provide a relatively safe framework for sharing data. However, some risks remain, and it is important to evaluate them. For example, the synthesis models may have been so complex that synthetic records are very similar or even identical to original records, which can lead to privacy breaches.
Synthetic data by itself does not provide any formal privacy guarantees. Such guarantees can be incorporated, for example, by using differentially private synthesis methods. Without going into the details, differential privacy starts from the observation that every individual in a data set contributes some information to the parameters of a model, from which an adversary might learn something about the sampled individuals. Differential privacy bounds the influence a single observation can have on the synthesis model, and adds noise to the model parameters proportional to this influence. That is, if a single individual can have a large influence on the model parameters, more noise is added to protect that individual's privacy. For an accessible introduction to differential privacy, the blog by Desfontaines (2018) is highly recommended; Bowen and Liu (2020) review differentially private synthesis methods. However, these methods are not yet widely available in R. If privacy is not built in by design, it remains important to inspect the synthetic data for potential risks. Especially if you are not entirely sure, it is better to stay on the safe side: use relatively simple, parametric models, check for outliers, and potentially add additional noise to the synthetic data. See also Chapter 4 in the book Synthetic Data for Official Statistics.
We again focus the evaluation of disclosure on identity disclosure and attribute disclosure, as discussed in the Statistical Disclosure Control section. In short, identity disclosure means that it is possible to identify records in the synthetic data from a set of known characteristics. Identity disclosure is typically not possible if the entire data set consists solely of synthetic values, but it remains an important component of attribute disclosure. Attribute disclosure occurs when it is possible to infer new information about a person from a set of known characteristics. For example, if all individuals above a certain age in some region have a given disease, one might conclude that someone satisfying these constraints also has that disease.
We evaluate identity disclosure using the replicated uniques measure, which refers to unique observations in the synthetic data that were also unique cases in the observed data (on a set of identifying variables; Raab 2024). These identifying variables must be specified by the synthetic data modeller. The idea is that observations with a unique combination of identifying variables can be singled out by a third party. If such a combination re-occurs in the synthetic data and corresponds to an actual observation, a third party might conclude that a particular individual was sampled. Moreover, if the synthetic values on at least one of the non-identifying, but potentially sensitive, variables are also the same or very close to the actual values, this might allow attribute disclosure. Such replicated uniques can be removed from the synthetic data, which typically does not reduce the utility of the synthetic data by much.
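The following base R sketch illustrates the idea; multi.disclosure(), used below, computes this properly:
# paste the identifying variables into a single key per record
keys <- c("age", "hgt", "reg")
obs_key <- do.call(paste, data[, keys])
syn_key <- do.call(paste, syn_passive$syn[, keys])

# keys that occur exactly once in each data set
unique_obs <- obs_key[!(duplicated(obs_key) | duplicated(obs_key, fromLast = TRUE))]
unique_syn <- syn_key[!(duplicated(syn_key) | duplicated(syn_key, fromLast = TRUE))]

# replicated uniques: unique in the synthetic data and also unique in the
# observed data; with continuous keys such as age and hgt, exact matches
# are rare
sum(unique_syn %in% unique_obs)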
Attribute disclosure is measured using a prediction model, and essentially asks whether we can correctly predict some target variable from a set of identifying variables. If this is possible, knowing someone's values on the identifying variables allows one to infer potentially sensitive information on the target variable. In synthpop, this procedure is implemented as follows (Raab 2024):
- First, we check which combinations of the identifying variables in the observed data also occur in the synthetic data.
- Subsequently, we evaluate whether the records with the same values on the identifying variables in the synthetic data also have the same values (or very similar values) on the target variable.
- Finally, we check whether this value corresponds to the actual value of the target variable in the original data. The proportion of records that meet all of these criteria is referred to as DiSCO (Disclosive in Synthetic, Correct in Original).
Both measures (replicated uniques and DiSCO) can be evaluated in synthpop using the multi.disclosure() function.
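To see roughly what happens under the hood, here is a rough base R sketch of the DiSCO computation for a single categorical target (gen); multi.disclosure() implements the full procedure for all targets:
# paste the identifying variables into a single key per record
keys <- c("age", "hgt", "reg")
obs_key <- do.call(paste, data[, keys])
syn_key <- do.call(paste, syn_passive$syn[, keys])

# for each key in the synthetic data: the disclosed value of gen if all
# synthetic records sharing that key agree on gen, and NA otherwise
syn_gen <- as.character(syn_passive$syn$gen)
disclosed_by_key <- tapply(
  syn_gen, syn_key,
  function(x) if (length(unique(x)) == 1L) x[1] else NA_character_
)

# DiSCO: percentage of original records whose key occurs in the synthetic
# data, is disclosive there, and discloses the correct original value
disclosed <- disclosed_by_key[obs_key]
mean(!is.na(disclosed) & disclosed == as.character(data$gen)) * 100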
13. Call the function multi.disclosure() on the syn_passive and syn_cart data sets. Use age, hgt and reg as identifying variables (the keys argument).
multi.disclosure(syn_passive, data, keys = c("age", "hgt", "reg"))
------------------ 1 wgt -------------------
-------------------Synthesis 1 --------------------
Table for target wgt from GT alone with keys has 523 rows 748 colums.
Table for target wgt from GT & SD with all key combinations has 1271 rows 1496 colums.
------------------ 2 bmi -------------------
-------------------Synthesis 1 --------------------
Table for target bmi from GT alone with keys has 524 rows 748 colums.
Table for target bmi from GT & SD with all key combinations has 1272 rows 1496 colums.
------------------ 3 hc -------------------
-------------------Synthesis 1 --------------------
Table for target hc from GT alone with keys has 204 rows 748 colums.
Table for target hc from GT & SD with all key combinations has 952 rows 1496 colums.
------------------ 4 gen -------------------
-------------------Synthesis 1 --------------------
Table for target gen from GT alone with keys has 5 rows 748 colums.
Table for target gen from GT & SD with all key combinations has 5 rows 1496 colums.
------------------ 5 phb -------------------
-------------------Synthesis 1 --------------------
Table for target phb from GT alone with keys has 6 rows 748 colums.
Table for target phb from GT & SD with all key combinations has 6 rows 1496 colums.
------------------ 6 tv -------------------
-------------------Synthesis 1 --------------------
Table for target tv from GT alone with keys has 18 rows 748 colums.
Table for target tv from GT & SD with all key combinations has 35 rows 1496 colums.
Disclosure risk for 748 records in the original data
Identity disclosure measures
from keys: age hgt reg
For original ( UiO ) 100 %
For synthetic ( repU ) 0 %.
Table of attribute disclosure measures for age hgt reg
Original measure is Dorig and synthetic measure is DiSCO
Variables Ordered by synthetic disclosure measure
attrib.orig attrib.syn check1 Npairs check2
1 wgt 100 0 0
2 bmi 100 0 0
3 hc 100 0 0
4 gen 100 0 0
5 phb 100 0 0
6 tv 100 0 0

multi.disclosure(syn_cart, data, keys = c("age", "hgt", "reg"))
------------------ 1 wgt -------------------
-------------------Synthesis 1 --------------------
Table for target wgt from GT alone with keys has 523 rows 748 colums.
Table for target wgt from GT & SD with all key combinations has 1271 rows 1496 colums.
------------------ 2 bmi -------------------
-------------------Synthesis 1 --------------------
Table for target bmi from GT alone with keys has 524 rows 748 colums.
Table for target bmi from GT & SD with all key combinations has 1272 rows 1496 colums.
------------------ 3 hc -------------------
-------------------Synthesis 1 --------------------
Table for target hc from GT alone with keys has 204 rows 748 colums.
Table for target hc from GT & SD with all key combinations has 952 rows 1496 colums.
------------------ 4 gen -------------------
-------------------Synthesis 1 --------------------
Table for target gen from GT alone with keys has 5 rows 748 colums.
Table for target gen from GT & SD with all key combinations has 5 rows 1496 colums.
------------------ 5 phb -------------------
-------------------Synthesis 1 --------------------
Table for target phb from GT alone with keys has 6 rows 748 colums.
Table for target phb from GT & SD with all key combinations has 6 rows 1496 colums.
------------------ 6 tv -------------------
-------------------Synthesis 1 --------------------
Table for target tv from GT alone with keys has 18 rows 748 colums.
Table for target tv from GT & SD with all key combinations has 25 rows 1496 colums.
Disclosure risk for 748 records in the original data
Identity disclosure measures
from keys: age hgt reg
For original ( UiO ) 100 %
For synthetic ( repU ) 0 %.
Table of attribute disclosure measures for age hgt reg
Original measure is Dorig and synthetic measure is DiSCO
Variables Ordered by synthetic disclosure measure
attrib.orig attrib.syn check1 Npairs check2
1 wgt 100 0 0
2 bmi 100 0 0
3 hc 100 0 0
4 gen 100 0 0
5 phb 100 0 0
6 tv 100 0 0

From the output, we see that disclosure risk is very low for both synthesis methods, according to the criteria used here. For both synthetic data sets, there are no replicated uniques, and the figures likewise show that the attribute disclosure risk is zero for both data sets. Note that we used a subset of variables that we deemed identifying, but which variables are identifying differs per data set and should be reconsidered for every data set. The same holds for the potentially sensitive target variables. For the sake of illustration, we chose age, hgt and reg, but depending on the use case, different choices might be reasonable.
After evaluating disclosure risks, it is possible to perform post-processing of the synthetic data. This procedure can be streamlined using the sdc() function in synthpop. For example, one might remove the replicated uniques from the synthetic data, using the rm.replicated.uniques argument in sdc(). Additionally, it is possible to cap outliers using the bottom.top.coding argument or to smooth variables using smooth.vars.
In our case, we already applied smoothing, and there are no real extreme outliers present. Therefore, the sdc() function cannot do much more to further enhance privacy protection. For the sake of illustration, we nevertheless show how to apply the sdc() function below.
14. Apply the sdc() function with keys = c("age", "hgt", "reg") and rm.replicated.uniques = TRUE, and store the result in syn_passive_sdc.
Hint: This function updates the synthetic data object by removing synthetic records that replicate unique records in the observed data. Make sure to release the synthetic data from the updated object if you use this function.
syn_passive_sdc <- sdc(
  syn_passive,
  data = data,
  keys = c("age", "hgt", "reg"),
  rm.replicated.uniques = TRUE
)
no. of replicated uniques removed: 0
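If the synthetic data did contain extreme values, top/bottom-coding could additionally be applied. The call below sketches what this might look like; the variable bounds are purely hypothetical, and the pairing of recode.vars with bottom.top.coding reflects our reading of the sdc() documentation, so check ?sdc before use:
# hypothetical illustration: cap hgt outside (50, 200) and wgt outside
# (3, 120) in the synthetic data; bounds chosen for illustration only
syn_cart_sdc <- sdc(
  syn_cart,
  data = data,
  recode.vars = c("hgt", "wgt"),
  bottom.top.coding = list(c(50, 200), c(3, 120))
)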
Conclusion
In this tutorial, you have learned to create and evaluate synthetic data. We discussed modelling deterministic relationships, refining synthesis models based on utility measures, comparing utility of synthetic data sets, and evaluating remaining disclosure risks. Remember that if you plan to release synthetic data for open science purposes, it is better to stay on the safe side with respect to privacy. When in doubt, choose the option that yields the smallest risk. We will share some advice on publishing synthetic data in the next section.
- Synthetic data utility can be evaluated from multiple perspectives, including fit-for-purpose utility, global distributional similarity and outcome-specific utility.
- Visual comparisons of observed and synthetic data provide an intuitive first assessment of utility, but can be complemented by additional quantitative measures.
- Simpler parametric models may produce implausible values, but can still yield decent outcome-specific utility.
- More flexible synthesis models, such as CART, can better capture non-linear relationships and improve utility for specific analyses, but typically increase disclosure risk.
- When in doubt, prioritize privacy. Synthetic data are best suited for exploratory analyses, code development and reproducibility checks rather than definitive inference, so utility always remains secondary to privacy.