data <- readRDS(
  url("https://github.com/lmu-osc/synthetic-data-tutorial/raw/refs/heads/main/data/boys.RDS")
)
library(synthpop)
library(densityratio)
library(ggplot2)
syn_param <- syn(
  data = data,
  method = "parametric",
  seed = 123
)

method <- syn_param$method
method["bmi"] <- "~I(wgt/(hgt/100)^2)"
syn_passive <- syn(
  data = data,
  method = method,
  seed = 123,
  print.flag = FALSE
)
Evaluating privacy and utility of synthetic data
This section builds on the previous section, and assumes you have completed all exercises. You will need the output of the code shown above.
Synthetic data utility
The quality of synthetic data can be evaluated on multiple levels and in different ways. Often, three types of utility measures are distinguished (Drechsler and Haensch 2024):
- Fit-for-purpose measures are typically the first step in evaluating synthetic data utility, and evaluate, for example, whether variable types are consistent with the observed data and whether constraints in the observed data are reproduced in the synthetic data (e.g., non-negative quantities, or deterministic relationships like the one between hgt, wgt and bmi).
- Global utility measures compare the observed and synthetic data on the level of the entire (multivariate) distribution. In theory, if the observed and synthetic data are drawn from the same distribution, any analysis performed on the synthetic data should yield results that are close to results obtained from the observed data. However, in practice, such global measures can be too broad, and high global utility does not guarantee that results of specific analyses are similar between observed and synthetic data.
- Outcome-specific utility measures evaluate utility for a specific analysis. In the context of synthetic data for open science, it would be helpful if results from the observed data can be closely approximated in the synthetic data. Note that outcome-specific utility really pertains to a particular set of analyses, and does not need to transfer to different analyses.
In isolation, none of these measures typically provides an all-encompassing qualification of utility, also because utility is use-case dependent. In many instances, the synthetic data modeller does not know what the data will be used for. In these settings, outcome-specific utility measures are only of limited help, and global utility measures might provide the broadest picture of the utility of the synthetic data. However, for open science purposes, one could argue that fit-for-purpose measures and outcome-specific utility measures are most important, because in many instances, a data user would want to reproduce the original analyses. Note, however, that even if outcome-specific utility is low, the synthetic data can still be very useful, as it at least allows data users to run the original analysis code and check whether it contains any errors. In what follows, we briefly discuss all three types of measures.
Fit-for-purpose measures
We already evaluated whether the synthetic data looked plausible, by checking that the variables were strictly positive and that the relationship for bmi was preserved in the synthetic data. We now go one step further, and compare the marginal distributions of the variables in the synthetic data with the corresponding variables in the observed data.
1. Compare descriptive statistics from syn_passive with the descriptive statistics we calculated on the observed data. What do you see?
Show code: Descriptive statistics of the observed and synthetic data
summary(data)
summary(syn_passive$syn)
age hgt wgt bmi
Min. : 0.035 Min. : 50.0 Min. : 3.14 Min. :11.77
1st Qu.: 1.581 1st Qu.: 84.0 1st Qu.: 11.70 1st Qu.:15.90
Median :10.505 Median :145.8 Median : 34.65 Median :17.46
Mean : 9.159 Mean :131.1 Mean : 37.15 Mean :18.04
3rd Qu.:15.267 3rd Qu.:175.0 3rd Qu.: 59.58 3rd Qu.:19.49
Max. :21.177 Max. :198.0 Max. :117.40 Max. :31.74
hc gen phb tv reg
Min. :33.70 G1:287 P1:316 Min. : 1.000 north: 81
1st Qu.:48.50 G2:162 P2:122 1st Qu.: 2.000 east :161
Median :53.20 G3: 55 P3: 54 Median : 3.000 west :240
Mean :51.64 G4: 87 P4: 63 Mean : 8.309 south:193
3rd Qu.:56.00 G5:157 P5:102 3rd Qu.:15.000 city : 73
Max. :65.00 P6: 91 Max. :25.000
age hgt wgt bmi
Min. : 0.038 Min. : 54.0 Min. : 4.120 Min. : 4.602
1st Qu.: 1.595 1st Qu.: 75.8 1st Qu.: 8.648 1st Qu.:14.983
Median :10.499 Median :146.9 Median :44.700 Median :18.927
Mean : 9.063 Mean :128.9 Mean :37.860 Mean :18.598
3rd Qu.:15.249 3rd Qu.:177.9 3rd Qu.:64.000 3rd Qu.:22.199
Max. :20.813 Max. :193.5 Max. :88.100 Max. :44.669
hc gen phb tv reg
Min. :34.90 G1:221 P1:223 Min. : 1.00 north:108
1st Qu.:44.50 G2:159 P2:134 1st Qu.: 2.00 east :144
Median :54.85 G3: 92 P3: 81 Median : 3.00 west :213
Mean :50.91 G4:115 P4: 87 Mean : 8.66 south:176
3rd Qu.:56.90 G5:161 P5:142 3rd Qu.:15.00 city :107
Max. :60.00 P6: 81 Max. :25.00
All descriptives are quite close, although the categorical variables seem to have somewhat different counts per category in the synthetic data. Perhaps we can still fine-tune our model here.
2. Use compare() from the synthpop package to compare the distributions of the observed data with the syn_passive data, setting the parameters utility.stats = NULL and utility.for.plot = NULL. What do you see?
Note: We set utility.stats = NULL and utility.for.plot = NULL because here we just focus on the visualizations. We will use different utility statistics later on.
Show code: Compare synthetic and observed data
compare(
  syn_passive,
  data,
  utility.stats = NULL,
  utility.for.plot = NULL,
  print.flag = FALSE
)
Comparing percentages observed with synthetic

[Plots comparing the observed and synthetic marginal distributions for each variable]
Marginally, we see that the distributions of the variables are quite okay, but some improvements are possible. For example, for wgt, many values between 10 and 50 kilograms are observed in the real data, but barely any synthetic values fall in this range. Also, bmi values seem to be over- or underestimated; in fact, bmi values below 10 are highly unlikely and very problematic. All in all, this suggests that the relationship between hgt and wgt has not been modelled appropriately.
We further explore the utility of the synthetic data on a multivariate level. Since visual inspection is typically most insightful, we use the plot() function.
3. Plot the variables age, hgt, wgt, bmi, hc and tv against each other, by calling plot() on the subset of the data containing these variables. Do the same for the synthetic data.
Show code: Plot observed and synthetic data
plot(
  data[, c("age", "hgt", "wgt", "bmi", "hc", "tv")],
  main = "Observed data"
)
plot(
  syn_passive$syn[, c("age", "hgt", "wgt", "bmi", "hc", "tv")],
  main = "Synthetic data"
)
We see that the distributions of the synthetic data are much noisier than the distributions of the observed data. That is, even though linear relationships between variables are more or less preserved, the non-linear relationships are not captured very well, leading to noisier synthetic data.
Now that we have investigated fit-for-purpose utility, we must decide whether the synthetic data is of sufficient quality. By now, you probably suspect that this depends on the use case at hand. Here, we decide to attempt to further improve the utility of the synthetic data.
Further synthesis improvements
To improve the utility of the synthetic data, we use regression trees for the continuous variables (the "cart" method in synthpop), a non-parametric method that is better able to capture non-linear relationships between variables. In short, cart repeatedly splits the predictor space according to which predictor best predicts the outcome, leading to so-called leaves that collect subsets of the outcome with relatively similar values (see, e.g., Section 8.1.1 in James et al. 2023). Synthetic data is then obtained by passing the predictors of the synthetic cases through the learned splits, after which values are sampled from the leaves. Because this approach recycles observed data, we add some smoothing, so that observed values are not exactly reproduced in the synthetic data. This smoothing is often essential from a privacy perspective, because it ensures that synthetic values are not exactly equal to the observed data values.
4. Adjust the previously created method vector by replacing every instance of "normrank" with "cart", and call syn() with this new method vector and smoothing = "spline".
You may again use seed = 123 to replicate our results.
Show code: Synthetic data with cart
== "normrank"] <- "cart"
method[method
<- synthpop::syn(
syn_cart
data,method = method,
smoothing = "spline",
seed = 123,
print.flag = FALSE
)
Variable(s) bmi with passive synthesis: relationship does not hold in data.
Total of 723 case(s) where predictors do not give value present in data.
You might want to recompute the variable(s) in the original data.
We once more get the message that the relationship for bmi does not hold in the observed data, but we ignore this again.
5. Check whether the synthetic data looks okay using the compare() function that we used previously.
Show code: Compare synthetic and observed data
compare(
  syn_cart,
  data,
  utility.stats = NULL,
  utility.for.plot = NULL,
  print.flag = FALSE
)
Comparing percentages observed with synthetic

[Plots comparing the observed and synthetic marginal distributions for each variable]
Marginally, we see that the distributions of the variables are already better than with normrank. This looks promising.
6. Plot the variables age, hgt, wgt, bmi, hc and tv against each other, by calling plot() on syn_cart$syn.
Show code: Plot synthetic data with cart
plot(
  data[, c("age", "hgt", "wgt", "bmi", "hc", "tv")],
  main = "Observed data"
)
plot(
  syn_cart$syn[, c("age", "hgt", "wgt", "bmi", "hc", "tv")],
  main = "Synthetic data"
)
Using cart, the synthetic data looks substantially more like the observed data than the previously created synthetic data, because cart is better able to reproduce non-linear relationships between variables.
We now have two synthetic data sets that seem to have different levels of utility. We will further explore this using global and outcome-specific utility measures.
Global utility measures
We evaluate global utility by comparing the distributions of the synthetic data sets with the distribution of the observed data. This is typically done in one of two ways.
- The first is to compare the two distributions using some discrepancy measure. The smaller the discrepancy is, the more the synthetic data looks like the observed data.
- Another method is to train a classification model to distinguish between the observed and synthetic data, an approach commonly called the propensity score mean squared error (\(pMSE\); for more information about this method, see Snoke et al. 2018). The easier it is for the classifier to distinguish between the observed and synthetic data, the lower the global utility, because this implies that the observed and synthetic data differ in some respects.
We discuss the former approach, as it performed better for evaluating synthetic data utility according to Volker, Wolf, and Kesteren (2024).
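Although we focus on the density ratio approach below, the pMSE idea is simple enough to sketch in a few lines. The snippet uses a plain logistic regression classifier and is an illustration only; synthpop's utility.gen() provides a complete implementation.

# Stack observed and synthetic data with an indicator for synthetic records
combined <- rbind(
  cbind(data, synthetic = 0),
  cbind(syn_passive$syn, synthetic = 1)
)

# Propensity of being synthetic, estimated by logistic regression
prop  <- fitted(glm(synthetic ~ ., data = combined, family = binomial))
share <- mean(combined$synthetic)  # expected propensity if data are indistinguishable
pmse  <- mean((prop - share)^2)    # larger values indicate lower global utility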
Divergence estimation through density ratio estimation
One way to evaluate global utility is by comparing the distributions of the observed and synthetic data directly. This can be done using density ratio estimation, which, as the name suggests, attempts to estimate the ratio of the densities of two groups of samples (we use distribution and density interchangeably here, although some nuances exist). That is, we attempt to estimate \[
r(X) = \frac{f_\text{obs}(X)}{f_\text{syn}(X)}.
\] This could be done by estimating the densities \(f_\text{obs}(X)\) and \(f_\text{syn}(X)\) separately, but estimating \(r(X)\) directly typically yields better results. The R package densityratio has been developed for this purpose (Volker, Gonzalez Poses, and van Kesteren 2024). Based on the density ratio, we can estimate various divergence measures between the two data sets, which is an in-built functionality.
7. Compare the observed and synthetic data with the ulsif() function from the densityratio package. Do this for both synthetic data sets.
Note: Set numerator_data = data, and denominator_data = syn_passive$syn or denominator_data = syn_cart$syn, respectively.
Show code: Estimate density ratio with ulsif()
r_passive <- ulsif(data, syn_passive$syn)
r_cart    <- ulsif(data, syn_cart$syn)
The ulsif() function performs cross-validation to find the optimal model to estimate the density ratio, so that typically no model specification by the user is required.
8. Call plot() on the estimated model objects. This displays the distributions of the estimated density ratio values for the observed and synthetic data.
Hint: The more these distributions overlap, the higher the similarity between the observed and synthetic data.
Show code: Plot density ratio values
plot(r_passive)
plot(r_cart)
Warning: Negative estimated density ratios for 108 observation(s) converted to
0.01 before applying logarithmic transformation
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
You will see that the density ratio values of the syn_passive data overlap much less with the density ratio values of the observed data than those of the syn_cart data, indicating higher utility for the latter.
9. Call summary() on the density ratio objects to obtain the estimated Pearson divergence between the observed and synthetic data sets.
Show code: Summary of the density ratio objects
summary(r_passive)
summary(r_cart)
Call:
ulsif(df_numerator = data, df_denominator = syn_passive$syn)
Kernel Information:
Kernel type: Gaussian with L2 norm distances
Number of kernels: 200
Optimal sigma: 1.268872
Optimal lambda: 0.0379269
Optimal kernel weights: num [1:201] 0.00433 0.01786 0.09393 0.08489 0.868 ...
Pearson divergence between P(nu) and P(de): 0.9016
For a two-sample homogeneity test, use 'summary(x, test = TRUE)'.
Call:
ulsif(df_numerator = data, df_denominator = syn_cart$syn)
Kernel Information:
Kernel type: Gaussian with L2 norm distances
Number of kernels: 200
Optimal sigma: 4.554126
Optimal lambda: 0.6951928
Optimal kernel weights: num [1:201] 0.02124 0.011 0.00753 0.00421 0.00335 ...
Pearson divergence between P(nu) and P(de): 0.007073
For a two-sample homogeneity test, use 'summary(x, test = TRUE)'.
The Pearson divergence of the syn_cart data is substantially smaller than the Pearson divergence of the syn_passive data, indicating higher global utility of the former.
Finally, we can plot the estimated density ratios against the variables in the synthetic data, to see where the synthetic data does not fit the observed data.
10. Use plot_bivariate() on the fitted density ratio models, and set grid = TRUE and samples = "denominator" to only display the synthetic data points.
Hint: Very dark blue values indicate synthetic data points in regions that are rather scarce in the observed data, whereas dark red values indicate regions where the synthetic data is underrepresented relative to the observed data.
Show code: Plot density ratio values against the variables
plot_bivariate(r_passive, grid = TRUE, samples = "denominator")
plot_bivariate(r_cart, grid = TRUE, samples = "denominator")
Warning: Negative estimated density ratios for 108 observation(s) converted to
0.01 before applying logarithmic transformation
The visualizations again show that the normrank method yielded synthetic data in regions where hardly any was expected, and that the cart method produces synthetic data with higher utility.
Outcome-specific utility
For outcome-specific utility, we of course need at least one analysis that we expect data users will be interested in. For now, we assume that data users will be interested in predicting head circumference (hc) based on age, height (hgt), weight (wgt) and region (reg), where we include non-linear effects for age and hgt. In this case, high outcome-specific utility implies that the regression coefficients obtained from the synthetic data are close to the coefficients estimated from the observed data. One way to measure "closeness" in this context is via the confidence interval overlap, which quantifies to what extent the confidence intervals of the observed and synthetic coefficients overlap: \[
CIO = \frac{1}{2} \Big(
  \frac{\min (U_\text{obs}, U_\text{syn}) - \max (L_\text{obs}, L_\text{syn})}{U_\text{obs} - L_\text{obs}} +
  \frac{\min (U_\text{obs}, U_\text{syn}) - \max (L_\text{obs}, L_\text{syn})}{U_\text{syn} - L_\text{syn}}
\Big).
\] The terms \(U\) and \(L\) denote the upper and lower bounds of the \(95\%\) confidence intervals for the observed and synthetic data. Visually, the \(CIO\) can be depicted as follows:
The length between the dashed vertical lines is the quantity in the numerator, whereas the lengths of the two horizontal lines are in the denominators. If the two confidence intervals are almost the same, the confidence interval overlap is large, and the difference between coefficients is relatively small compared to the uncertainty around the coefficients. Conversely, if the overlap is small (or even negative), the coefficients are relatively far apart, relative to the random variation one would expect.
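As a quick illustration, the overlap for a single coefficient can be computed by hand. The helper below is hypothetical and not part of synthpop; compare.fit.synds() in the next exercise computes this for all coefficients at once.

# Confidence interval overlap for one coefficient, following the formula above
cio <- function(l_obs, u_obs, l_syn, u_syn) {
  num <- min(u_obs, u_syn) - max(l_obs, l_syn)
  0.5 * (num / (u_obs - l_obs) + num / (u_syn - l_syn))
}

cio(l_obs = 0.5, u_obs = 1.5, l_syn = 0.4, u_syn = 1.4)  # large overlap: 0.9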
11. Fit a linear regression model with the above specification using lm.synds() for both synthetic data sets. Compare the results with the observed data using compare.fit.synds().
Show code: Regression models
fit_passive <- lm.synds(
  hc ~ age + I(age^2) + hgt + I(hgt^2) + wgt + reg,
  data = syn_passive
)

fit_cart <- lm.synds(
  hc ~ age + I(age^2) + hgt + I(hgt^2) + wgt + reg,
  data = syn_cart
)

compare.fit.synds(fit_passive, data = data)
compare.fit.synds(fit_cart, data = data)
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the synthpop package.
Please report the issue to the authors.
Call used to fit models to the data:
lm.synds(formula = hc ~ age + I(age^2) + hgt + I(hgt^2) + wgt +
reg, data = syn_passive)
Differences between results based on synthetic and observed data:
Synthetic Observed Diff Std. coef diff CI overlap
(Intercept) 33.3374137880 7.583220128 25.754193660 23.6277044 -5.0275864
age -0.1867533282 -1.929989279 1.743235951 14.9660859 -2.8179492
I(age^2) 0.0006288912 0.063148825 -0.062519934 -15.4874290 -2.9509474
hgt 0.1562515139 0.663979162 -0.507727648 -25.2331041 -5.4371346
I(hgt^2) -0.0005057012 -0.001868775 0.001363074 19.5251952 -3.9810087
wgt 0.2206018121 0.068799183 0.151802629 14.9315498 -2.8091388
regeast 0.6147377070 -0.008439947 0.623177654 2.7604156 0.2957994
regwest -0.3490120497 -0.074055195 -0.274956854 -1.2924879 0.6702776
regsouth 0.0327038384 -0.034981195 0.067685034 0.3076278 0.9215221
regcity 2.5014409312 0.171141483 2.330299449 8.7308986 -1.2273110
Measures for one synthesis and 10 coefficients
Mean confidence interval overlap: -2.236348
Mean absolute std. coef diff: 12.68625
Mahalanobis distance ratio for lack-of-fit (target 1.0): 150.27
Lack-of-fit test: 1502.707; p-value 0 for test that synthesis model is
compatible with a chi-squared test with 10 degrees of freedom.
Confidence interval plot:
Call used to fit models to the data:
lm.synds(formula = hc ~ age + I(age^2) + hgt + I(hgt^2) + wgt +
reg, data = syn_cart)
Differences between results based on synthetic and observed data:
Synthetic Observed Diff Std. coef diff CI overlap
(Intercept) 8.199799771 7.583220128 0.6165796428 0.56566949 0.8556939
age -1.975133292 -1.929989279 -0.0451440124 -0.38757184 0.9011278
I(age^2) 0.063332400 0.063148825 0.0001835746 0.04547507 0.9883990
hgt 0.647387051 0.663979162 -0.0165921101 -0.82459650 0.7896399
I(hgt^2) -0.001733560 -0.001868775 0.0001352159 1.93688353 0.5058880
wgt 0.051062964 0.068799183 -0.0177362187 -1.74456288 0.5549503
regeast 0.004708119 -0.008439947 0.0131480658 0.05824042 0.9851425
regwest -0.104533244 -0.074055195 -0.0304780487 -0.14326797 0.9634514
regsouth -0.048260585 -0.034981195 -0.0132793900 -0.06035470 0.9846031
regcity 0.006345831 0.171141483 -0.1647956513 -0.61743744 0.8424876
Measures for one synthesis and 10 coefficients
Mean confidence interval overlap: 0.8371383
Mean absolute std. coef diff: 0.638406
Mahalanobis distance ratio for lack-of-fit (target 1.0): 0.82
Lack-of-fit test: 8.212925; p-value 0.608 for test that synthesis model is
compatible with a chi-squared test with 10 degrees of freedom.
Confidence interval plot:
The results show that the syn_cart data yields substantially higher outcome-specific utility: the estimated regression coefficients are much closer to the estimates from the observed data than the syn_passive estimates. Moreover, the confidence intervals of the observed data overlap substantially with the confidence intervals of syn_cart, suggesting that the differences between the estimates are small relative to the uncertainty around the regression coefficients.
One tip for evaluating outcome-specific utility: if you expect data users to be solely interested in analyses from the class of general linear models (like \(t\)-tests, ANOVAs and linear regression models) with merely main effects, it is sufficient to model the means and the variance-covariance matrix of the observed data sufficiently accurately. From these characteristics, all regression coefficients and mean differences can be calculated, as the sketch below illustrates.
Statistical disclosure control
Synthetic data can provide a relatively safe framework for sharing data. However, some risks remain, and it is important to evaluate these risks. For example, it can happen that the synthesis models were so complex that the synthetic records are very similar or even identical to the original records, which can lead to privacy breaches.
Synthetic data by itself does not provide any formal privacy guarantees. Such guarantees can be incorporated, for example by using differentially private synthesis methods. However, these methods are not yet widely available in R. If privacy is not built in by design, it remains important to inspect the synthetic data for potential risks. Especially if you're not entirely sure, it is better to stay on the safe side: use relatively simple, parametric models, check for outliers, and potentially add additional noise to the synthetic data. See also Chapter 4 in the book Synthetic Data for Official Statistics.
We focus the evaluation of disclosure again on identity disclosure and attribute disclosure, as discussed in the Statistical Disclosure Control section. In short, identity disclosure means that it is possible to identify records in the synthetic data from a set of known characteristics. Identity disclosure is typically not possible if the entire dataset consists solely of synthetic values, but it is still an important component of attribute disclosure. Attribute disclosure occurs when it is possible to infer new information about someone from a set of known characteristics. For example, if all individuals above a certain age in some region have a given disease, one might conclude that someone satisfying these constraints also has that disease.
We evaluate identity disclosure using the replicated uniques measure, which refers to unique observations in the synthetic data that were also unique cases in the observed data (on a set of identifying variables; Raab 2024). These identifying variables must be specified by the synthetic data modeller. The idea is that observations with a unique combination of identifying variables can be identified by a third party. If such a combination re-occurs in the synthetic data, and corresponds to an actual observation, a third party might conclude that a particular individual is in the sample. At the same time, if the synthetic values on at least one of the non-identifying, but potentially sensitive, variables are also the same or very close to the actual values, this might allow for attribute disclosure. These replicated uniques can be removed from the synthetic data, which might not reduce the utility of the synthetic data too much.
Attribute disclosure is measured using a prediction model, and essentially asks whether we can correctly predict some target variable from a set of identifying variables. If this is possible in general, knowing someone's values on the identifying variables allows one to infer potentially sensitive information on the target variable. In synthpop, this procedure is implemented as follows (Raab 2024). First, we check which combinations of the identifying variables in the observed data also occur in the synthetic data. Subsequently, we evaluate whether the records with the same values on the identifying variables in the synthetic data also have the same (or very similar) values on the target variable. Finally, we check whether this value corresponds to the actual value of the target variable in the original data. The proportion of records that meets each of these criteria is referred to as DiSCO (Disclosive in Synthetic, Correct in Original).
Both methods (replicated uniques and DiSCO) can be evaluated in synthpop using the multi.disclosure() function.
12. Call the function multi.disclosure() on the syn_passive and syn_cart data sets. Use age, hgt and reg as identifying variables (the keys argument).
Show code: Statistical disclosure control
multi.disclosure(syn_passive, data, keys = c("age", "hgt", "reg"))
multi.disclosure(syn_cart, data, keys = c("age", "hgt", "reg"))
------------------ 1 wgt -------------------
-------------------Synthesis 1 --------------------
Table for target wgt from GT alone with keys has 523 rows 748 colums.
Table for target wgt from GT & SD with all key combinations has 523 rows 1493 colums.
------------------ 2 bmi -------------------
-------------------Synthesis 1 --------------------
Table for target bmi from GT alone with keys has 524 rows 748 colums.
Table for target bmi from GT & SD with all key combinations has 1260 rows 1493 colums.
------------------ 3 hc -------------------
-------------------Synthesis 1 --------------------
Table for target hc from GT alone with keys has 204 rows 748 colums.
Table for target hc from GT & SD with all key combinations has 204 rows 1493 colums.
------------------ 4 gen -------------------
-------------------Synthesis 1 --------------------
Table for target gen from GT alone with keys has 5 rows 748 colums.
Table for target gen from GT & SD with all key combinations has 5 rows 1493 colums.
------------------ 5 phb -------------------
-------------------Synthesis 1 --------------------
Table for target phb from GT alone with keys has 6 rows 748 colums.
Table for target phb from GT & SD with all key combinations has 6 rows 1493 colums.
------------------ 6 tv -------------------
-------------------Synthesis 1 --------------------
Table for target tv from GT alone with keys has 18 rows 748 colums.
Table for target tv from GT & SD with all key combinations has 18 rows 1493 colums.
Disclosure risk for 748 records in the original data
Identity disclosure measures
from keys: age hgt reg
For original ( UiO ) 100 %
For synthetic ( repU ) 0.4 %.
Table of attribute disclosure measures for age hgt reg
Original measure is Dorig and synthetic measure is DiSCO
Variables Ordered by synthetic disclosure measure
attrib.orig attrib.syn check1 Npairs check2
1 wgt 100 0.00 0
2 bmi 100 0.00 0
3 hc 100 0.00 0
4 gen 100 0.13 0
5 phb 100 0.13 0
6 tv 100 0.13 0
------------------ 1 wgt -------------------
-------------------Synthesis 1 --------------------
Table for target wgt from GT alone with keys has 523 rows 748 colums.
Table for target wgt from GT & SD with all key combinations has 1271 rows 1496 colums.
------------------ 2 bmi -------------------
-------------------Synthesis 1 --------------------
Table for target bmi from GT alone with keys has 524 rows 748 colums.
Table for target bmi from GT & SD with all key combinations has 1272 rows 1496 colums.
------------------ 3 hc -------------------
-------------------Synthesis 1 --------------------
Table for target hc from GT alone with keys has 204 rows 748 colums.
Table for target hc from GT & SD with all key combinations has 952 rows 1496 colums.
------------------ 4 gen -------------------
-------------------Synthesis 1 --------------------
Table for target gen from GT alone with keys has 5 rows 748 colums.
Table for target gen from GT & SD with all key combinations has 5 rows 1496 colums.
------------------ 5 phb -------------------
-------------------Synthesis 1 --------------------
Table for target phb from GT alone with keys has 6 rows 748 colums.
Table for target phb from GT & SD with all key combinations has 6 rows 1496 colums.
------------------ 6 tv -------------------
-------------------Synthesis 1 --------------------
Table for target tv from GT alone with keys has 18 rows 748 colums.
Table for target tv from GT & SD with all key combinations has 25 rows 1496 colums.
Disclosure risk for 748 records in the original data
Identity disclosure measures
from keys: age hgt reg
For original ( UiO ) 100 %
For synthetic ( repU ) 0 %.
Table of attribute disclosure measures for age hgt reg
Original measure is Dorig and synthetic measure is DiSCO
Variables Ordered by synthetic disclosure measure
attrib.orig attrib.syn check1 Npairs check2
1 wgt 100 0 0
2 bmi 100 0 0
3 hc 100 0 0
4 gen 100 0 0
5 phb 100 0 0
6 tv 100 0 0
From the output, we can see that disclosure risk is very low for both synthesis methods, according to the criteria defined here. For the syn_passive data, there are 3 replicated uniques (0.4% of the 748 records), and for the syn_cart data, there are none. The figures show that the attribute disclosure measures are essentially zero in the synthetic data.
After evaluating disclosure risks, it is possible to post-process the synthetic data. This procedure can be streamlined using the sdc() function in synthpop. For example, one might remove the replicated uniques from the synthetic data using the rm.replicated.uniques argument in sdc(). Additionally, it is possible to cap outliers using the bottom.top.coding argument or to smooth variables using smooth.vars.
In our case, we already applied smoothing, and there are not really any extreme outliers present. Therefore, we only remove the replicated uniques from the data.
13. Apply the sdc() function with keys = c("age", "hgt", "reg"), set rm.replicated.uniques = TRUE, and store the results in syn_passive_sdc.
Hint: This function updates the synthetic data object; make sure to release the synthetic data from the updated object if you use this function.
Show code: Apply statistical disclosure control methods
syn_passive_sdc <- sdc(
  syn_passive,
  data = data,
  keys = c("age", "hgt", "reg"),
  rm.replicated.uniques = TRUE
)
Conclusion
In these practicals, you have learned to create and evaluate synthetic data. We discussed modelling deterministic relationships, refining synthesis models based on utility measures, comparing utility of synthetic data sets, and evaluating remaining disclosure risks. Remember that if you plan to release synthetic data for open science purposes, it is better to stay on the safe side with respect to privacy. When in doubt, choose the option that yields the smallest risk. Also, make clear that the synthetic data is not real data, and that new analyses should thus be conducted with caution. Unless the synthesis model is correctly specified, new analyses are likely to give biased results, and results should thus not be overinterpreted.