Perturbative Techniques
Not only masking of values, but distortion of values
Explain the idea and advantages more
Add a pro and con list?
danger of reverse-engineering the perturbation technique applied
goal: statistics run on the perturbed dataset yield the same (or nearly the same) results as statistics run on the original dataset
Select most relevant techniques (swapping, noise?); potentially shorten or delete others
Add examples to each
Techniques following Carvalho et al. (2023)
Swapping
exchanging values on certain variables between participants
types of swapping
record swapping (also known as data swapping): for categorical variables; the values of selected variables (e.g., gender, country of residence) are swapped between records; t-order equivalence means that the frequency tables for t variables are not changed (e.g., 1-order equivalence: same number of males and females as before; 2-order equivalence: same number of males and females from Switzerland and from Germany, respectively)
rank swapping: for continuous variables; values are swapped only between records whose ranks lie within a certain range of each other, to limit distortion of the data
advantages
removes relationship between record and individual
can be used in one or more sensitive variables without disturbing the non-sensitive variables
provides protection to rare and unique values
not limited to the type of variable
disadvantages
may produce a number of cases with unusual combinations of values
non-random (targeted) swapping requires additional effort
can severely distort statistics for subgroups
not useful against attribute disclosure
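The two types of swapping can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the implementation used in Carvalho et al. (2023) or in any package: the toy data, the function names `record_swap` and `rank_swap`, and the parameters `share` and `p` are hypothetical.

```r
set.seed(123)

# Toy data (hypothetical values, for illustration only)
df <- data.frame(
  gender = c("f", "m", "f", "m", "f", "m", "f", "m", "f", "m"),
  income = c(2100, 3500, 2800, 4200, 3900, 5100, 2500, 4700, 3300, 6000)
)

# Record swapping: permute the values of a categorical variable among
# a random subset of records. The frequency table of the variable
# itself stays the same (1-order equivalence).
record_swap <- function(x, share = 0.4) {
  k <- max(2, floor(length(x) * share))
  idx <- sample(length(x), k)
  x[idx] <- x[sample(idx)]
  x
}

# Rank swapping: each value is swapped with a randomly chosen,
# not yet swapped value whose rank is at most p percent of n away.
rank_swap <- function(x, p = 20) {
  n <- length(x)
  window <- max(1, floor(n * p / 100))
  ord <- order(x)            # ord[r] = position of the value with rank r
  ranks <- seq_len(n)
  swapped <- rep(FALSE, n)   # indexed by rank
  out <- x
  for (i in ranks) {
    if (swapped[i]) next
    candidates <- ranks[ranks > i & ranks <= i + window & !swapped]
    if (length(candidates) == 0) next
    j <- candidates[sample.int(length(candidates), 1)]
    out[ord[c(i, j)]] <- x[ord[c(j, i)]]
    swapped[c(i, j)] <- TRUE
  }
  out
}

df$gender_swapped <- record_swap(df$gender)
df$income_swapped <- rank_swap(df$income, p = 20)

table(df$gender)          # frequencies before swapping
table(df$gender_swapped)  # same frequencies after swapping
```

Because both functions only move existing values around, the univariate distribution of each swapped variable is preserved, while relationships with other variables (e.g., income by gender) may change.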
Re-sampling
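idea: draw several independent samples of a variable's values and average them rank by rank
draw the independent samples (bootstrap) and sort each of them
use the average of the lowest values across the samples for the record with the lowest original value, then the average of the second-lowest values for the record with the second-lowest original value…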
check for correct understanding
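One possible reading of re-sampling, which is an assumption here and still needs to be checked against Carvalho et al. (2023), is: draw several bootstrap samples of a variable, sort each sample, average the values that share the same rank across samples, and give each record the average that belongs to its original rank. The toy values and the number of samples are hypothetical.

```r
set.seed(1)

x <- c(23, 41, 35, 58, 29, 47)   # toy values (hypothetical)
n <- length(x)
t_samples <- 50                  # number of bootstrap samples (assumption)

# Each column is one sorted bootstrap sample of size n
samples <- replicate(t_samples, sort(sample(x, n, replace = TRUE)))

# Average across samples for each rank (row)
rank_means <- rowMeans(samples)

# Assign each record the average belonging to its original rank
masked <- rank_means[rank(x, ties.method = "first")]

data.frame(original = x, masked = masked)
```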
Noise
also known as randomization
idea: add a random value to each original value (additive noise) or multiply each original value by a random factor (multiplicative noise)
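noise can be uncorrelated (drawn independently for each variable) or correlated (drawn so that the correlations between the original variables are approximately preserved)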
transformations after adding the noise are possible
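Additive and multiplicative noise can be sketched in base R as follows. The simulated income variable and the noise levels (10% of the standard deviation, 5% relative noise) are assumptions chosen for illustration only.

```r
set.seed(42)

# Simulated income variable (hypothetical data)
income <- round(rlnorm(200, meanlog = 10, sdlog = 0.5))

# Additive noise: add a normally distributed value with mean 0.
# The noise level is set to 10% of the standard deviation (assumption).
noise_sd <- 0.1 * sd(income)
income_add <- income + rnorm(length(income), mean = 0, sd = noise_sd)

# Multiplicative noise: multiply by a random factor close to 1.
income_mult <- income * rnorm(length(income), mean = 1, sd = 0.05)

# Compare basic statistics before and after
c(original = mean(income), additive = mean(income_add),
  multiplicative = mean(income_mult))
c(original = sd(income), additive = sd(income_add),
  multiplicative = sd(income_mult))
```

Larger noise gives more protection but moves the statistics further away from those of the original data.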
differential privacy methods usually rely on adding calibrated noise
adds noise to data, leading to plausible deniability for any individual
results of an analysis stay approximately the same despite the noise
results of an analysis stay approximately the same, regardless of whether any one person is included in the dataset or not
diffpriv: An R Package for Easy Differential Privacy
“Even if the attacker already suspects X is the only possible HIV case in the dataset, the data release should not confirm or deny that suspicion.”
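The idea behind differential privacy can be illustrated with the Laplace mechanism for a counting query. This is a base-R sketch and does not use the diffpriv package API; the helper names `rlaplace` and `dp_count`, the toy data, and the value of `epsilon` are assumptions.

```r
set.seed(2023)

# Draw Laplace(0, scale) noise as the difference of two exponentials
rlaplace <- function(n, scale) {
  rexp(n, rate = 1 / scale) - rexp(n, rate = 1 / scale)
}

# Noisy count: adding or removing one person changes a count by at
# most 1 (sensitivity 1), so the noise scale is 1 / epsilon.
dp_count <- function(x, epsilon) {
  sum(x) + rlaplace(1, scale = 1 / epsilon)
}

# Toy indicator variable: TRUE = person has the sensitive attribute
has_condition <- c(rep(TRUE, 12), rep(FALSE, 88))

true_count  <- sum(has_condition)
noisy_count <- dp_count(has_condition, epsilon = 0.5)
c(true = true_count, noisy = noisy_count)
```

A smaller `epsilon` means more noise and stronger protection; a larger `epsilon` keeps the released count closer to the true count.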
Microaggregation
idea: create groups of similar values and replace the values in each group with an aggregate value of that group (e.g., mean, median)
works better when groups are more homogeneous
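A simple univariate version of microaggregation can be sketched in base R: sort the values, form groups of `k` neighbouring values, and replace each value with its group mean. The function name `microaggregate_sketch`, the group size, and the toy values are assumptions; dedicated packages offer more refined (multivariate) methods.

```r
# Univariate microaggregation with fixed group size k.
# Leftover values that do not fill a whole group join the last group.
microaggregate_sketch <- function(x, k = 3) {
  n <- length(x)
  ord <- order(x)
  n_groups <- max(1, n %/% k)
  group <- pmin((seq_len(n) - 1) %/% k + 1, n_groups)
  out <- numeric(n)
  out[ord] <- ave(x[ord], group)   # ave() returns the group means
  out
}

age <- c(21, 34, 35, 36, 52, 19, 48, 50, 23, 67)  # toy values
age_micro <- microaggregate_sketch(age, k = 3)

data.frame(original = age, microaggregated = age_micro)
```

Each released value is shared by at least `k` records, and the overall mean stays the same because every group is replaced by its own mean.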
Rounding
- round values to multiples of a chosen rounding base (e.g., to the nearest 5 or 10)
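A minimal sketch of rounding to a chosen base in R; the helper name `round_to_base` and the base of 5 are assumptions for illustration.

```r
# Round each value to the nearest multiple of `base`
round_to_base <- function(x, base = 5) {
  round(x / base) * base
}

age <- c(23, 37, 41)
age_rounded <- round_to_base(age, base = 5)
age_rounded
# 25 35 40
```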
PRAM
- Post RAndomisation Method
- values of a categorical variable are recoded to another category with a pre-specified probability (given by a transition matrix)
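PRAM can be sketched in base R with a transition matrix whose row for a given original category holds the probabilities of the released category. The function name `pram_sketch`, the categories, and the probabilities are assumptions for illustration; this is not the interface of any package.

```r
set.seed(99)

cats <- c("DE", "CH", "AT")

# Transition matrix: a value is kept with probability 0.8 and
# recoded to each of the other categories with probability 0.1.
P <- matrix(c(0.8, 0.1, 0.1,
              0.1, 0.8, 0.1,
              0.1, 0.1, 0.8),
            nrow = 3, byrow = TRUE,
            dimnames = list(cats, cats))

# For every record, draw the released category from the row of P
# that belongs to the record's original category.
pram_sketch <- function(x, P) {
  categories <- colnames(P)
  vapply(x, function(v) sample(categories, 1, prob = P[v, ]),
         character(1), USE.NAMES = FALSE)
}

country <- sample(cats, 20, replace = TRUE)   # toy data
country_pram <- pram_sketch(country, P)

table(original = country, released = country_pram)
```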
Shuffling
- variation of swapping
- generate new values for the sensitive variable that have distributional properties similar to the original values
- reorder the original sensitive values according to the ranks of the newly generated values
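The two steps of shuffling can be sketched in base R. This is a strongly simplified illustration: generating the new values from a linear regression on a non-sensitive variable is an assumption made here, and the simulated data and variable names are hypothetical.

```r
set.seed(11)

# Simulated data: age is non-sensitive, income is sensitive
n <- 50
df <- data.frame(age = round(runif(n, 20, 65)))
df$income <- 1000 + 60 * df$age + rnorm(n, mean = 0, sd = 500)

# Step 1: generate new sensitive values with similar distributional
# properties (here: regression prediction plus random residual noise)
fit <- lm(income ~ age, data = df)
synthetic <- fitted(fit) + rnorm(n, mean = 0, sd = sd(residuals(fit)))

# Step 2: reorder the ORIGINAL income values according to the ranks
# of the synthetic values
df$income_shuffled <- sort(df$income)[rank(synthetic, ties.method = "first")]

# The released values are the original values in a different order
summary(df$income)
summary(df$income_shuffled)
cor(df$age, df$income)
cor(df$age, df$income_shuffled)
```

Only original values are released, so the univariate distribution of the sensitive variable is unchanged, while its relationship with the non-sensitive variable is kept approximately.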
Explain how to ensure that the statistics are the same (or reference utility section)
Learning Objective
- After completing this part of the tutorial, you will be able to apply selected perturbative techniques in R.
Exercise
Apply one or two techniques for certain variables in R
Check statistics before and after
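A possible starting point for checking statistics before and after perturbation is a small helper that puts a few summary statistics side by side. The function name `compare_stats`, the choice of statistics, and the simulated data are assumptions for illustration.

```r
# Compare mean, standard deviation and median of a variable
# before and after perturbation
compare_stats <- function(original, perturbed) {
  stats <- function(v) c(mean(v), sd(v), median(v))
  data.frame(
    statistic = c("mean", "sd", "median"),
    original  = stats(original),
    perturbed = stats(perturbed)
  )
}

set.seed(7)
age <- round(runif(100, 18, 80))                  # simulated data
age_noisy <- age + rnorm(100, mean = 0, sd = 2)   # additive noise

res <- compare_stats(age, age_noisy)
res
```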
Resources, Links, Examples
- adding noise with the sdcMicro R package
To Do List
- Research best practices on ensuring that statistics stay similar