Perturbative Techniques

The core idea behind perturbative techniques is to introduce controlled changes to data values. Unlike non-perturbative techniques that hide or remove information, perturbative methods keep all records and variables in the dataset but alter the values themselves. The key requirement is that the changes should be large enough to prevent re-identification but small enough that the overall statistical properties of the data (means, variances, correlations) are preserved.

The techniques below follow the classification of Carvalho et al. (2023).

Examples of Perturbative Techniques

Swapping

  • exchanging values on certain variables between participants

  • types of swapping

    • record swapping (also known as data swapping): for categorical variables; swapping the values on selected variables (e.g., gender, country of residence); t-order equivalence means that the frequency tables for any t variables remain unchanged (e.g., 1-order equivalence: same number of males and females as before; 2-order equivalence: same number of males and females from Switzerland and Germany, respectively)

    • rank swapping: for continuous variables; swapping values only within a certain range of ranks to limit the distortion of the data

  • advantages

    • breaks the link between a record and the individual it belongs to

    • can be applied to one or more sensitive variables without disturbing the non-sensitive variables

    • provides protection to rare and unique values

    • not limited to the type of variable

  • disadvantages

    • may produce records with unusual or implausible combinations of values

    • non-random (targeted) swapping requires additional manual effort

    • can severely distort statistics for subgroups

    • offers little protection against attribute disclosure
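
The mechanics of rank swapping can be sketched in a few lines of base R. This is an illustrative toy implementation (a fixed swap window of p percent of the ranks, pairing each odd-ranked value with a nearby partner), not the algorithm used by sdcMicro's rankSwap():

```r
# Toy rank swapping: each value may be swapped with a partner whose
# rank differs by at most p * n (illustration only, not sdcMicro's rankSwap())
rank_swap <- function(x, p = 0.05) {
  n <- length(x)
  ord <- order(x)                    # record positions in ascending rank order
  window <- max(1, round(p * n))     # maximum allowed rank distance
  swapped <- x
  for (i in seq(1, n - 1, by = 2)) {
    j <- min(n, i + sample(1:window, 1))   # partner within the rank window
    tmp <- swapped[ord[i]]
    swapped[ord[i]] <- swapped[ord[j]]
    swapped[ord[j]] <- tmp
  }
  swapped
}

set.seed(1)
income <- round(rnorm(100, mean = 50000, sd = 10000))
income_swapped <- rank_swap(income, p = 0.05)

# the multiset of values is unchanged; only their assignment to records moves
identical(sort(income), sort(income_swapped))   # TRUE
```

Because only the assignment of values to records changes, univariate statistics of the swapped variable are preserved exactly; what suffers are correlations with unswapped variables.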

Re-sampling

  • idea: replace values by averages over independent bootstrap samples

  • draw t independent samples (with replacement) from the original values and sort each of them

  • replace the value of rank i in the original data by the average of the i-th smallest values across the t samples

To be more precise: every record keeps its rank, but its value becomes an averaged order statistic rather than an observed value. The overall distribution is closely preserved, while individual values no longer correspond to real observations.
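
A minimal sketch of this procedure in base R, assuming the order-statistic variant described above (t bootstrap samples, averaged rank by rank):

```r
# Toy re-sampling perturbation: replace each value by the average of the
# order statistics of t bootstrap samples at the same rank
resample_perturb <- function(x, t = 3) {
  n <- length(x)
  # t independent bootstrap samples, each sorted ascending (n x t matrix)
  samples <- replicate(t, sort(sample(x, n, replace = TRUE)))
  avg_order_stats <- rowMeans(samples)   # average i-th smallest value
  out <- numeric(n)
  out[order(x)] <- avg_order_stats       # record of rank i gets i-th average
  out
}

set.seed(42)
x <- rexp(50, rate = 1 / 1000)           # e.g., skewed income-like values
x_pert <- resample_perturb(x)
# the ranking of records is preserved, but no value is an original observation
```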

Noise

  • also known as randomization

  • idea: add a random value to each observation (additive noise) or multiply each observation by a random factor (multiplicative noise)

  • the noise can be correlated or uncorrelated with the original values

  • transformations after adding the noise are possible (e.g., rescaling to restore the original variance)

  • differential privacy mechanisms typically work by adding carefully calibrated noise
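
A minimal sketch of uncorrelated additive and multiplicative noise in base R (normal noise is an assumption here; other distributions are possible, and sdcMicro's addNoise() offers more refined variants):

```r
set.seed(7)
income <- rnorm(1000, mean = 50000, sd = 10000)

# additive noise: x + e with e ~ N(0, (0.1 * sd(x))^2)
income_add <- income + rnorm(1000, mean = 0, sd = 0.1 * sd(income))

# multiplicative noise: x * e with e ~ N(1, 0.05^2)
income_mult <- income * rnorm(1000, mean = 1, sd = 0.05)

# means stay close, individual values no longer match the originals
c(original = mean(income), additive = mean(income_add))
```

Note that plain additive noise inflates the variance; this is one motivation for the post-hoc transformations mentioned above.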

Tip: Differential Privacy
  • adds noise to data or query results, giving every individual plausible deniability

  • analysis results stay approximately the same regardless of the particular noise that was drawn

  • analysis results stay approximately the same whether or not any single person's data is included

  • diffpriv: An R Package for Easy Differential Privacy

  • “Even if the attacker already suspects X is the only possible HIV case in the dataset, the data release should not confirm or deny that suspicion.”
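
The classic building block behind many differential privacy methods is the Laplace mechanism, sketched below for a count query. This is a textbook illustration (inverse-CDF sampling of Laplace noise), not the diffpriv API:

```r
# Laplace mechanism: release true_value + Laplace(0, sensitivity / epsilon)
laplace_mechanism <- function(true_value, sensitivity, epsilon) {
  scale <- sensitivity / epsilon
  u <- runif(1, min = -0.5, max = 0.5)
  noise <- -scale * sign(u) * log(1 - 2 * abs(u))   # Laplace(0, scale) draw
  true_value + noise
}

set.seed(3)
hiv_count <- 1   # true count in the data; sensitivity of a count query is 1
noisy_count <- laplace_mechanism(hiv_count, sensitivity = 1, epsilon = 0.5)
# the released count neither confirms nor denies any individual's presence
```

Smaller epsilon means more noise and stronger privacy; the sensitivity captures how much one person can change the query result (1, for a count).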

Microaggregation

  • idea: form groups of similar records and replace their values with an aggregate value (e.g., the group mean or median)

  • works better the more homogeneous the groups are
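
A toy univariate version (fixed groups of k consecutive values in sorted order; sdcMicro's microaggregation() implements more sophisticated grouping methods such as MDAV):

```r
# Toy univariate microaggregation: sort, cut into groups of k, replace
# each value with its group mean
microagg <- function(x, k = 5) {
  ord <- order(x)
  groups <- ceiling(seq_along(ord) / k)   # 1,1,...,1, 2,2,... in sorted order
  means <- ave(x[ord], groups)            # group means, one per element
  out <- numeric(length(x))
  out[ord] <- means
  out
}

income <- c(30, 31, 33, 50, 52, 55, 90, 95, 99, 100) * 1000
microagg(income, k = 5)
# each record receives its group mean: 39200 (x5) and 87800 (x5)
```

With k = 5, at least five records always share the same published value, which is exactly what thwarts singling out.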

Rounding

  • replace each value with the nearest value from a coarser set, typically the nearest multiple of a rounding base (e.g., income to the nearest 1,000)
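
For example, rounding to a base b replaces each value with the nearest multiple of b:

```r
# round each value to the nearest multiple of `base`
round_to_base <- function(x, base = 1000) round(x / base) * base

round_to_base(c(48250, 51790, 49999))   # → 48000 52000 50000
```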

PRAM

  • Post RAndomisation Method
  • values on a categorical variable are recoded with a certain probability, governed by a transition matrix specifying how likely each category is to be published as each other category
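
A minimal sketch with a hypothetical 2×2 transition matrix, where each category is kept with probability 0.9 and flipped with probability 0.1 (sdcMicro provides pram() for real use):

```r
set.seed(5)
region <- sample(c("urban", "rural"), 200, replace = TRUE)

# P[i, j] = probability that category i is published as category j
P <- matrix(c(0.9, 0.1,
              0.1, 0.9), nrow = 2, byrow = TRUE,
            dimnames = list(c("urban", "rural"), c("urban", "rural")))

# recode each value according to its row of the transition matrix
pram_recode <- function(x, P) {
  vapply(x, function(v) sample(colnames(P), 1, prob = P[v, ]), character(1))
}

region_pram <- pram_recode(region, P)
mean(region != region_pram)   # roughly 10% of values end up recoded
```

Because the transition matrix is known, analysts can correct aggregate frequency estimates for the deliberate misclassification.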

Shuffling

  • variation of swapping
  • generate synthetic sensitive values with similar distributional properties (typically modelled conditionally on non-sensitive variables)
  • reorder the original sensitive values according to the ranks of the synthetic values
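
A toy version, assuming a simple log-normal model for the synthetic data (in practice the synthetic values would be generated conditionally on the non-sensitive variables):

```r
set.seed(11)
income <- rlnorm(100, meanlog = 10, sdlog = 0.5)

# 1. generate synthetic data with similar distributional properties
synthetic <- rlnorm(100, meanlog = mean(log(income)),
                         sdlog  = sd(log(income)))

# 2. reorder the original values so their ranks follow the synthetic ranks
shuffled <- sort(income)[rank(synthetic)]

# every published value is a real one, but detached from its record
identical(sort(shuffled), sort(income))   # TRUE
```

As with swapping, the marginal distribution of the sensitive variable is preserved exactly, since only the assignment of values to records changes.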

Keeping Utility

After applying any perturbative technique, you should compare key statistics (means, standard deviations, correlations, regression coefficients) between the original and perturbed datasets. If the differences are small enough for your purposes, the perturbation has preserved utility. We cover more formal utility measures in the chapter on Balancing Utility and Privacy.
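
A quick way to run such a check in base R, using hypothetical original and perturbed vectors:

```r
set.seed(9)
original  <- rnorm(500, mean = 50000, sd = 10000)
perturbed <- original + rnorm(500, mean = 0, sd = 1000)  # e.g., additive noise

# compare key statistics before and after perturbation
utility_check <- function(orig, pert) {
  c(mean_diff = mean(pert) - mean(orig),   # shift in the mean
    sd_ratio  = sd(pert) / sd(orig),       # inflation of the spread
    cor       = cor(orig, pert))           # agreement of individual values
}

round(utility_check(original, perturbed), 3)
```

If the mean shift is negligible, the sd ratio is close to 1, and the correlation is high, the perturbation has preserved utility for most analyses.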

Pro and Contra of Using Perturbative Techniques

Pro:

  • all records remain in the dataset (no deletion)

  • statistical properties can be preserved

  • flexible: different noise levels or group sizes let you tune the privacy-utility trade-off

Con:

  • individual values are no longer accurate (not suitable if exact values matter)

  • risk of reverse-engineering the perturbation if the method is known

  • can distort subgroup statistics if applied too aggressively

Exercise

Perturbative techniques modify values rather than removing or generalising them. The data still looks complete and realistic, but individual values have been altered enough that re-identification becomes unreliable. Two common methods:

  • Microaggregation — records are grouped by similarity and values within each group are replaced by the group mean. Individual values are obscured but aggregate statistics are preserved.
  • Adding noise — small random amounts are added to numeric values. The distribution stays plausible but exact values are no longer trustworthy.

Both are available directly in sdcMicro.

Exercise: Applying Perturbative Techniques

Continue working with sdc_nonpert from the previous exercise.

  1. Apply microaggregation to income using the default method ("mdav"). Use a group size of k = 5.

  2. Add noise to income as an alternative — apply additive noise with a noise level of 0.1 (10% of the standard deviation). Compare the result to microaggregation: which distorts the data more?

  3. Compare the information loss reported by sdcMicro after each step. Which method better balances risk and utility for this variable?

Tip

You cannot undo perturbation steps within the same sdc object. Create a fresh copy of sdc_nonpert before trying the second method so you can compare them side-by-side.

Step 1 – Microaggregation

sdc_micro <- microaggregation(
  obj    = sdc_nonpert,
  variables = "income",
  aggr   = 5,          # group size k
  method = "mdav"      # Maximum Distance to Average Vector
)

print(sdc_micro, type = "numrisk")
Numerical key variables: income, years_in_job

Disclosure risk (~100.00% in original data):
  modified data: [0.00%; 77.50%]

Current Information Loss in modified data (0.00% in original data):
  IL1: 1029.80
  Difference of Eigenvalues: 6.100%
----------------------------------------------------------------------

MDAV groups records by their distance to the group centroid, then replaces each value with the group mean. With aggr = 5, at least five records share the same income value, so singling out an individual becomes harder.

Step 2 – Additive noise (on a fresh copy)

sdc_noise <- addNoise(
  obj    = sdc_nonpert,
  variables = "income",
  noise  = 0.1         # noise level as fraction of SD
)

print(sdc_noise, type = "numrisk")
Numerical key variables: income, years_in_job

Disclosure risk (~100.00% in original data):
  modified data: [0.00%; 92.00%]

Current Information Loss in modified data (0.00% in original data):
  IL1: 505.65
  Difference of Eigenvalues: 6.220%
----------------------------------------------------------------------

addNoise() draws from a normal distribution with mean 0 and standard deviation = noise × sd(income) and adds it to each value. Every record gets a unique (slightly wrong) income.

Step 3 – Comparing information loss

# Information loss after microaggregation
il_micro <- get.sdcMicroObj(sdc_micro, "utility")

# Information loss after noise addition
il_noise <- get.sdcMicroObj(sdc_noise, "utility")

il_micro
$il1
[1] 1029.801

$il1s
[1] 22.23844

$eigen
[1] 0.0610226
il_noise
$il1
[1] 505.6452

$il1s
[1] 21.32289

$eigen
[1] 0.06223852

Interpretation: In this run, noise addition shows the lower IL1 (the mean absolute deviation between original and perturbed values): replacing each group of five incomes with its mean moves values further than noise at 10% of the standard deviation does. Microaggregation, on the other hand, achieves the lower upper bound on disclosure risk (77.50% vs. 92.00%). For a dataset where income feeds into regression analyses, noise addition is often preferable because it preserves individual-level variation; for frequency tables or group comparisons, microaggregation is the safer choice.

Save objects

saveRDS(sdc_micro, "../sdc_micro.rds")
saveRDS(sdc_noise, "../sdc_noise.rds")

Learning Objective

  • After completing this part of the tutorial, you will be able to apply selected perturbative techniques in R.

Exercise

  • Apply one or two of the techniques to selected variables in R

References

Carvalho, Tânia, Nuno Moniz, Pedro Faria, and Luís Antunes. 2023. “Survey on Privacy-Preserving Techniques for Microdata Publication.” ACM Computing Surveys 55 (14s): 1–42. https://doi.org/10.1145/3588765.