Non-Perturbative Techniques

Non-perturbative techniques reduce re-identification risk without distorting the underlying values - they either remove information or make it less specific. What stays in the dataset is still true; there is just less of it. This is what sets them apart from perturbative techniques, which add noise or swap values to obscure the original data.

Non-perturbative techniques include all techniques referred to as data masking, suppression, and deletion.

Non-perturbative techniques are usually the right starting point. They are easy to explain, easy to document, and easy to verify. For many research datasets, they are enough on their own.

Depending on the specific data type and risk, different techniques are possible (Carvalho et al. 2023):

Examples of Non-Perturbative Techniques


Global Recoding

Global recoding (also called generalization) replaces the exact values of a variable with broader categories - applied to every row in the dataset.

Example: Replace specific countries with world regions (e.g., “Germany”, “France” → “Western Europe”), or recode age in years to age in decades.

The key decision is how broadly to generalize. Wider categories reduce risk more, but also lose more information. The right granularity depends on how many people share each combination - which is exactly what k-anonymity measures. You will practice this trade-off in the exercise below.
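The link between granularity and group size can be made concrete with a small sketch. The toy data frame below is hypothetical (it is not the course dataset), and the 10-year bands are just one possible choice:

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  age    = c(23, 27, 24, 28, 35, 38, 52, 57),
  gender = c("f", "f", "m", "m", "m", "m", "f", "f")
)

# Before recoding: size of the smallest group sharing the same age and gender
key_before <- paste(toy$age, toy$gender)
min(table(key_before))

# Global recoding: every row is mapped to a 10-year age band
toy$age_band <- cut(
  toy$age,
  breaks = c(19, 29, 39, 49, 59),
  labels = c("20-29", "30-39", "40-49", "50-59")
)

# After recoding: size of the smallest group sharing age band and gender
key_after <- paste(toy$age_band, toy$gender)
min(table(key_after))
```

In this toy example every exact age is unique, so the smallest group has size 1 before recoding. After recoding, each combination of age band and gender is shared by two people.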

Local Recoding

Local recoding generalizes values only where needed, rather than across the whole dataset. This is useful when only a small subset of values is rare enough to create risk.

Example: If only a few participants are from Asia, you may recode only their values (e.g., “China”, “India”, “Nepal”) to “Asia”, keeping all others unchanged.

This means that values within the same variable end up on different levels of detail (in this example: country vs. continent). Compared to global recoding, local recoding is more targeted and loses less information, but it can be harder to justify and document consistently - you need to be transparent about which categories were collapsed and why.
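A minimal base R sketch of this idea follows. The country vector, the list of Asian countries, and the minimum group size of 2 are all assumptions made for illustration:

```r
# Hypothetical country variable, for illustration only
country <- c("Germany", "Germany", "France", "France", "Germany",
             "China", "India", "Nepal", "France")

# Assumed lookup of which countries would be collapsed to "Asia"
asian_countries <- c("China", "India", "Nepal")

# Countries that occur fewer than min_size times are considered rare
min_size <- 2
counts   <- table(country)
rare     <- names(counts)[counts < min_size]

# Local recoding: only rare Asian countries are generalized,
# all other values keep their original level of detail
country_local <- ifelse(
  country %in% rare & country %in% asian_countries,
  "Asia",
  country
)

table(country_local)
```

The recoded variable now mixes countries and one continent, which is exactly the mix of levels that needs to be documented.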

Top-and-Bottom Coding

Top-and-bottom coding is a special case of recoding that applies to numerical variables. Instead of grouping all values into bands, you cap the extreme ends of the distribution - replacing any value above a threshold with that threshold (top coding), or below it (bottom coding). This is useful when the bulk of your data is unremarkable, but a few unusual values are risky because they belong to a small number of people. In our dataset, very high incomes or very long job tenures might apply to only one or two participants, making them easy to single out. Capping income at the 95th percentile, for example, means the highest earners all share the same reported value.

Example: Instead of reporting “6 children”, report “4+ children” for anyone with four or more children; code all participants below the age of 20 years as “<20 years”.

How to determine the threshold: The threshold can be calculated, but this is not trivial. A common approach is to look at the distribution of the variable and choose a threshold where the remaining values above (or below) it are too few to guarantee anonymity. For example, if only 3 people in your dataset are older than 80, you might top-code at 80. You can also use percentiles - capping at the 95th or 99th percentile is a common starting point. The right threshold depends on your data and the level of k-anonymity you want to achieve.
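Both ways of choosing a threshold can be sketched in base R. The age and income vectors are made up for illustration, and the cut-offs (80, 20, and the 95th percentile) are example choices rather than recommendations:

```r
# Hypothetical values, for illustration only
age    <- c(17, 19, 34, 45, 51, 62, 67, 83, 88, 91)
income <- c(1800, 2400, 2600, 3100, 3300, 3500, 4200, 5200, 9800, 25000)

# Distribution-based: count how many values lie beyond a candidate threshold
n_above_80 <- sum(age > 80)
n_below_20 <- sum(age < 20)

# Top coding with pmin(): values above 80 are replaced by 80
# Bottom coding with pmax(): values below 20 are replaced by 20
age_coded <- pmax(pmin(age, 80), 20)

# Percentile-based: cap income at its 95th percentile
cap        <- quantile(income, 0.95, names = FALSE)
income_top <- pmin(income, cap)
```

After this step, a coded value of 80 has to be read as "80 or older" and a value of 20 as "20 or younger", so the codebook should say so.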

Suppression

Suppression means removing information entirely - replacing a value with a missing value (NA) or dropping it altogether. This is also known as nulling. It is the most straightforward way to handle data that is too risky to release, but it also loses the most information.

Suppression can occur on different levels:

  • Cell suppression: Remove a single risky value while keeping the rest of the record. For example, if one participant has a unique job title that makes them identifiable, you can set only that cell to NA.

  • Record suppression: Remove an entire participant’s data if their combination of attributes is so unique that no other technique can adequately protect them.

  • Variable suppression: Drop a whole variable from the released dataset if it is too risky to include at all. Direct identifiers like name and email address are a clear case - we already did this in Chapter 1.3.
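The three levels can be sketched in base R as follows. The data frame and the choice of which cell, record, and variable to suppress are hypothetical:

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  id   = 1:5,
  name = c("A", "B", "C", "D", "E"),  # placeholder names
  job  = c("nurse", "teacher", "astronaut", "nurse", "teacher"),
  age  = c(34, 45, 29, 51, 38),
  stringsAsFactors = FALSE
)

# Cell suppression: blank out a single risky value, keep the rest of the record
toy$job[toy$job == "astronaut"] <- NA

# Record suppression: drop one participant entirely (here: id 4, chosen arbitrarily)
toy <- toy[toy$id != 4, ]

# Variable suppression: drop a direct identifier from the dataset
toy$name <- NULL

toy
```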

Cell suppression can also follow a specific condition: conditional nulling means replacing data entries with zeros based on a condition. For example, entries in the column “Customer Feedback” could be nulled if the customer feedback was negative (Raghunathan 2013).

Another example is character masking, such as replacing the digits of a credit card number with X. Masking can be partial: in the credit card example, only the first 12 digits are replaced and the last four stay visible. The overall number of characters stays the same, giving XXXX XXXX XXXX 1234.
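Partial character masking could be sketched as below. The helper mask_card() and the card number are invented for this illustration; the function keeps spaces in place and leaves the last four digits readable:

```r
# Hypothetical helper: mask all digits except the last `keep_last`
mask_card <- function(x, keep_last = 4) {
  chars     <- strsplit(x, "")[[1]]
  digit_pos <- which(grepl("[0-9]", chars))       # positions of digits only
  n_mask    <- max(length(digit_pos) - keep_last, 0)
  chars[digit_pos[seq_len(n_mask)]] <- "X"        # spaces are left untouched
  paste(chars, collapse = "")
}

mask_card("4539 1488 0343 1234")  # made-up number
# returns "XXXX XXXX XXXX 1234"
```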

Suppression is often used as a last resort after recoding: if recoding alone is not enough to bring all records up to the required k-anonymity threshold, targeted cell suppression can close the remaining gaps.

Sampling

If your dataset covers an entire population - for example, all students enrolled in a particular program, or all employees of a specific company - then releasing it in full means that anyone who knows a person was part of that group can try to find their record. Releasing only a random sample of the data breaks this assumption: an attacker can no longer be certain whether a given individual is even in the released dataset.

A sample can be randomly selected, but also based on conditions (e.g., quotas for study subjects).

Sampling is most relevant for administrative datasets and registers. For typical survey-based research, where participation is voluntary and not exhaustively documented, it is less commonly needed. If you do use it, keep in mind that the sample size affects both privacy and the statistical representativeness of the data.

Sampling could also be quite effective when an attacker knows that a specific person took part in the study (participation knowledge).
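Drawing a simple random sample could look like this in base R. The register, the sampling fraction of 20%, and the seed are assumptions for illustration:

```r
set.seed(123)  # arbitrary seed so the example is reproducible

# Hypothetical register covering a full population of 1000 people
register <- data.frame(
  id      = 1:1000,
  program = sample(c("A", "B"), 1000, replace = TRUE)
)

# Release only a random 20% of the records (sampling without replacement)
sampling_fraction <- 0.2
keep     <- sample(nrow(register), size = round(sampling_fraction * nrow(register)))
released <- register[keep, ]

nrow(released)
```

A smaller fraction increases an attacker's uncertainty about whether a person is in the released data, but it also leaves fewer cases for analysis.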

Pro and Contra Using Non-Perturbative Techniques

Pro:

  • original data values are preserved (no distortion of statistics within remaining cells)

  • straightforward to implement and explain

  • easy to document

Con:

  • information is lost (removed or coarsened)

  • heavy suppression can make the dataset hard to use

  • may not be sufficient on its own for high-risk data

Applying Non-Perturbative Techniques With R

  • Use basic data wrangling operations in R or the functions in sdcMicro; both are better than changing values by hand.

  • But: if you publish the anonymization script, the script itself needs to be anonymous as well (e.g., it should not name specific countries; see Chapter DOCUMENTATION).

Exercise: Applying Non-Perturbative Techniques

In this exercise, you will apply non-perturbative techniques to the data to reduce re-identification risk.

The following sdcMicro functions are available to you:

  • globalRecode() - recodes a key variable into broader groups (e.g., exact age → age band)
  • topBotCoding() - caps extreme values at a given threshold
  • localSuppression() - suppresses individual cells in records that fall below a chosen k-anonymity threshold. I would be careful when using this function: it does a good job of finding individuals whose records are still unique, but it does not report its changes transparently and it introduces missing values. Instead, you can recode the data manually, e.g., with mutate() or with a function such as fct_lump_min() from forcats, which lumps all categories below a certain group size together.
  1. Look at the key variables in sdc_data (gender, age, education, plz) and decide which technique is appropriate for each. There is no single correct solution - choose thresholds and groupings that make sense given the data, and be prepared to justify your choices.
  2. Apply these techniques using the functions mentioned above.
  3. Compare the new risk summary to the one from the previous exercise.
Tip

The sdcMicro functions operate directly on the sdc object and update risk estimates automatically. Check ?globalRecode, ?topBotCoding, and ?localSuppression for argument details.

The forcats function fct_lump_min() is applied to the dataset itself. The sdcObject needs to be updated afterward.
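As a rough sketch of what fct_lump_min() does, consider the made-up education factor below. The category names and the minimum group size of 3 are assumptions for illustration:

```r
library(forcats)

# Hypothetical education variable, for illustration only
education <- factor(c(
  rep("Bachelor", 6), rep("Master", 5), rep("PhD", 2), "Habilitation"
))

# Lump every category with fewer than 3 members into "Other"
education_lumped <- fct_lump_min(education, min = 3, other_level = "Other")

table(education_lumped)
```

Here “PhD” and “Habilitation” fall below the minimum size and are merged, so the smallest remaining category has three members.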


Keep in mind that this is just an example solution - there are many ways to achieve similar levels of privacy.

I started by applying global recoding to age:

sdc_nonpert <- globalRecode(
  obj    = sdc_data,
  column = "age",
  breaks = c(17, 29, 39, 49, 59, Inf), # Define breaks
  labels = c("18-29", "30-39", "40-49", "50-59", "60+")
)  

sdc_nonpert # check summary after making the change
The input dataset consists of 200 rows and 14 variables.
  --> Categorical key variables: gender, age, education, plz
  --> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:

Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
 Key Variable Number of categories        Mean size         
       <char>               <char> <char>    <char>   <char>
       gender                    3    (3)    66.667 (66.667)
          age                    5   (52)    40.000  (3.846)
    education                    5    (5)    40.000 (40.000)
          plz                  152  (152)     1.316  (1.316)
 Size of smallest (>0)       
                <char> <char>
                    10   (10)
                    23    (1)
                     9    (9)
                     1    (1)
----------------------------------------------------------------------
Infos on 2/3-Anonymity:

Number of observations violating
  - 2-anonymity: 180 (90.000%) | in original data: 198 (99.000%)
  - 3-anonymity: 200 (100.000%) | in original data: 200 (100.000%)
  - 5-anonymity: 200 (100.000%) | in original data: 200 (100.000%)

----------------------------------------------------------------------
Numerical key variables: income, years_in_job

Disclosure risk (~100.00% in original data):
  modified data: [0.00%; 100.00%]

Current Information Loss in modified data (0.00% in original data):
  IL1: 0.00
  Difference of Eigenvalues: 0.000%
----------------------------------------------------------------------
print(sdc_nonpert, type = "risk") # check risk after making the change
Risk measures:

Number of observations with higher risk than the main part of the data: 
  in modified data: 0
  in original data: 0
Expected number of re-identifications: 
  in modified data: 190.00 (95.00 %)
  in original data: 199.00 (99.50 %)

I started by dividing age into five bands of roughly ten years each. This gives a mean of 40 persons per band, and even the smallest band still has 23. The breaks can be changed later if more privacy or more utility is needed.

We see that the k-anonymity values have barely changed after recoding age: 180 of 200 records still violate 2-anonymity. This is primarily due to the plz variable, which is unique for most participants. I therefore decided to generalize the postal code.

As a first step, I used globalRecode() to group postal codes into five broad regions:

sdc_nonpert <- globalRecode(
  obj    = sdc_nonpert,
  column = "plz",
  breaks = c(0, 19999, 39999, 59999, 79999, 99999),
  labels = c('00-19', '20-39', '40-59', '60-79', '80-99')
)

sdc_nonpert
The input dataset consists of 200 rows and 14 variables.
  --> Categorical key variables: gender, age, education, plz
  --> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:

Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
 Key Variable Number of categories        Mean size         
       <char>               <char> <char>    <char>   <char>
       gender                    3    (3)    66.667 (66.667)
          age                    5   (52)    40.000  (3.846)
    education                    5    (5)    40.000 (40.000)
          plz                    5  (152)    40.000  (1.316)
 Size of smallest (>0)       
                <char> <char>
                    10   (10)
                    23    (1)
                     9    (9)
                    21    (1)
----------------------------------------------------------------------
Infos on 2/3-Anonymity:

Number of observations violating
  - 2-anonymity: 65 (32.500%) | in original data: 198 (99.000%)
  - 3-anonymity: 105 (52.500%) | in original data: 200 (100.000%)
  - 5-anonymity: 149 (74.500%) | in original data: 200 (100.000%)

----------------------------------------------------------------------
Numerical key variables: income, years_in_job

Disclosure risk (~100.00% in original data):
  modified data: [0.00%; 100.00%]

Current Information Loss in modified data (0.00% in original data):
  IL1: 0.00
  Difference of Eigenvalues: 0.000%
----------------------------------------------------------------------
print(sdc_nonpert, type = "risk")
Risk measures:

Number of observations with higher risk than the main part of the data: 
  in modified data: 0
  in original data: 0
Expected number of re-identifications: 
  in modified data: 107.00 (53.50 %)
  in original data: 199.00 (99.50 %)

The new categories map loosely to north-east (00-19), north-central (20-39), west (40-59), south-west (60-79), and south-east (80-99) Germany.

Looking at k-anonymity now, we see that there are still many unique observations. To reduce risk further, I will generalize plz even more. For this dataset, I mostly want to use plz to distinguish whether participants live in the Munich area or elsewhere in Germany. Postal codes starting with 8 cover roughly the areas surrounding Munich, so I will collapse plz into just two categories: "8xxxx" and "other".

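globalRecode() cannot do this directly since cut() only handles contiguous intervals - and “other” spans two separate ranges (00000-79999 and 90000-99999). Instead, I directly modify the @manipKeyVars slot of the sdcObject and recalculate risks. Note that the code collapses the existing “80-99” level, so the resulting “8xxxx” category also contains postal codes starting with 9: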

library(forcats)

sdc_nonpert@manipKeyVars$plz <- fct_collapse(
  sdc_nonpert@manipKeyVars$plz,
  "8xxxx" = c("80-99"),
  "other" = c("00-19", "20-39", "40-59", "60-79")
)

sdc_nonpert <- calcRisks(sdc_nonpert) # After these manual changes, you have to update the risks on the sdcObject

sdc_nonpert
The input dataset consists of 200 rows and 14 variables.
  --> Categorical key variables: gender, age, education, plz
  --> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:

Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
 Key Variable Number of categories        Mean size         
       <char>               <char> <char>    <char>   <char>
       gender                    3    (3)    66.667 (66.667)
          age                    5   (52)    40.000  (3.846)
    education                    5    (5)    40.000 (40.000)
          plz                    2  (152)   100.000  (1.316)
 Size of smallest (>0)       
                <char> <char>
                    10   (10)
                    23    (1)
                     9    (9)
                    91    (1)
----------------------------------------------------------------------
Infos on 2/3-Anonymity:

Number of observations violating
  - 2-anonymity: 32 (16.000%) | in original data: 198 (99.000%)
  - 3-anonymity: 60 (30.000%) | in original data: 200 (100.000%)
  - 5-anonymity: 105 (52.500%) | in original data: 200 (100.000%)

----------------------------------------------------------------------
Numerical key variables: income, years_in_job

Disclosure risk (~100.00% in original data):
  modified data: [0.00%; 100.00%]

Current Information Loss in modified data (0.00% in original data):
  IL1: 0.00
  Difference of Eigenvalues: 0.000%
----------------------------------------------------------------------
print(sdc_nonpert, type = "risk")
Risk measures:

Number of observations with higher risk than the main part of the data: 
  in modified data: 32
  in original data: 0
Expected number of re-identifications: 
  in modified data: 74.00 (37.00 %)
  in original data: 199.00 (99.50 %)
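The collapse above starts from the already recoded “80-99” level. A strict split by leading digit would have to start from the original postal codes. A minimal sketch, assuming the codes are stored as numbers (so leading zeros are lost) and using made-up example values:

```r
# Hypothetical postal codes, for illustration only
plz_raw <- c(1067, 20095, 80331, 85748, 90402, 99084)

# Pad to five characters so that codes like 01067 keep their leading zero
plz_chr <- sprintf("%05d", as.integer(plz_raw))

# Keep only codes whose first digit is 8, lump everything else into "other"
plz_strict <- ifelse(substr(plz_chr, 1, 1) == "8", "8xxxx", "other")

table(plz_strict)
```

Applying such a recode to the sdcObject would change the group sizes and risk figures reported in this walkthrough.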

Currently, 32 persons in our dataset are still unique (they violate 2-anonymity) and 60 persons violate 3-anonymity. I don’t want to generalize the key variables any further, so I decide to apply local suppression.

sdc_nonpert <- localSuppression(
  sdc_nonpert,
  k = 2, # define wanted k-anonymity level, k = 2 is the default
  importance = c(2, 1, 4, 3) # this defines the order of importance; the algorithm begins suppressing on the variable marked as 4
  )

sdc_nonpert
The input dataset consists of 200 rows and 14 variables.
  --> Categorical key variables: gender, age, education, plz
  --> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:

Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
 Key Variable Number of categories        Mean size         
       <char>               <char> <char>    <char>   <char>
       gender                    4    (3)    66.000 (66.667)
          age                    5   (52)    40.000  (3.846)
    education                    6    (5)    34.400 (40.000)
          plz                    3  (152)    99.000  (1.316)
 Size of smallest (>0)       
                <char> <char>
                     8   (10)
                    23    (1)
                     2    (9)
                    90    (1)
----------------------------------------------------------------------
Infos on 2/3-Anonymity:

Number of observations violating
  - 2-anonymity: 0 (0.000%) | in original data: 198 (99.000%)
  - 3-anonymity: 16 (8.000%) | in original data: 200 (100.000%)
  - 5-anonymity: 59 (29.500%) | in original data: 200 (100.000%)

----------------------------------------------------------------------
Numerical key variables: income, years_in_job

Disclosure risk (~100.00% in original data):
  modified data: [0.00%; 100.00%]

Current Information Loss in modified data (0.00% in original data):
  IL1: 0.00
  Difference of Eigenvalues: 0.000%
----------------------------------------------------------------------
Local suppression:
    KeyVar      | Suppressions (#)      | Suppressions (%)
    <char> <char>            <int> <char>           <char>
    gender      |                2      |            1.000
       age      |                0      |            0.000
 education      |               28      |           14.000
       plz      |                2      |            1.000
----------------------------------------------------------------------
print(sdc_nonpert, type = "risk")
Risk measures:

Number of observations with higher risk than the main part of the data: 
  in modified data: 16
  in original data: 0
Expected number of re-identifications: 
  in modified data: 40.23 (20.12 %)
  in original data: 199.00 (99.50 %)

The summary of the sdcObject shows us what the last local suppression step has achieved: It suppressed 28 values on the education variable and 2 each on gender and postal code. Now, 2-anonymity is reached for all participants. 16 entries violate 3-anonymity; I will try local suppression at k = 3 to see how much more suppression would be necessary to achieve this.

sdc_nonpert <- localSuppression(
  sdc_nonpert,
  k = 3, # define wanted k-anonymity level, k = 2 is the default
  importance = c(2, 1, 4, 3) # this defines the order of importance; the algorithm begins suppressing on the variable marked as 4
  )

sdc_nonpert
The input dataset consists of 200 rows and 14 variables.
  --> Categorical key variables: gender, age, education, plz
  --> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:

Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
 Key Variable Number of categories        Mean size         
       <char>               <char> <char>    <char>   <char>
       gender                    4    (3)    64.667 (66.667)
          age                    5   (52)    40.000  (3.846)
    education                    5    (5)    32.000 (40.000)
          plz                    3  (152)    99.000  (1.316)
 Size of smallest (>0)       
                <char> <char>
                     4   (10)
                    23    (1)
                     5    (9)
                    90    (1)
----------------------------------------------------------------------
Infos on 2/3-Anonymity:

Number of observations violating
  - 2-anonymity: 0 (0.000%) | in original data: 198 (99.000%)
  - 3-anonymity: 0 (0.000%) | in original data: 200 (100.000%)
  - 5-anonymity: 29 (14.500%) | in original data: 200 (100.000%)

----------------------------------------------------------------------
Numerical key variables: income, years_in_job

Disclosure risk (~100.00% in original data):
  modified data: [0.00%; 100.00%]

Current Information Loss in modified data (0.00% in original data):
  IL1: 0.00
  Difference of Eigenvalues: 0.000%
----------------------------------------------------------------------
Local suppression:
    KeyVar      | Suppressions (#)      | Suppressions (%)
    <char> <char>            <int> <char>           <char>
    gender      |                4      |            2.000
       age      |                0      |            0.000
 education      |               12      |            6.000
       plz      |                0      |            0.000
----------------------------------------------------------------------
print(sdc_nonpert, type = "risk")
Risk measures:

Number of observations with higher risk than the main part of the data: 
  in modified data: 29
  in original data: 0
Expected number of re-identifications: 
  in modified data: 30.27 (15.13 %)
  in original data: 199.00 (99.50 %)

With 12 more suppressions on education and 4 more on gender, we achieve 3-anonymity. Only 29 entries remain that violate 5-anonymity. To me, this is enough protection in terms of k-anonymity.
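Because local suppression introduces missing values, it can help to record exactly which cells were changed. One way to do this in base R is to compare the data before and after suppression. The two small data frames below are hypothetical stand-ins for the original and the modified key variables:

```r
# Hypothetical key variables before and after suppression
original <- data.frame(
  gender    = c("f", "m", "nb", "f"),
  education = c("BA", "MA", "PhD", "BA"),
  stringsAsFactors = FALSE
)
modified <- original
modified$gender[3]    <- NA
modified$education[3] <- NA

# Cells that are missing now but were not missing before
new_na <- is.na(modified) & !is.na(original)

colSums(new_na)                # number of suppressed cells per variable
rows_changed <- which(rowSums(new_na) > 0)  # rows with at least one suppression
```

Such a log of changed cells can go into the internal documentation, while the published material should not reveal which original values were removed.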

Now that the key variables are in their final state, I apply top-coding to income and years in job.

My goal with income is to protect the few individuals with a very high income. I start by inspecting the distribution:

library(ggplot2)

ggplot(data_withoutdirectidentifiers, aes(y = income)) +
  geom_boxplot() +
  theme_minimal()

I see a few extreme outliers above 400 000 €. Capping at the 95th percentile (89 185.85 €) cuts these off. I’ll use topBotCoding() with the 95th percentile as both the threshold and the replacement value.

sdc_nonpert <- topBotCoding(
  obj         = sdc_nonpert,
  column      = "income",
  value       = quantile(data_withoutdirectidentifiers$income, 0.95),
  replacement = quantile(data_withoutdirectidentifiers$income, 0.95),
  kind        = "top"
)

sdc_nonpert
The input dataset consists of 200 rows and 14 variables.
  --> Categorical key variables: gender, age, education, plz
  --> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:

Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
 Key Variable Number of categories        Mean size         
       <char>               <char> <char>    <char>   <char>
       gender                    4    (3)    64.667 (66.667)
          age                    5   (52)    40.000  (3.846)
    education                    5    (5)    32.000 (40.000)
          plz                    3  (152)    99.000  (1.316)
 Size of smallest (>0)       
                <char> <char>
                     4   (10)
                    23    (1)
                     5    (9)
                    90    (1)
----------------------------------------------------------------------
Infos on 2/3-Anonymity:

Number of observations violating
  - 2-anonymity: 0 (0.000%) | in original data: 198 (99.000%)
  - 3-anonymity: 0 (0.000%) | in original data: 200 (100.000%)
  - 5-anonymity: 29 (14.500%) | in original data: 200 (100.000%)

----------------------------------------------------------------------
Numerical key variables: income, years_in_job

Disclosure risk (~100.00% in original data):
  modified data: [0.00%; 95.00%]

Current Information Loss in modified data (0.00% in original data):
  IL1: 346.87
  Difference of Eigenvalues: 6.600%
----------------------------------------------------------------------
Local suppression:
    KeyVar      | Suppressions (#)      | Suppressions (%)
    <char> <char>            <int> <char>           <char>
    gender      |                4      |            2.000
       age      |                0      |            0.000
 education      |               12      |            6.000
       plz      |                0      |            0.000
----------------------------------------------------------------------
print(sdc_nonpert, type = "risk")
Risk measures:

Number of observations with higher risk than the main part of the data: 
  in modified data: 29
  in original data: 0
Expected number of re-identifications: 
  in modified data: 30.27 (15.13 %)
  in original data: 199.00 (99.50 %)

For years in job, I start by familiarizing myself with the data:

summary(data_withoutdirectidentifiers$years_in_job)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    2.00    4.00    5.34    8.00   32.00 
ggplot(data_withoutdirectidentifiers, aes(y = years_in_job)) +
  geom_boxplot() +
  theme_minimal()

Based on the visualization, I decide to use 15 as a cut-off value, as this cuts off the extreme values.

sdc_nonpert <- topBotCoding(
  obj         = sdc_nonpert,
  column      = "years_in_job",
  value       = 15,
  replacement = 15,
  kind        = "top"
)

sdc_nonpert
The input dataset consists of 200 rows and 14 variables.
  --> Categorical key variables: gender, age, education, plz
  --> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:

Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
 Key Variable Number of categories        Mean size         
       <char>               <char> <char>    <char>   <char>
       gender                    4    (3)    64.667 (66.667)
          age                    5   (52)    40.000  (3.846)
    education                    5    (5)    32.000 (40.000)
          plz                    3  (152)    99.000  (1.316)
 Size of smallest (>0)       
                <char> <char>
                     4   (10)
                    23    (1)
                     5    (9)
                    90    (1)
----------------------------------------------------------------------
Infos on 2/3-Anonymity:

Number of observations violating
  - 2-anonymity: 0 (0.000%) | in original data: 198 (99.000%)
  - 3-anonymity: 0 (0.000%) | in original data: 200 (100.000%)
  - 5-anonymity: 29 (14.500%) | in original data: 200 (100.000%)

----------------------------------------------------------------------
Numerical key variables: income, years_in_job

Disclosure risk (~100.00% in original data):
  modified data: [0.00%; 92.00%]

Current Information Loss in modified data (0.00% in original data):
  IL1: 495.71
  Difference of Eigenvalues: 6.200%
----------------------------------------------------------------------
Local suppression:
    KeyVar      | Suppressions (#)      | Suppressions (%)
    <char> <char>            <int> <char>           <char>
    gender      |                4      |            2.000
       age      |                0      |            0.000
 education      |               12      |            6.000
       plz      |                0      |            0.000
----------------------------------------------------------------------
print(sdc_nonpert, type = "risk")
Risk measures:

Number of observations with higher risk than the main part of the data: 
  in modified data: 29
  in original data: 0
Expected number of re-identifications: 
  in modified data: 30.27 (15.13 %)
  in original data: 199.00 (99.50 %)

After these steps, the upper bound of the disclosure risk for the numerical key variables is 92%, compared to about 100% in the original data.

Now I can extract the dataset. Because sdcMicro stores recoded key variables as integer codes internally, I need to restore the labels manually using mutate:

library(dplyr) # provides %>% and mutate()

data_nonpert <- extractManipData(sdc_nonpert)

# Restore labels for variables recoded inside sdcMicro
age_labels    <- c("18-29", "30-39", "40-49", "50-59", "60+")
plz_labels    <- c("8xxxx", "other")

data_nonpert <- data_nonpert %>%
  mutate(
    age = factor(age, levels = seq_along(age_labels), labels = age_labels),
    plz = factor(plz, levels = seq_along(plz_labels), labels = plz_labels)
  )


data_nonpert
      X  id   plz     gender   age   income years_in_job          religion
1     1   1 other     female 18-29  8471.35            3       Catholicism
2     2   2 other     female 30-39 34721.35            4              None
3     3   3 other       male   60+ 89185.85            0              None
4     4   4 other     female   60+ 59490.47            5              None
5     5   5 8xxxx       male 18-29 52337.39            1             Islam
6     6   6 other       male   60+ 60960.23            6     Protestantism
7     7   7 other       male 30-39 61037.20           11              None
8     8   8 8xxxx     female 30-39 58720.93            2          Buddhism
9     9   9 8xxxx       male 30-39 55625.01            1       Catholicism
10   10  10 other       male 18-29 75996.23            4       Catholicism
11   11  11  <NA> non-binary 40-49 29760.83            6       Catholicism
12   12  12 8xxxx     female 30-39 43369.95            9              None
13   13  13 other       male 30-39 70224.28            3              None
14   14  14 other     female 40-49 43707.07            2             Islam
15   15  15 other     female   60+ 50073.94            5             Islam
16   16  16 other     female 30-39 53897.71            9       Catholicism
17   17  17 8xxxx       male 50-59 47372.98            8       Catholicism
18   18  18 8xxxx     female   60+ 53557.83            8           Judaism
19   19  19 other       male 30-39 88933.73            4             Islam
20   20  20 8xxxx     female 50-59 33516.81            2              None
21   21  21 other     female 50-59 60266.05            3              None
22   22  22 other     female 18-29  3234.65           11              None
23   23  23 other     female 40-49 18647.76            0     Protestantism
24   24  24 other     female 40-49 53198.80            0       Catholicism
25   25  25 other       male 50-59 66750.89            8     Protestantism
26   26  26 8xxxx     female 40-49 33275.84            2       Catholicism
27   27  27 other       male 40-49 43974.11            0       Catholicism
28   28  28 other     female 30-39 56904.06            2              None
29   29  29 other     female 18-29 49785.63            2             Islam
30   30  30 8xxxx     female 18-29 65838.27            4              None
31   31  31 8xxxx     female 40-49 43510.82            2       Catholicism
32   32  32 other     female 18-29 35050.43           11     Protestantism
33   33  33 other       male 18-29 45612.60            6              None
34   34  34 8xxxx     female 18-29 74064.45            2             Islam
35   35  35 8xxxx     female 18-29 19700.02            1              None
36   36  36 other     female 18-29 35491.71            1     Protestantism
37   37  37 8xxxx     female 18-29 31146.67            3     Protestantism
38   38  38 other     female 18-29 58538.33            1              None
39   39  39 8xxxx       <NA> 30-39 25565.51           10              None
40   40  40 other       male 50-59 53781.80           12              None
41   41  41 8xxxx       male 30-39 48621.36            6              None
42   42  42 8xxxx     female 50-59 52099.26            1              None
43   43  43 8xxxx       male 18-29 80544.39            5              None
44   44  44 8xxxx       male 40-49 44352.87            0              None
45   45  45 8xxxx       male 18-29 36923.45            5       Catholicism
46   46  46 other     female   60+ 89185.85            8       Catholicism
47   47  47 8xxxx       male 50-59 82695.22            6              None
48   48  48 8xxxx       male   60+ 89185.85           15     Protestantism
49   49  49 8xxxx       <NA> 18-29 43699.08            3     Protestantism
50   50  50 8xxxx       male 18-29 55256.89            8       Catholicism
51   51  51 8xxxx     female 18-29 57280.41            2     Protestantism
52   52  52 other     female 40-49 65257.71           15       Catholicism
53   53  53 other       male 30-39 73130.44            5     Protestantism
54   54  54 8xxxx     female 50-59 77231.91            5             Islam
55   55  55 other non-binary 40-49 89185.85            4              None
56   56  56 8xxxx     female 50-59 49240.52            6              None
57   57  57 8xxxx       male 18-29 49181.85            1     Protestantism
58   58  58 8xxxx     female 40-49 45667.06           11       Catholicism
59   59  59 8xxxx       male 18-29 71020.92            3       Catholicism
60   60  60 other     female 50-59 17867.79            0              None
61   61  61 other       male 30-39 59154.65            7              None
62   62  62 8xxxx       male 18-29 34867.09            1     Protestantism
63   63  63 8xxxx       male   60+ 89185.85            9     Protestantism
64   64  64 8xxxx     female 40-49  6537.27            4             Islam
65   65  65 8xxxx       <NA> 18-29 53963.61            5     Protestantism
66   66  66 other       male 30-39 53233.39            0              None
67   67  67 other     female   60+ 35192.39            0       Catholicism
68   68  68 8xxxx       male 40-49 88998.52            7     Protestantism
69   69  69 8xxxx     female 30-39 76185.54           15              None
70   70  70 8xxxx       male 18-29 15277.69            1       Catholicism
71   71  71 8xxxx     female 50-59 32358.41            4     Protestantism
72   72  72 8xxxx     female 50-59 59564.29            0       Catholicism
73   73  73 8xxxx       male 18-29 44946.10            1              None
74   74  74 8xxxx       male 50-59 66701.73           10           Judaism
75   75  75 8xxxx     female 30-39 52378.45           13     Protestantism
76   76  76 8xxxx     female 30-39 28532.02            1     Protestantism
77   77  77 other       male 40-49 56000.53            5     Protestantism
78   78  78 8xxxx       male 18-29 69690.09           11              None
79   79  79 8xxxx     female 50-59 46431.85            2     Protestantism
80   80  80 8xxxx     female   60+ 85894.23            4              None
81   81  81 other     female 40-49 59395.77            3     Protestantism
82   82  82 8xxxx     female 18-29 50677.57            0     Protestantism
83   83  83 other       male   60+ 19523.64            3             Islam
84   84  84 other     female 30-39 60402.28            5     Protestantism
85   85  85 8xxxx       male 40-49 73762.07           15              None
86   86  86 other       male 50-59 57488.83            2       Catholicism
87   87  87 other     female 30-39 48387.34            4       Catholicism
88   88  88 other     female 40-49 51020.65           15       Catholicism
89   89  89 8xxxx       <NA> 30-39 89185.85            8              None
90   90  90 other     female 40-49 52973.81            5       Catholicism
91   91  91 8xxxx     female 18-29 65059.84            5     Protestantism
92   92  92 other     female 40-49 89185.85            5       Catholicism
93   93  93 other     female 50-59 60871.39           10       Catholicism
94   94  94 other       male 18-29 38100.28            1              None
95   95  95 other     female   60+  1508.25            1              None
96   96  96 8xxxx     female 40-49 35839.71            2       Catholicism
97   97  97 8xxxx       male   60+ 33524.46            3              None
98   98  98 other     female 40-49 89185.85           10     Protestantism
99   99  99 other       male 40-49 37535.39           15              None
100 100 100 other       male 40-49 70163.08            3       Catholicism
101 101 101 8xxxx       male 18-29 44974.20            2          Buddhism
102 102 102 other       male 30-39 65006.97            5       Catholicism
103 103 103 8xxxx       male 40-49 28030.40           13     Protestantism
104 104 104 8xxxx       male 40-49 56143.31            1     Protestantism
105 105 105 8xxxx       male 18-29 21741.10           12       Catholicism
106 106 106 other     female 18-29 61855.98            1              None
107 107 107 other     female 18-29 15326.24            1              None
108 108 108 8xxxx       male   60+  9709.52            5              None
109 109 109 other     female 40-49 25659.34            5       Catholicism
110 110 110 other     female 50-59 69044.93            3       Catholicism
111 111 111 8xxxx       male 30-39 66301.02            3     Protestantism
112 112 112 8xxxx     female 30-39 27090.40            2              None
113 113 113 other       male 40-49 45324.62            4              None
114 114 114 other       male   60+ 50849.11            5       Catholicism
115 115 115 8xxxx     female   60+ 30421.29           13              None
116 116 116 other     female 40-49 50230.45            3     Protestantism
117 117 117 8xxxx       male 18-29 37902.42            6              None
118 118 118 other     female 18-29 55949.20            1       Catholicism
119 119 119 other       male 40-49 53509.42            0              None
120 120 120 8xxxx     female   60+ 57558.17            2       Catholicism
121 121 121 other       male 30-39 67904.73            3     Protestantism
122 122 122 other       male 18-29 37003.44            1              None
123 123 123 8xxxx     female 40-49 74012.32            0              None
124 124 124 8xxxx       male 40-49 54693.74            0              None
125 125 125 other     female   60+ 36977.03           12              None
126 126 126 other       male 30-39 36546.40            2              None
127 127 127 other       male 30-39 84045.58           10       Catholicism
128 128 128 other     female 18-29 82138.06            1              None
129 129 129 8xxxx     female 50-59 30733.31            1              None
130 130 130 8xxxx       male 40-49 61673.59            3     Protestantism
131 131 131 8xxxx     female 50-59 69536.55            1     Protestantism
132 132 132 8xxxx       male 30-39 48347.66            6     Protestantism
133 133 133 8xxxx       male 30-39 66190.75            0     Protestantism
134 134 134 other       male   60+ 66078.75           11              None
135 135 135 8xxxx     female 50-59 18269.69            1              None
136 136 136 other non-binary 40-49 59707.29            1     Protestantism
137 137 137 8xxxx       male 40-49 52099.71           15              None
138 138 138 8xxxx     female 18-29 34656.89            4     Protestantism
139 139 139 8xxxx       male 40-49 34257.44           10       Catholicism
140 140 140 8xxxx     female 30-39 24074.20           14              None
141 141 141 8xxxx       male 40-49 64152.22            1       Catholicism
142 142 142 other     female 40-49 83251.70            2     Protestantism
143 143 143 8xxxx     female 40-49 64440.78           13     Protestantism
144 144 144 8xxxx     female 18-29 47105.72            9              None
145 145 145 other     female 40-49 36879.16            3       Catholicism
146 146 146 other       male 30-39 42220.17           15     Protestantism
147 147 147 8xxxx       male 30-39 63960.04            0              None
148 148 148 8xxxx       male 50-59  3530.08            9              None
149 149 149 other     female 30-39 13668.60            4             Islam
150 150 150 other       male 40-49 50377.45            7     Protestantism
151 151 151 8xxxx     female 40-49 61128.01            3       Catholicism
152 152 152 8xxxx       male 18-29 43719.31            4              None
153 153 153 other     female 40-49 54589.13           10              None
154 154 154 other     female 50-59 19115.90           10              None
155 155 155 8xxxx     female 40-49 10024.02            2       Catholicism
156 156 156 other       male 18-29 85761.37            7       Catholicism
157 157 157  <NA>       <NA>   60+ 82607.20            3       Catholicism
158 158 158 other       male 50-59 44383.80            1              None
159 159 159 8xxxx     female 30-39 66185.55           15       Catholicism
160 160 160 other       male 50-59 46443.35           11     Protestantism
161 161 161 8xxxx     female 50-59 41571.20            9     Protestantism
162 162 162 other     female 40-49 48002.36            1       Catholicism
163 163 163 8xxxx     female 40-49 64303.37            0     Protestantism
164 164 164 8xxxx     female 50-59 63828.73           10       Catholicism
165 165 165 other       male 18-29 62158.38            1             Islam
166 166 166 8xxxx     female 30-39 36173.97            6              None
167 167 167 other     female 40-49 22233.87            0             Islam
168 168 168 8xxxx       male 18-29 65795.96            2       Catholicism
169 169 169 other     female 18-29 50500.79            1              None
170 170 170 other     female 40-49 34635.22            6              None
171 171 171 8xxxx     female   60+ 51093.90           11              None
172 172 172 8xxxx       male 30-39 38140.07           13              None
173 173 173 other non-binary 40-49 83316.86            2       Catholicism
174 174 174 other     female 40-49 89185.85           10              None
175 175 175 8xxxx       male 40-49 47266.68           13     Protestantism
176 176 176 8xxxx     female 18-29 38736.53            5       Catholicism
177 177 177 8xxxx       male 30-39 51254.30            9       Catholicism
178 178 178 8xxxx     female 40-49  3070.65            2             Islam
179 179 179 8xxxx       male 50-59 67298.52            7              None
180 180 180 8xxxx       male 40-49 52914.99            6     Protestantism
181 181 181 other     female   60+ 53761.55            3              None
182 182 182 8xxxx       male 40-49 62667.11           10       Catholicism
183 183 183 8xxxx     female 18-29 69837.87            1     Protestantism
184 184 184 other       male 30-39 48027.70            9              None
185 185 185 8xxxx       male 50-59 68760.98            6       Catholicism
186 186 186 8xxxx     female 18-29 53996.54            6              None
187 187 187 8xxxx       <NA>   60+ 29089.31            2     Protestantism
188 188 188 8xxxx     female 18-29 58074.39            0     Protestantism
189 189 189 8xxxx       male 30-39 43647.51            5          Buddhism
190 190 190 8xxxx     female 40-49 59911.66           15       Catholicism
191 191 191 other     female 40-49 15015.81            4              None
192 192 192 8xxxx       male 30-39 48884.82            3       Catholicism
193 193 193 other     female 30-39 11665.95            5     Protestantism
194 194 194 other       male 40-49 51434.15            6 Eastern Orthodoxy
195 195 195 8xxxx       male 30-39 73230.95           10       Catholicism
196 196 196 8xxxx       male 50-59 52969.14            5              None
197 197 197 other     female 30-39 38514.53           11              None
198 198 198 8xxxx     female 40-49 89185.85            0     Protestantism
199 199 199 8xxxx     female 30-39 58564.02            4       Catholicism
200 200 200 other     female 40-49 30312.73            3       Catholicism
                                                      job_title      education
1                                      Local government officer   trade school
2                                           Structural engineer    high school
3                               Psychotherapist, dance movement    high school
4                                        Fitness centre manager    high school
5                 Programme researcher, broadcasting/film/video    high school
6                                        Chief Strategy Officer    high school
7                                      Engineer, communications    high school
8                                       Secretary/administrator     university
9                                                  Video editor   trade school
10                                                Hotel manager    high school
11                                                    Herbalist     university
12                                      Teacher, primary school    high school
13                             Production assistant, television    high school
14                                             Surveyor, mining    high school
15                                               Data scientist    high school
16                                     Programmer, applications    high school
17                                   Horticulturist, commercial   trade school
18                             Training and development officer    high school
19                                             Catering manager    high school
20                                             Textile designer           <NA>
21                                   Designer, fashion/clothing    high school
22                                            Medical physicist    high school
23                                          Seismic interpreter           <NA>
24                                          Biomedical engineer    high school
25                                          Biomedical engineer           <NA>
26                                             Technical brewer     university
27                                     Advertising art director    high school
28                                Clothing/textile technologist    high school
29                                                Stage manager     university
30                                     Advertising art director    high school
31                                                IT consultant     university
32                                     Secondary school teacher   trade school
33                                           Restaurant manager           <NA>
34                                       Politician's assistant   trade school
35                                                  Illustrator    high school
36                                         Community pharmacist    high school
37                                                    Solicitor           <NA>
38                                     Local government officer    high school
39                                                    Osteopath           <NA>
40                                          Medical illustrator    high school
41                                                      Surgeon   trade school
42                                         Financial controller    high school
43                                              Systems analyst    high school
44                                             Sports therapist    high school
45                    Sound technician, broadcasting/film/video doctoral title
46                                       Amenity horticulturist    high school
47                                                  Firefighter           <NA>
48                                                         Copy    high school
49                            Sales promotion account executive   trade school
50                                     Public relations officer   trade school
51                                            Financial planner    high school
52                                       Museum/gallery curator    high school
53                                                       Dealer    high school
54                             Higher education careers adviser    high school
55                                      General practice doctor           <NA>
56                                       Recruitment consultant           <NA>
57                             Training and development officer    high school
58                                                    Mudlogger    high school
59                                         Pharmacist, hospital    high school
60                                     Horticultural consultant    high school
61                                                   Ergonomist    high school
62                                         Chartered accountant           <NA>
63                                                 Psychiatrist           <NA>
64                                      Data processing manager    high school
65                                         Broadcast journalist   trade school
66             Administrator, charities/voluntary organisations    high school
67                                Speech and language therapist    high school
68                                               Energy manager    high school
69                                        Editor, commissioning    high school
70                                Advertising account executive    high school
71                                     Surveyor, land/geomatics    high school
72                                        Exercise physiologist    high school
73                                                 Risk manager   trade school
74                                                 Risk manager    high school
75                                              Games developer    high school
76                                                  Illustrator     university
77                                Speech and language therapist           <NA>
78                                                       Gaffer    high school
79                                             Heritage manager    high school
80                                            Buyer, industrial    high school
81                                                 Psychiatrist     university
82                                         Public house manager     university
83                             Production assistant, television           <NA>
84                                           Wellsite geologist    high school
85                                                       Dealer    high school
86                                       Manufacturing engineer    high school
87                                         Engineer, electrical           <NA>
88                                         Brewing technologist     university
89                                            Social researcher   trade school
90                                  Advertising account planner    high school
91                                         Community pharmacist    high school
92                                       Conservator, furniture           <NA>
93                          Designer, blown glass/stained glass           <NA>
94                       Scientist, product/process development    high school
95                                                       Dealer           <NA>
96                                        Insurance underwriter    high school
97                                                      Curator    high school
98        Armed forces logistics/support/administrative officer     university
99                                      Chief Executive Officer           <NA>
100                                      Special effects artist   trade school
101                                              Quarry manager doctoral title
102                                           Therapist, sports    high school
103                             Chartered management accountant           <NA>
104                                            Graphic designer   trade school
105                                          Professor Emeritus    high school
106                             Runner, broadcasting/film/video    high school
107                                       Forensic psychologist   trade school
108                                        Engineer, electrical           <NA>
109                                         Designer, furniture     university
110                             Engineer, manufacturing systems    high school
111                                             Patent attorney    high school
112                               Research officer, trade union     university
113                                  Museum/gallery conservator    high school
114                              Furniture conservator/restorer    high school
115                                    Television floor manager     university
116                                    Print production planner    high school
117                                            Financial trader    high school
118                                                Estate agent    high school
119                                   Trading standards officer   trade school
120                                     Housing manager/officer     university
121                                        Community pharmacist    high school
122                   Sound technician, broadcasting/film/video    high school
123                                            Catering manager    high school
124                                       Nutritional therapist doctoral title
125                                            Surveyor, mining    high school
126                                Designer, industrial/product           <NA>
127                    Geographical information systems officer    high school
128                                          Furniture designer     university
129                                       Multimedia programmer    high school
130                                       Pharmacist, community doctoral title
131                                                     Curator           <NA>
132                                             Event organiser   trade school
133                                             Energy engineer           <NA>
134                            Armed forces operational officer           <NA>
135                             Geophysicist/field seismologist    high school
136                           Sales promotion account executive           <NA>
137                    Conservation officer, historic buildings    high school
138                                        Engineer, electrical   trade school
139                                                Statistician doctoral title
140                             Sport and exercise psychologist           <NA>
141                                       Pharmacist, community    high school
142                                              Water engineer    high school
143                                              Data scientist     university
144                                   Commercial horticulturist    high school
145                                  Horticulturist, commercial           <NA>
146                                                   Homeopath    high school
147                                           Minerals surveyor   trade school
148                                                Cartographer   trade school
149                                      Programmer, multimedia           <NA>
150                                                  IT trainer   trade school
151                             Commercial/residential surveyor    high school
152                                              Water engineer    high school
153                                            Insurance broker     university
154                          Museum/gallery exhibitions officer           <NA>
155                                           Ceramics designer    high school
156                                             Camera operator    high school
157                                                   Paramedic     university
158                                      Fitness centre manager           <NA>
159                                                Immunologist    high school
160                                     Chief Executive Officer    high school
161                                          Purchasing manager           <NA>
162                                        Pharmacist, hospital     university
163                                             Physiotherapist    high school
164                                           Market researcher    high school
165                                         Marketing executive    high school
166                                     Horticulturist, amenity     university
167                                          Jewellery designer           <NA>
168                                                        Make    high school
169                                       Child psychotherapist     university
170                               Interior and spatial designer    high school
171                           Environmental health practitioner    high school
172                                 Health and safety inspector    high school
173                                        Broadcast journalist           <NA>
174                                              Health visitor    high school
175                                                      Dancer    high school
176                                               Lexicographer    high school
177                                           Psychiatric nurse    high school
178                                        Newspaper journalist     university
179                                  Research scientist (maths)    high school
180                               Restaurant manager, fast food   trade school
181                                           Software engineer           <NA>
182                                        Engineer, electrical    high school
183                                 Advertising account planner    high school
184                                               Ranger/warden           <NA>
185 Scientist, clinical (histocompatibility and immunogenetics)           <NA>
186                                            Health physicist     university
187                                      Special effects artist     university
188                                         Hospital pharmacist     university
189                                Medical sales representative           <NA>
190                                    Technical sales engineer     university
191                                     Engineer, manufacturing    high school
192                            Surveyor, commercial/residential   trade school
193                                    Scientist, water quality           <NA>
194                                  Museum/gallery conservator    high school
195                                      Psychologist, forensic    high school
196                                        Optician, dispensing    high school
197                                                  Translator           <NA>
198                                          Secretary, company           <NA>
199                                                   Economist     university
200                                         Marketing executive    high school
    pol_immigration pol_environment pol_redistribution pol_eu_integration
1                 5               4                  5                  2
2                 3               5                  3                  4
3                 2               1                  2                  5
4                 2               3                  5                  1
5                 2               4                  3                  3
6                 2               5                  5                  4
7                 2               3                  3                  2
8                 5               5                  4                  5
9                 1               3                  1                  3
10                1               3                  3                  5
11                4               4                  2                  2
12                5               4                  2                  3
13                3               2                  3                  2
14                3               2                  5                  3
15                5               1                  2                  3
16                4               5                  5                  5
17                2               3                  4                  3
18                1               2                  1                  1
19                1               2                  4                  4
20                3               2                  3                  1
21                5               3                  5                  1
22                4               2                  4                  4
23                1               2                  1                  3
24                5               5                  4                  1
25                5               5                  3                  4
26                2               2                  2                  3
27                4               2                  2                  4
28                5               1                  1                  1
29                2               2                  3                  2
30                3               3                  5                  3
31                3               2                  4                  1
32                1               1                  5                  2
33                5               3                  5                  1
34                3               2                  3                  4
35                4               1                  1                  5
36                3               3                  3                  4
37                1               4                  4                  5
38                2               1                  5                  5
39                1               2                  1                  3
40                4               1                  5                  5
41                4               2                  1                  3
42                3               3                  3                  4
43                5               2                  4                  2
44                2               3                  4                  5
45                2               2                  3                  5
46                4               5                  4                  3
47                2               1                  3                  1
48                3               4                  4                  2
49                5               4                  2                  3
50                4               5                  5                  2
51                5               3                  1                  1
52                4               1                  4                  1
53                3               3                  2                  3
54                3               5                  2                  5
55                5               2                  5                  2
56                5               3                  5                  2
57                3               1                  1                  4
58                1               5                  4                  4
59                1               4                  1                  4
60                5               2                  5                  3
61                4               2                  3                  1
62                1               5                  4                  4
63                4               2                  3                  2
64                5               5                  5                  2
65                5               1                  3                  5
66                2               2                  1                  5
67                1               3                  3                  5
68                2               3                  3                  3
69                3               5                  5                  5
70                4               4                  2                  4
71                1               3                  5                  4
72                2               5                  1                  3
73                1               3                  1                  5
74                2               5                  5                  3
75                5               4                  5                  2
76                5               1                  5                  2
77                5               3                  5                  2
78                2               1                  1                  4
79                5               3                  2                  3
80                3               3                  4                  5
81                3               4                  1                  2
82                5               3                  3                  5
83                2               3                  5                  4
84                3               3                  5                  1
85                4               5                  1                  3
86                1               3                  5                  4
87                2               3                  3                  1
88                2               1                  2                  3
89                4               1                  2                  1
90                1               3                  3                  5
91                4               4                  4                  3
92                3               3                  2                  5
93                4               1                  4                  3
94                3               3                  3                  2
95                1               5                  2                  2
96                2               3                  1                  5
97                4               5                  5                  2
98                4               5                  5                  4
99                3               2                  2                  2
100               2               5                  2                  2
101               5               4                  2                  5
102               2               5                  4                  2
103               2               3                  2                  3
104               5               4                  3                  1
105               2               3                  4                  1
106               1               4                  3                  1
107               3               5                  4                  3
108               5               2                  2                  1
109               3               4                  2                  1
110               2               1                  2                  5
111               2               5                  2                  1
112               1               3                  2                  4
113               4               5                  4                  4
114               3               3                  1                  5
115               3               5                  5                  4
116               2               3                  3                  4
117               4               4                  4                  1
118               1               4                  4                  4
119               1               4                  3                  5
120               1               5                  1                  1
121               2               1                  5                  2
122               2               5                  2                  2
123               1               1                  5                  5
124               1               1                  4                  2
125               3               5                  4                  3
126               5               3                  1                  5
127               4               3                  3                  3
128               2               4                  5                  4
129               1               2                  2                  3
130               2               5                  4                  3
131               4               3                  1                  1
132               3               5                  5                  1
133               2               3                  3                  2
134               2               5                  3                  4
135               3               4                  2                  1
136               5               1                  2                  5
137               3               2                  4                  1
138               1               4                  5                  4
139               5               5                  1                  1
140               4               5                  2                  3
141               3               1                  1                  5
142               5               2                  1                  3
143               4               1                  5                  1
144               5               4                  4                  5
145               4               1                  2                  1
146               4               4                  4                  1
147               2               4                  3                  2
148               1               1                  2                  2
149               3               2                  4                  4
150               4               4                  3                  2
151               1               3                  4                  3
152               1               2                  4                  2
153               2               4                  2                  5
154               3               5                  5                  2
155               4               4                  1                  4
156               1               3                  1                  3
157               4               2                  2                  3
158               5               2                  3                  4
159               1               2                  4                  2
160               1               2                  3                  3
161               1               5                  3                  1
162               5               1                  1                  1
163               3               5                  1                  5
164               2               3                  2                  1
165               4               4                  3                  1
166               2               5                  3                  5
167               1               2                  1                  2
168               2               1                  2                  3
169               1               3                  5                  5
170               1               1                  4                  3
171               2               4                  4                  5
172               1               2                  5                  1
173               2               1                  1                  5
174               2               5                  4                  5
175               5               2                  1                  3
176               2               1                  4                  2
177               5               4                  3                  2
178               1               4                  2                  5
179               3               4                  2                  5
180               5               2                  1                  2
181               3               4                  2                  1
182               1               3                  5                  5
183               1               3                  4                  4
184               3               3                  1                  2
185               1               3                  3                  2
186               5               4                  3                  5
187               1               3                  4                  1
188               1               5                  1                  4
189               2               1                  4                  5
190               5               3                  4                  4
191               4               1                  1                  3
192               3               5                  4                  4
193               1               1                  5                  1
194               4               2                  3                  3
195               4               1                  1                  2
196               5               5                  4                  5
197               5               3                  3                  4
198               2               5                  1                  5
199               4               3                  3                  5
200               5               5                  5                  5

Next, I generalize rare religion categories using fct_lump_min() from forcats. Because religion is not a keyVar in the sdcObject, I apply this directly to the dataset.

# Inspect counts first
table(data_nonpert$religion)

         Buddhism       Catholicism Eastern Orthodoxy             Islam 
                3                56                 1                13 
          Judaism              None     Protestantism 
                2                75                50 

data_nonpert <- data_nonpert %>%
  mutate(religion = fct_lump_min(as.factor(religion), min = 10, other_level = "Other"))

table(data_nonpert$religion)

  Catholicism         Islam          None Protestantism         Other 
           56            13            75            50             6 

fct_lump_min(religion, min = 10) merges every category with fewer than 10 observations into "Other". Groups such as “Judaism” or “Buddhism”, which have very few members in this dataset, are natural candidates for collapsing, because a person could be identified through the combination of a rare religion and other attributes.
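To make the lumping logic concrete, here is a minimal base-R sketch that rebuilds the religion counts printed above and collapses rare levels by hand. The helper lump_rare() and the toy vector are assumptions made for illustration; this is not how forcats implements fct_lump_min(), only the same idea written out.

```r
# Hypothetical helper: collapse levels with fewer than `min` observations.
# A sketch of the idea behind fct_lump_min(), not the forcats implementation.
lump_rare <- function(x, min, other_level = "Other") {
  x <- as.character(x)
  counts <- table(x)
  rare <- names(counts)[counts < min]
  x[x %in% rare] <- other_level
  kept <- setdiff(names(counts), rare)
  factor(x, levels = c(kept, if (length(rare) > 0) other_level))
}

# Toy vector rebuilt from the counts shown in the table above
religion <- rep(
  c("Buddhism", "Catholicism", "Eastern Orthodoxy", "Islam",
    "Judaism", "None", "Protestantism"),
  times = c(3, 56, 1, 13, 2, 75, 50)
)

lumped <- lump_rare(religion, min = 10)
counts_after <- table(lumped)

# Levels that are still below the threshold after lumping
still_small <- names(counts_after)[counts_after < 10]
still_small
```

With these counts, the collapsed "Other" group has 6 members and is therefore itself below the threshold of 10. Whether that is acceptable depends on the risk assessment; checking the counts after lumping makes the remaining small group visible.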

Finally, I also save the sdcObject.

saveRDS(sdc_nonpert, "../sdc_nonpert.rds")
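A saved object can be read back with readRDS(). The sketch below shows the round trip on a hypothetical stand-in list and a temporary file path, both of which are assumptions for illustration; in the tutorial the object would be sdc_nonpert and the path the one used above.

```r
# Hypothetical stand-in object; in the tutorial this would be sdc_nonpert
obj <- list(keyVars = c("age", "gender"), n = 200L)

# Temporary path, used here only so the example is self-contained
path <- tempfile(fileext = ".rds")
saveRDS(obj, path)

# Read the object back and confirm nothing changed in the round trip
restored <- readRDS(path)
identical(obj, restored)
```

Because the religion recoding was applied to data_nonpert rather than to the sdcObject, the data frame would presumably need to be saved in the same way if the recoded column should persist and is not stored elsewhere.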

Learning Objectives

After completing this part of the tutorial, you will be able to:

  • choose an appropriate non-perturbative technique
  • apply simple non-perturbative techniques

Exercises

  • Choosing and applying non-perturbative techniques: given a dataset with some demographics, anonymize age, gender, country of residence, e-mail addresses, etc., choosing an appropriate technique for each variable. Example solution to be shown, with an explanation of the decisions and an emphasis that there is no one right answer.
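
One possible approach to an exercise of this kind can be sketched in base R. The data frame, its column names, the age bands, and the threshold of 2 are all made up for illustration, and other choices would be equally defensible.

```r
# Hypothetical toy data; all column names and values are made up
participants <- data.frame(
  age     = c(23, 37, 41, 58, 62, 29, 35, 44),
  gender  = c("f", "m", "f", "nb", "m", "f", "m", "f"),
  country = c("Germany", "Germany", "France", "Nepal",
              "Germany", "France", "Germany", "France"),
  email   = paste0("person", 1:8, "@example.org"),
  stringsAsFactors = FALSE
)

anonymized <- participants

# Direct identifier: delete the column entirely
anonymized$email <- NULL

# Global recoding: age in years becomes ten-year bands for every row
anonymized$age <- cut(
  anonymized$age,
  breaks = seq(20, 70, by = 10),
  right  = FALSE,
  labels = c("20-29", "30-39", "40-49", "50-59", "60-69")
)

# Rare categories: collapse countries with fewer than 2 participants
country_counts <- table(anonymized$country)
rare <- names(country_counts)[country_counts < 2]
anonymized$country[anonymized$country %in% rare] <- "Other"

anonymized
```

Gender is left unchanged in this sketch, although the single "nb" value is unique in the toy data; whether to suppress or recode it is exactly the kind of judgment call the exercise is meant to practice.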

References

Carvalho, Tânia, Nuno Moniz, Pedro Faria, and Luís Antunes. 2023. “Survey on Privacy-Preserving Techniques for Microdata Publication.” ACM Computing Surveys 55 (14s): 1–42. https://doi.org/10.1145/3588765.
Raghunathan, Balaji. 2013. The Complete Book of Data Anonymization: From Planning to Implementation. CRC Press.