Non-perturbative techniques reduce re-identification risk without distorting the underlying values - they either remove information or make it less specific. What stays in the dataset is still true; there is just less of it. This is what sets them apart from perturbative techniques, which add noise or swap values to obscure the original data.
Non-perturbative techniques include all techniques referred to as data masking, suppression, and deletion.
Non-perturbative techniques are usually the right starting point. They are easy to explain, easy to document, and easy to verify. For many research datasets, they are enough on their own.
Depending on the specific data type and risk, different techniques are possible (Carvalho et al. 2023):
Examples of Non-Perturbative Techniques
Global Recoding
Global recoding (also called generalization) replaces the exact values of a variable with broader categories - applied to every row in the dataset.
Example: Replace specific countries with world regions (e.g., “Germany”, “France” → “Western Europe”), or recode age in years to age in decades.
The key decision is how broadly to generalize. Wider categories reduce risk more, but also lose more information. The right granularity depends on how many people share each combination - which is exactly what k-anonymity measures. You will practice this trade-off in the exercise below.
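A minimal sketch of global recoding in base R, using a small made-up data frame (the `toy` data, the column names, and the `region_lookup` table are assumptions for illustration, not part of the course dataset):

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  age     = c(18, 24, 31, 39, 45, 58, 63, 71),
  country = c("Germany", "France", "Germany", "Belgium",
              "France", "Germany", "Belgium", "France"),
  stringsAsFactors = FALSE
)

# Global recoding of age: every row is assigned to a band.
# cut() uses right-closed intervals by default, so (17, 29] covers 18 to 29.
toy$age_band <- cut(
  toy$age,
  breaks = c(17, 29, 39, 49, 59, Inf),
  labels = c("18-29", "30-39", "40-49", "50-59", "60+")
)

# Global recoding of country: every value is replaced by its region
# (hypothetical lookup table).
region_lookup <- c(Germany = "Western Europe",
                   France  = "Western Europe",
                   Belgium = "Western Europe")
toy$region <- unname(region_lookup[toy$country])

# Inspect the resulting group sizes
table(toy$age_band)
```

Checking the group sizes with `table()` after each recoding step is a simple way to see whether the chosen granularity leaves any very small categories.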
Local Recoding
Local recoding generalizes values only where needed, rather than across the whole dataset. This is useful when only a small subset of values is rare enough to create risk.
Example: If only a few participants are from Asia, you may recode only their values (e.g., “China”, “India”, “Nepal”) to “Asia”, keeping all others unchanged.
This means that values within the same variable are on different levels (in this example: country vs. continent). Compared to global recoding, local recoding is more targeted and loses less information, but it can be harder to justify and document consistently - you need to be transparent about which categories were collapsed and why.
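The country example can be sketched in base R as follows. The toy data, the `asia` vector, and the cut-off `min_n` are assumptions chosen for illustration:

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  country = c("Germany", "Germany", "Germany", "France", "France",
              "France", "China", "India", "Nepal"),
  stringsAsFactors = FALSE
)

# Countries that may be collapsed into the broader label "Asia"
asia <- c("China", "India", "Nepal")

# Count how often each country occurs
n_per_country <- table(toy$country)

# Recode only countries that are rare (fewer than min_n participants)
# and that belong to the group we decided to collapse
min_n <- 3
rare  <- names(n_per_country)[n_per_country < min_n]
toy$country_recoded <- ifelse(
  toy$country %in% intersect(rare, asia),
  "Asia",
  toy$country
)

table(toy$country_recoded)
```

Keeping the collapse rule in one place (here `asia` and `min_n`) makes it easier to document later which categories were merged and why.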
Top-and-Bottom Coding
Top-and-bottom coding is a special case of recoding that applies to numerical variables. Instead of grouping all values into bands, you cap the extreme ends of the distribution - replacing any value above a threshold with that threshold (top coding), or below it (bottom coding). This is useful when the bulk of your data is unremarkable, but a few unusual values are risky because they belong to a small number of people. In our dataset, very high incomes or very long job tenures might apply to only one or two participants, making them easy to single out. Capping income at the 95th percentile, for example, means the highest earners all share the same reported value.
Example: Instead of reporting “6 children”, report “4+ children” for anyone at or above this threshold; code all participants below the age of 20 years as “<20 years”.
How to determine the threshold: The threshold can be calculated, but this is not trivial. A common approach is to look at the distribution of the variable and choose a threshold where the remaining values above (or below) it are too few to guarantee anonymity. For example, if only 3 people in your dataset are older than 80, you might top-code at 80. You can also use percentiles - capping at the 95th or 99th percentile is a common starting point. The right threshold depends on your data and the level of k-anonymity you want to achieve.
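A minimal base-R sketch of both steps: first counting how many values lie beyond a candidate threshold, then capping. The `income` and `age` vectors are made up for illustration:

```r
# Toy values (hypothetical, for illustration only)
income <- c(12000, 25000, 31000, 38000, 42000,
            47000, 55000, 61000, 90000, 450000)
age    <- c(17, 19, 22, 35, 41)

# Step 1: how many values lie above a candidate threshold?
n_above <- sum(income > 80000)

# Step 2: top coding at the 95th percentile.
# pmin() replaces every value above the cap with the cap itself.
cap        <- quantile(income, probs = 0.95, names = FALSE)
income_top <- pmin(income, cap)

# Bottom coding works the same way with pmax():
# every age below 20 is raised to 20.
age_bottom <- pmax(age, 20)
```

With very small samples, a percentile-based cap may still leave a single person at the capped value, so it is worth counting how many records share the cap afterward (e.g., `sum(income_top == cap)`).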
Suppression
Suppression means removing information entirely - replacing a value with a missing value (NA) or dropping it altogether. This is also known as nulling. It is the most straightforward way to handle data that is too risky to release, but it also loses the most information.
Suppression can occur on different levels:
Cell suppression: Remove a single risky value while keeping the rest of the record. For example, if one participant has a unique job title that makes them identifiable, you can set only that cell to NA.
Record suppression: Remove an entire participant’s data if their combination of attributes is so unique that no other technique can adequately protect them.
Variable suppression: Drop a whole variable from the released dataset if it is too risky to include at all. Direct identifiers like name and email address are a clear case - we already did this in Chapter 1.3.
Cell suppression can also follow a specific condition: conditional nulling means replacing data entries with null values based on a condition. For example, entries in the column “Customer Feedback” could be nullified if the customer feedback was negative (Raghunathan 2013).
A related technique is character masking, for example replacing the digits of a credit card number with X. Masking can be partial: here only the first 12 digits are replaced, and the overall number of characters stays the same, giving XXXX XXXX XXXX 1234.
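Partial character masking could be sketched like this in base R. The helper `mask_card()` and the card number are hypothetical and serve only as an illustration:

```r
# Replace all digits except the last `keep` digits with "X".
# Non-digit characters (such as spaces) are left untouched,
# so the overall number of characters stays the same.
mask_card <- function(x, keep = 4) {
  chars     <- strsplit(x, "")[[1]]
  digit_pos <- which(grepl("[0-9]", chars))
  n_mask    <- max(length(digit_pos) - keep, 0)
  chars[digit_pos[seq_len(n_mask)]] <- "X"
  paste(chars, collapse = "")
}

masked <- mask_card("1234 5678 9012 1234")  # made-up number
masked
```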
Suppression is often used as a last resort after recoding: if recoding alone is not enough to bring all records up to the required k-anonymity threshold, targeted cell suppression can close the remaining gaps.
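The three levels of suppression can be illustrated on a small made-up data frame (all names and values are hypothetical; which row to drop in the last step is an arbitrary choice for the sketch):

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  name      = c("A. Example", "B. Example", "C. Example", "D. Example"),
  gender    = c("female", "male", "female", "male"),
  job_title = c("Teacher", "Teacher", "Herbalist", "Teacher"),
  income    = c(40000, 42000, 39000, 41000),
  stringsAsFactors = FALSE
)

# Variable suppression: drop the direct identifier entirely
released <- toy[, setdiff(names(toy), "name")]

# Cell suppression: set job titles that occur only once to NA,
# keeping the rest of the record
n_title       <- table(released$job_title)
unique_titles <- names(n_title)[n_title == 1]
released$job_title[released$job_title %in% unique_titles] <- NA

# Record suppression: drop an entire row (here, arbitrarily, row 4)
released_records <- released[-4, ]
```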
Sampling
If your dataset covers an entire population - for example, all students enrolled in a particular program, or all employees of a specific company - then releasing it in full means that anyone who knows a person was part of that group can try to find their record. Releasing only a random sample of the data breaks this assumption: an attacker can no longer be certain whether a given individual is even in the released dataset.
A sample can be randomly selected, but also based on conditions (e.g., quotas for study subjects).
Sampling is most relevant for administrative datasets and registers. For typical survey-based research, where participation is voluntary and not exhaustively documented, it is less commonly needed. If you do use it, keep in mind that the sample size affects both privacy and the statistical representativeness of the data.
Sampling can also be quite effective when an attacker knows that a specific person participated in the study.
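A simple random sample can be drawn in base R as sketched below. The `population` data frame and the sampling fraction of 50% are assumptions for illustration:

```r
set.seed(2024)  # fixed seed so that the sketch is reproducible

# Toy "full population" dataset (hypothetical)
population <- data.frame(id = 1:200, score = rnorm(200))

# Draw a simple random sample without replacement
sampling_fraction <- 0.5
keep     <- sample(nrow(population),
                   size = round(sampling_fraction * nrow(population)))
released <- population[sort(keep), ]

nrow(released)
```

A smaller sampling fraction increases the attacker's uncertainty but leaves fewer observations for analysis, which is the trade-off described above.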
Pros and Cons of Using Non-Perturbative Techniques
Pro:
original data values are preserved (no distortion of statistics within remaining cells)
straightforward to implement and explain
easy to document
Con:
information is lost (removed or coarsened)
heavy suppression can make the dataset hard to use
may not be sufficient on its own for high-risk data
Applying Non-Perturbative Techniques With R
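Non-perturbative techniques can be applied with basic data wrangling operations in R or with functions from sdcMicro. Either is better than changing values by hand.
Keep in mind that if you publish the anonymization script, the script itself must not reveal what was removed (e.g., it should not name the specific countries that were recoded; see Chapter DOCUMENTATION).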
Exercise: Applying Non-Perturbative Techniques
In this exercise, you will apply non-perturbative techniques to the data to reduce re-identification risk.
The following sdcMicro functions are available to you:
globalRecode(): recodes a key variable into broader groups (e.g., exact age → age band)
topBotCoding(): caps extreme values at a given threshold
localSuppression(): suppresses individual cells in records that fall below a chosen k-anonymity threshold. Be careful when using this function: it does a good job of finding individuals whose records are still unique, but it does not report its changes transparently and it introduces missing values. As an alternative, you can recode data manually, e.g., with mutate() or with a function such as fct_lump_min() from forcats, which lumps together all categories below a certain group size.
Look at the key variables in sdc_data (gender, age, education, plz) and decide which technique is appropriate for each. There is no single correct solution - choose thresholds and groupings that make sense given the data, and be prepared to justify your choices.
Apply these techniques using the functions mentioned above.
Compare the new risk summary to the one from the previous exercise.
Tip
The sdcMicro functions operate directly on the sdc object and update risk estimates automatically. Check ?globalRecode, ?topBotCoding, and ?localSuppression for argument details.
The forcats function fct_lump_min() is applied to the dataset itself, so the sdcObject needs to be updated afterward.
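To make the lumping idea concrete, here is a base-R sketch of what lumping small categories does. The helper `lump_min()` is hypothetical and only mimics the behavior described for fct_lump_min(); it is not the forcats implementation, and the `education` values are made up:

```r
# Lump all categories that occur fewer than `min_n` times into `other`.
lump_min <- function(x, min_n, other = "Other") {
  x      <- as.character(x)
  counts <- table(x)
  small  <- names(counts)[counts < min_n]
  x[x %in% small] <- other
  factor(x)
}

# Toy values (hypothetical, for illustration only)
education <- c(rep("high school", 6), rep("university", 5),
               "doctorate", "no degree")

lumped <- lump_min(education, min_n = 3)
table(lumped)
```

Unlike suppression, lumping keeps every record complete: rare values end up in a shared category instead of becoming missing.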
Solution
Keep in mind that this is just an example solution - there are many ways to achieve similar levels of privacy.
I started by applying global recoding to age:
sdc_nonpert <- globalRecode(
  obj = sdc_data,
  column = "age",
  breaks = c(17, 29, 39, 49, 59, Inf), # Define breaks
  labels = c("18-29", "30-39", "40-49", "50-59", "60+")
)
sdc_nonpert # check summary after making the change
The input dataset consists of 200 rows and 14 variables.
--> Categorical key variables: gender, age, education, plz
--> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:
Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
Key Variable   Number of categories   Mean size          Size of smallest (>0)
gender         3 (3)                  66.667 (66.667)    10 (10)
age            5 (52)                 40.000 (3.846)     23 (1)
education      5 (5)                  40.000 (40.000)    9 (9)
plz            152 (152)              1.316 (1.316)      1 (1)
Infos on 2/3-Anonymity:
Number of observations violating
- 2-anonymity: 180 (90.000%) | in original data: 198 (99.000%)
- 3-anonymity: 200 (100.000%) | in original data: 200 (100.000%)
- 5-anonymity: 200 (100.000%) | in original data: 200 (100.000%)
----------------------------------------------------------------------
Numerical key variables: income, years_in_job
Disclosure risk (~100.00% in original data):
modified data: [0.00%; 100.00%]
Current Information Loss in modified data (0.00% in original data):
IL1: 0.00
Difference of Eigenvalues: 0.000%
----------------------------------------------------------------------
print(sdc_nonpert, type = "risk") # check risk after making the change
Risk measures:
Number of observations with higher risk than the main part of the data:
in modified data: 0
in original data: 0
Expected number of re-identifications:
in modified data: 190.00 (95.00 %)
in original data: 199.00 (99.50 %)
I started by dividing age into five bands of roughly ten years each. This gives a mean of 40 persons per band, and the smallest band still contains 23. These breaks can be changed later if more privacy or more utility is needed.
We see that the k-anonymity values have barely changed after recoding age: 180 records still violate 2-anonymity, and all 200 still violate 3-anonymity. This is primarily due to the plz variable, which is unique for most participants. I therefore decided to generalize the postal code.
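Why a fine-grained variable such as plz dominates the result can be seen by counting group sizes by hand. The sketch below uses made-up data and a hypothetical helper `group_size()`; it illustrates the idea behind k-anonymity and is not how sdcMicro computes its risk measures internally:

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  gender = c("f", "f", "m", "m", "f", "m"),
  age    = c("18-29", "18-29", "30-39", "30-39", "18-29", "30-39"),
  plz    = c("80331", "80333", "10115", "10115", "80331", "50667"),
  stringsAsFactors = FALSE
)

# For each record, count how many records share its combination
# of key variables.
group_size <- function(data, keys) {
  key <- do.call(paste, c(unname(as.list(data[keys])), sep = "|"))
  as.vector(table(key)[key])
}

fk_with_plz    <- group_size(toy, c("gender", "age", "plz"))
fk_without_plz <- group_size(toy, c("gender", "age"))

# Number of records violating 2-anonymity in each case
sum(fk_with_plz < 2)
sum(fk_without_plz < 2)
```

As long as one key variable is nearly unique per person, recoding the other key variables cannot enlarge the groups much.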
As a first step, I used globalRecode() to group postal codes into five broad regions:
The input dataset consists of 200 rows and 14 variables.
--> Categorical key variables: gender, age, education, plz
--> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:
Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
Key Variable   Number of categories   Mean size          Size of smallest (>0)
gender         3 (3)                  66.667 (66.667)    10 (10)
age            5 (52)                 40.000 (3.846)     23 (1)
education      5 (5)                  40.000 (40.000)    9 (9)
plz            5 (152)                40.000 (1.316)     21 (1)
Infos on 2/3-Anonymity:
Number of observations violating
- 2-anonymity: 65 (32.500%) | in original data: 198 (99.000%)
- 3-anonymity: 105 (52.500%) | in original data: 200 (100.000%)
- 5-anonymity: 149 (74.500%) | in original data: 200 (100.000%)
----------------------------------------------------------------------
Numerical key variables: income, years_in_job
Disclosure risk (~100.00% in original data):
modified data: [0.00%; 100.00%]
Current Information Loss in modified data (0.00% in original data):
IL1: 0.00
Difference of Eigenvalues: 0.000%
----------------------------------------------------------------------
print(sdc_nonpert, type = "risk")
Risk measures:
Number of observations with higher risk than the main part of the data:
in modified data: 0
in original data: 0
Expected number of re-identifications:
in modified data: 107.00 (53.50 %)
in original data: 199.00 (99.50 %)
The new categories map loosely to north-east (00-19), north-central (20-39), west (40-59), south-west (60-79), and south-east (80-99) Germany.
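One way to obtain such two-digit prefix groups is sketched below in base R. It assumes that postal codes are stored as five-character strings (so that leading zeros are preserved); the example codes are arbitrary, and this is not necessarily the call that produced the output above:

```r
# Example postal codes as strings (assumption: leading zeros are kept)
plz <- c("01067", "10115", "20095", "50667", "60311", "80331", "90402")

# Take the first two digits and bin them into five prefix ranges.
# The lower break of -1 makes sure that prefix 00 falls into the first bin.
prefix     <- as.integer(substr(plz, 1, 2))
plz_region <- cut(
  prefix,
  breaks = c(-1, 19, 39, 59, 79, 99),
  labels = c("00-19", "20-39", "40-59", "60-79", "80-99")
)

table(plz_region)
```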
Looking at k-anonymity now, we see that there are still many unique observations. To reduce risk further, I will generalize plz even more. For this dataset, I mostly want to use plz to distinguish whether participants live in the Munich area or elsewhere in Germany. Postal codes starting with 8 cover roughly the areas surrounding Munich, so I will collapse plz into just two categories: "8xxxx" and "other".
globalRecode() cannot do this directly since cut() only handles contiguous intervals - and “other” spans two separate ranges (00000-79999 and 90000-99999). Instead, I directly modify the @manipKeyVars slot of the sdcObject and recalculate risks:
library(forcats)
sdc_nonpert@manipKeyVars$plz <- fct_collapse(
  sdc_nonpert@manipKeyVars$plz,
  "8xxxx" = c("80-99"),
  "other" = c("00-19", "20-39", "40-59", "60-79")
)
sdc_nonpert <- calcRisks(sdc_nonpert) # After these manual changes, you have to update the risks on the sdcObject
sdc_nonpert
The input dataset consists of 200 rows and 14 variables.
--> Categorical key variables: gender, age, education, plz
--> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:
Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
Key Variable   Number of categories   Mean size          Size of smallest (>0)
gender         3 (3)                  66.667 (66.667)    10 (10)
age            5 (52)                 40.000 (3.846)     23 (1)
education      5 (5)                  40.000 (40.000)    9 (9)
plz            2 (152)                100.000 (1.316)    91 (1)
Infos on 2/3-Anonymity:
Number of observations violating
- 2-anonymity: 32 (16.000%) | in original data: 198 (99.000%)
- 3-anonymity: 60 (30.000%) | in original data: 200 (100.000%)
- 5-anonymity: 105 (52.500%) | in original data: 200 (100.000%)
----------------------------------------------------------------------
Numerical key variables: income, years_in_job
Disclosure risk (~100.00% in original data):
modified data: [0.00%; 100.00%]
Current Information Loss in modified data (0.00% in original data):
IL1: 0.00
Difference of Eigenvalues: 0.000%
----------------------------------------------------------------------
print(sdc_nonpert, type = "risk")
Risk measures:
Number of observations with higher risk than the main part of the data:
in modified data: 32
in original data: 0
Expected number of re-identifications:
in modified data: 74.00 (37.00 %)
in original data: 199.00 (99.50 %)
Currently, 32 persons in our dataset are still unique (they violate 2-anonymity), and 60 persons violate 3-anonymity. I do not want to generalize the key variables any further, so I decided to apply local suppression.
sdc_nonpert <- localSuppression(
  sdc_nonpert,
  k = 2, # define wanted k-anonymity level, k = 2 is the default
  importance = c(2, 1, 4, 3) # this defines the order of importance; the algorithm begins suppressing on the variable marked as 4
)
sdc_nonpert
The input dataset consists of 200 rows and 14 variables.
--> Categorical key variables: gender, age, education, plz
--> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:
Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
Key Variable   Number of categories   Mean size          Size of smallest (>0)
gender         4 (3)                  66.000 (66.667)    8 (10)
age            5 (52)                 40.000 (3.846)     23 (1)
education      6 (5)                  34.400 (40.000)    2 (9)
plz            3 (152)                99.000 (1.316)     90 (1)
Infos on 2/3-Anonymity:
Number of observations violating
- 2-anonymity: 0 (0.000%) | in original data: 198 (99.000%)
- 3-anonymity: 16 (8.000%) | in original data: 200 (100.000%)
- 5-anonymity: 59 (29.500%) | in original data: 200 (100.000%)
----------------------------------------------------------------------
Numerical key variables: income, years_in_job
Disclosure risk (~100.00% in original data):
modified data: [0.00%; 100.00%]
Current Information Loss in modified data (0.00% in original data):
IL1: 0.00
Difference of Eigenvalues: 0.000%
----------------------------------------------------------------------
Risk measures:
Number of observations with higher risk than the main part of the data:
in modified data: 16
in original data: 0
Expected number of re-identifications:
in modified data: 40.23 (20.12 %)
in original data: 199.00 (99.50 %)
The summary of the sdcObject shows us what the last local suppression step has achieved: It suppressed 28 values on the education variable and 2 each on gender and postal code. Now, 2-anonymity is reached for all participants. 16 entries violate 3-anonymity; I will try local suppression at k = 3 to see how much more suppression would be necessary to achieve this.
sdc_nonpert <- localSuppression(
  sdc_nonpert,
  k = 3, # define wanted k-anonymity level, k = 2 is the default
  importance = c(2, 1, 4, 3) # this defines the order of importance; the algorithm begins suppressing on the variable marked as 4
)
sdc_nonpert
The input dataset consists of 200 rows and 14 variables.
--> Categorical key variables: gender, age, education, plz
--> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:
Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
Key Variable   Number of categories   Mean size          Size of smallest (>0)
gender         4 (3)                  64.667 (66.667)    4 (10)
age            5 (52)                 40.000 (3.846)     23 (1)
education      5 (5)                  32.000 (40.000)    5 (9)
plz            3 (152)                99.000 (1.316)     90 (1)
Infos on 2/3-Anonymity:
Number of observations violating
- 2-anonymity: 0 (0.000%) | in original data: 198 (99.000%)
- 3-anonymity: 0 (0.000%) | in original data: 200 (100.000%)
- 5-anonymity: 29 (14.500%) | in original data: 200 (100.000%)
----------------------------------------------------------------------
Numerical key variables: income, years_in_job
Disclosure risk (~100.00% in original data):
modified data: [0.00%; 100.00%]
Current Information Loss in modified data (0.00% in original data):
IL1: 0.00
Difference of Eigenvalues: 0.000%
----------------------------------------------------------------------
Risk measures:
Number of observations with higher risk than the main part of the data:
in modified data: 29
in original data: 0
Expected number of re-identifications:
in modified data: 30.27 (15.13 %)
in original data: 199.00 (99.50 %)
With 12 more suppressions on education and 4 on gender, we can achieve 3-anonymity. Only 29 entries remain that violate 5-anonymity. For me, this is enough protection in terms of k-anonymity.
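Suppression counts like these can also be tallied by hand, by comparing the number of missing values per variable before and after a suppression step. The sketch below uses two made-up data frames, `before` and `after`, standing in for the key variables at those two points:

```r
# Toy key variables before suppression (hypothetical values)
before <- data.frame(
  gender    = c("female", "male", "non-binary", "female", "male"),
  education = c("university", "trade school", "high school",
                "high school", "doctorate"),
  stringsAsFactors = FALSE
)

# The same data after some cells were suppressed (set to NA)
after <- before
after$education[c(2, 5)] <- NA
after$gender[3]          <- NA

# Number of newly suppressed cells per variable
suppressed <- colSums(is.na(after)) - colSums(is.na(before))
suppressed
```

Recording these counts alongside the anonymization script is one way to document what a suppression step changed.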
Now that the key variables are in their final state, I applied top-coding to income and years in job.
My goal with income is to protect the few individuals with a very high income. I start by inspecting the distribution:
I see a few extreme outliers above 400 000 €. Choosing the 95th percentile (89 185.85 €) as the cap cuts these off. I use topBotCoding() to replace every value above the 95th percentile with that value.
The input dataset consists of 200 rows and 14 variables.
--> Categorical key variables: gender, age, education, plz
--> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:
Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
Key Variable   Number of categories   Mean size          Size of smallest (>0)
gender         4 (3)                  64.667 (66.667)    4 (10)
age            5 (52)                 40.000 (3.846)     23 (1)
education      5 (5)                  32.000 (40.000)    5 (9)
plz            3 (152)                99.000 (1.316)     90 (1)
Infos on 2/3-Anonymity:
Number of observations violating
- 2-anonymity: 0 (0.000%) | in original data: 198 (99.000%)
- 3-anonymity: 0 (0.000%) | in original data: 200 (100.000%)
- 5-anonymity: 29 (14.500%) | in original data: 200 (100.000%)
----------------------------------------------------------------------
Numerical key variables: income, years_in_job
Disclosure risk (~100.00% in original data):
modified data: [0.00%; 95.00%]
Current Information Loss in modified data (0.00% in original data):
IL1: 346.87
Difference of Eigenvalues: 6.600%
----------------------------------------------------------------------
Risk measures:
Number of observations with higher risk than the main part of the data:
in modified data: 29
in original data: 0
Expected number of re-identifications:
in modified data: 30.27 (15.13 %)
in original data: 199.00 (99.50 %)
For years in job, I start by familiarizing myself with the data:
The input dataset consists of 200 rows and 14 variables.
--> Categorical key variables: gender, age, education, plz
--> Numerical key variables: income, years_in_job
----------------------------------------------------------------------
Information on categorical key variables:
Reported is the number, mean size and size of the smallest category >0 for recoded variables.
In parenthesis, the same statistics are shown for the unmodified data.
Note: NA (missings) are counted as seperate categories!
Key Variable   Number of categories   Mean size          Size of smallest (>0)
gender         4 (3)                  64.667 (66.667)    4 (10)
age            5 (52)                 40.000 (3.846)     23 (1)
education      5 (5)                  32.000 (40.000)    5 (9)
plz            3 (152)                99.000 (1.316)     90 (1)
Infos on 2/3-Anonymity:
Number of observations violating
- 2-anonymity: 0 (0.000%) | in original data: 198 (99.000%)
- 3-anonymity: 0 (0.000%) | in original data: 200 (100.000%)
- 5-anonymity: 29 (14.500%) | in original data: 200 (100.000%)
----------------------------------------------------------------------
Numerical key variables: income, years_in_job
Disclosure risk (~100.00% in original data):
modified data: [0.00%; 92.00%]
Current Information Loss in modified data (0.00% in original data):
IL1: 495.71
Difference of Eigenvalues: 6.200%
----------------------------------------------------------------------
Risk measures:
Number of observations with higher risk than the main part of the data:
in modified data: 29
in original data: 0
Expected number of re-identifications:
in modified data: 30.27 (15.13 %)
in original data: 199.00 (99.50 %)
These techniques lower the upper bound of the disclosure risk for the numerical key variables to 92%, compared to about 100% in the original data.
Now I can extract the dataset. Because sdcMicro stores recoded key variables as integer codes internally, I need to restore the labels manually using mutate:
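The relabeling step can be sketched as follows. This is a base-R illustration with made-up integer codes, assuming the codes run from 1 to 5 in the order of the age bands defined earlier; inside mutate(), the same factor() call could be used:

```r
# Integer codes as they might appear after extraction (hypothetical values)
codes <- c(1L, 2L, 5L, 5L, 1L)

# Labels in the same order as the breaks used for recoding
age_labels <- c("18-29", "30-39", "40-49", "50-59", "60+")

# Map each integer code back to its label
age <- factor(codes, levels = seq_along(age_labels), labels = age_labels)
age
```

After relabeling, the extracted dataset looks like this: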
X id plz gender age income years_in_job religion
1 1 1 other female 18-29 8471.35 3 Catholicism
2 2 2 other female 30-39 34721.35 4 None
3 3 3 other male 60+ 89185.85 0 None
4 4 4 other female 60+ 59490.47 5 None
5 5 5 8xxxx male 18-29 52337.39 1 Islam
6 6 6 other male 60+ 60960.23 6 Protestantism
7 7 7 other male 30-39 61037.20 11 None
8 8 8 8xxxx female 30-39 58720.93 2 Buddhism
9 9 9 8xxxx male 30-39 55625.01 1 Catholicism
10 10 10 other male 18-29 75996.23 4 Catholicism
11 11 11 <NA> non-binary 40-49 29760.83 6 Catholicism
12 12 12 8xxxx female 30-39 43369.95 9 None
13 13 13 other male 30-39 70224.28 3 None
14 14 14 other female 40-49 43707.07 2 Islam
15 15 15 other female 60+ 50073.94 5 Islam
16 16 16 other female 30-39 53897.71 9 Catholicism
17 17 17 8xxxx male 50-59 47372.98 8 Catholicism
18 18 18 8xxxx female 60+ 53557.83 8 Judaism
19 19 19 other male 30-39 88933.73 4 Islam
20 20 20 8xxxx female 50-59 33516.81 2 None
21 21 21 other female 50-59 60266.05 3 None
22 22 22 other female 18-29 3234.65 11 None
23 23 23 other female 40-49 18647.76 0 Protestantism
24 24 24 other female 40-49 53198.80 0 Catholicism
25 25 25 other male 50-59 66750.89 8 Protestantism
26 26 26 8xxxx female 40-49 33275.84 2 Catholicism
27 27 27 other male 40-49 43974.11 0 Catholicism
28 28 28 other female 30-39 56904.06 2 None
29 29 29 other female 18-29 49785.63 2 Islam
30 30 30 8xxxx female 18-29 65838.27 4 None
31 31 31 8xxxx female 40-49 43510.82 2 Catholicism
32 32 32 other female 18-29 35050.43 11 Protestantism
33 33 33 other male 18-29 45612.60 6 None
34 34 34 8xxxx female 18-29 74064.45 2 Islam
35 35 35 8xxxx female 18-29 19700.02 1 None
36 36 36 other female 18-29 35491.71 1 Protestantism
37 37 37 8xxxx female 18-29 31146.67 3 Protestantism
38 38 38 other female 18-29 58538.33 1 None
39 39 39 8xxxx <NA> 30-39 25565.51 10 None
40 40 40 other male 50-59 53781.80 12 None
41 41 41 8xxxx male 30-39 48621.36 6 None
42 42 42 8xxxx female 50-59 52099.26 1 None
43 43 43 8xxxx male 18-29 80544.39 5 None
44 44 44 8xxxx male 40-49 44352.87 0 None
45 45 45 8xxxx male 18-29 36923.45 5 Catholicism
46 46 46 other female 60+ 89185.85 8 Catholicism
47 47 47 8xxxx male 50-59 82695.22 6 None
48 48 48 8xxxx male 60+ 89185.85 15 Protestantism
49 49 49 8xxxx <NA> 18-29 43699.08 3 Protestantism
50 50 50 8xxxx male 18-29 55256.89 8 Catholicism
51 51 51 8xxxx female 18-29 57280.41 2 Protestantism
52 52 52 other female 40-49 65257.71 15 Catholicism
53 53 53 other male 30-39 73130.44 5 Protestantism
54 54 54 8xxxx female 50-59 77231.91 5 Islam
55 55 55 other non-binary 40-49 89185.85 4 None
56 56 56 8xxxx female 50-59 49240.52 6 None
57 57 57 8xxxx male 18-29 49181.85 1 Protestantism
58 58 58 8xxxx female 40-49 45667.06 11 Catholicism
59 59 59 8xxxx male 18-29 71020.92 3 Catholicism
60 60 60 other female 50-59 17867.79 0 None
61 61 61 other male 30-39 59154.65 7 None
62 62 62 8xxxx male 18-29 34867.09 1 Protestantism
63 63 63 8xxxx male 60+ 89185.85 9 Protestantism
64 64 64 8xxxx female 40-49 6537.27 4 Islam
65 65 65 8xxxx <NA> 18-29 53963.61 5 Protestantism
66 66 66 other male 30-39 53233.39 0 None
67 67 67 other female 60+ 35192.39 0 Catholicism
68 68 68 8xxxx male 40-49 88998.52 7 Protestantism
69 69 69 8xxxx female 30-39 76185.54 15 None
70 70 70 8xxxx male 18-29 15277.69 1 Catholicism
71 71 71 8xxxx female 50-59 32358.41 4 Protestantism
72 72 72 8xxxx female 50-59 59564.29 0 Catholicism
73 73 73 8xxxx male 18-29 44946.10 1 None
74 74 74 8xxxx male 50-59 66701.73 10 Judaism
75 75 75 8xxxx female 30-39 52378.45 13 Protestantism
76 76 76 8xxxx female 30-39 28532.02 1 Protestantism
77 77 77 other male 40-49 56000.53 5 Protestantism
78 78 78 8xxxx male 18-29 69690.09 11 None
79 79 79 8xxxx female 50-59 46431.85 2 Protestantism
80 80 80 8xxxx female 60+ 85894.23 4 None
81 81 81 other female 40-49 59395.77 3 Protestantism
82 82 82 8xxxx female 18-29 50677.57 0 Protestantism
83 83 83 other male 60+ 19523.64 3 Islam
84 84 84 other female 30-39 60402.28 5 Protestantism
85 85 85 8xxxx male 40-49 73762.07 15 None
86 86 86 other male 50-59 57488.83 2 Catholicism
87 87 87 other female 30-39 48387.34 4 Catholicism
88 88 88 other female 40-49 51020.65 15 Catholicism
89 89 89 8xxxx <NA> 30-39 89185.85 8 None
90 90 90 other female 40-49 52973.81 5 Catholicism
91 91 91 8xxxx female 18-29 65059.84 5 Protestantism
92 92 92 other female 40-49 89185.85 5 Catholicism
93 93 93 other female 50-59 60871.39 10 Catholicism
94 94 94 other male 18-29 38100.28 1 None
95 95 95 other female 60+ 1508.25 1 None
96 96 96 8xxxx female 40-49 35839.71 2 Catholicism
97 97 97 8xxxx male 60+ 33524.46 3 None
98 98 98 other female 40-49 89185.85 10 Protestantism
99 99 99 other male 40-49 37535.39 15 None
100 100 100 other male 40-49 70163.08 3 Catholicism
101 101 101 8xxxx male 18-29 44974.20 2 Buddhism
102 102 102 other male 30-39 65006.97 5 Catholicism
103 103 103 8xxxx male 40-49 28030.40 13 Protestantism
104 104 104 8xxxx male 40-49 56143.31 1 Protestantism
105 105 105 8xxxx male 18-29 21741.10 12 Catholicism
106 106 106 other female 18-29 61855.98 1 None
107 107 107 other female 18-29 15326.24 1 None
108 108 108 8xxxx male 60+ 9709.52 5 None
109 109 109 other female 40-49 25659.34 5 Catholicism
110 110 110 other female 50-59 69044.93 3 Catholicism
111 111 111 8xxxx male 30-39 66301.02 3 Protestantism
112 112 112 8xxxx female 30-39 27090.40 2 None
113 113 113 other male 40-49 45324.62 4 None
114 114 114 other male 60+ 50849.11 5 Catholicism
115 115 115 8xxxx female 60+ 30421.29 13 None
116 116 116 other female 40-49 50230.45 3 Protestantism
117 117 117 8xxxx male 18-29 37902.42 6 None
118 118 118 other female 18-29 55949.20 1 Catholicism
119 119 119 other male 40-49 53509.42 0 None
120 120 120 8xxxx female 60+ 57558.17 2 Catholicism
121 121 121 other male 30-39 67904.73 3 Protestantism
122 122 122 other male 18-29 37003.44 1 None
123 123 123 8xxxx female 40-49 74012.32 0 None
124 124 124 8xxxx male 40-49 54693.74 0 None
125 125 125 other female 60+ 36977.03 12 None
126 126 126 other male 30-39 36546.40 2 None
127 127 127 other male 30-39 84045.58 10 Catholicism
128 128 128 other female 18-29 82138.06 1 None
129 129 129 8xxxx female 50-59 30733.31 1 None
130 130 130 8xxxx male 40-49 61673.59 3 Protestantism
131 131 131 8xxxx female 50-59 69536.55 1 Protestantism
132 132 132 8xxxx male 30-39 48347.66 6 Protestantism
133 133 133 8xxxx male 30-39 66190.75 0 Protestantism
134 134 134 other male 60+ 66078.75 11 None
135 135 135 8xxxx female 50-59 18269.69 1 None
136 136 136 other non-binary 40-49 59707.29 1 Protestantism
137 137 137 8xxxx male 40-49 52099.71 15 None
138 138 138 8xxxx female 18-29 34656.89 4 Protestantism
139 139 139 8xxxx male 40-49 34257.44 10 Catholicism
140 140 140 8xxxx female 30-39 24074.20 14 None
141 141 141 8xxxx male 40-49 64152.22 1 Catholicism
142 142 142 other female 40-49 83251.70 2 Protestantism
143 143 143 8xxxx female 40-49 64440.78 13 Protestantism
144 144 144 8xxxx female 18-29 47105.72 9 None
145 145 145 other female 40-49 36879.16 3 Catholicism
146 146 146 other male 30-39 42220.17 15 Protestantism
147 147 147 8xxxx male 30-39 63960.04 0 None
148 148 148 8xxxx male 50-59 3530.08 9 None
149 149 149 other female 30-39 13668.60 4 Islam
150 150 150 other male 40-49 50377.45 7 Protestantism
151 151 151 8xxxx female 40-49 61128.01 3 Catholicism
152 152 152 8xxxx male 18-29 43719.31 4 None
153 153 153 other female 40-49 54589.13 10 None
154 154 154 other female 50-59 19115.90 10 None
155 155 155 8xxxx female 40-49 10024.02 2 Catholicism
156 156 156 other male 18-29 85761.37 7 Catholicism
157 157 157 <NA> <NA> 60+ 82607.20 3 Catholicism
158 158 158 other male 50-59 44383.80 1 None
159 159 159 8xxxx female 30-39 66185.55 15 Catholicism
160 160 160 other male 50-59 46443.35 11 Protestantism
161 161 161 8xxxx female 50-59 41571.20 9 Protestantism
162 162 162 other female 40-49 48002.36 1 Catholicism
163 163 163 8xxxx female 40-49 64303.37 0 Protestantism
164 164 164 8xxxx female 50-59 63828.73 10 Catholicism
165 165 165 other male 18-29 62158.38 1 Islam
166 166 166 8xxxx female 30-39 36173.97 6 None
167 167 167 other female 40-49 22233.87 0 Islam
168 168 168 8xxxx male 18-29 65795.96 2 Catholicism
169 169 169 other female 18-29 50500.79 1 None
170 170 170 other female 40-49 34635.22 6 None
171 171 171 8xxxx female 60+ 51093.90 11 None
172 172 172 8xxxx male 30-39 38140.07 13 None
173 173 173 other non-binary 40-49 83316.86 2 Catholicism
174 174 174 other female 40-49 89185.85 10 None
175 175 175 8xxxx male 40-49 47266.68 13 Protestantism
176 176 176 8xxxx female 18-29 38736.53 5 Catholicism
177 177 177 8xxxx male 30-39 51254.30 9 Catholicism
178 178 178 8xxxx female 40-49 3070.65 2 Islam
179 179 179 8xxxx male 50-59 67298.52 7 None
180 180 180 8xxxx male 40-49 52914.99 6 Protestantism
181 181 181 other female 60+ 53761.55 3 None
182 182 182 8xxxx male 40-49 62667.11 10 Catholicism
183 183 183 8xxxx female 18-29 69837.87 1 Protestantism
184 184 184 other male 30-39 48027.70 9 None
185 185 185 8xxxx male 50-59 68760.98 6 Catholicism
186 186 186 8xxxx female 18-29 53996.54 6 None
187 187 187 8xxxx <NA> 60+ 29089.31 2 Protestantism
188 188 188 8xxxx female 18-29 58074.39 0 Protestantism
189 189 189 8xxxx male 30-39 43647.51 5 Buddhism
190 190 190 8xxxx female 40-49 59911.66 15 Catholicism
191 191 191 other female 40-49 15015.81 4 None
192 192 192 8xxxx male 30-39 48884.82 3 Catholicism
193 193 193 other female 30-39 11665.95 5 Protestantism
194 194 194 other male 40-49 51434.15 6 Eastern Orthodoxy
195 195 195 8xxxx male 30-39 73230.95 10 Catholicism
196 196 196 8xxxx male 50-59 52969.14 5 None
197 197 197 other female 30-39 38514.53 11 None
198 198 198 8xxxx female 40-49 89185.85 0 Protestantism
199 199 199 8xxxx female 30-39 58564.02 4 Catholicism
200 200 200 other female 40-49 30312.73 3 Catholicism
job_title education
1 Local government officer trade school
2 Structural engineer high school
3 Psychotherapist, dance movement high school
4 Fitness centre manager high school
5 Programme researcher, broadcasting/film/video high school
6 Chief Strategy Officer high school
7 Engineer, communications high school
8 Secretary/administrator university
9 Video editor trade school
10 Hotel manager high school
11 Herbalist university
12 Teacher, primary school high school
13 Production assistant, television high school
14 Surveyor, mining high school
15 Data scientist high school
16 Programmer, applications high school
17 Horticulturist, commercial trade school
18 Training and development officer high school
19 Catering manager high school
20 Textile designer <NA>
21 Designer, fashion/clothing high school
22 Medical physicist high school
23 Seismic interpreter <NA>
24 Biomedical engineer high school
25 Biomedical engineer <NA>
26 Technical brewer university
27 Advertising art director high school
28 Clothing/textile technologist high school
29 Stage manager university
30 Advertising art director high school
31 IT consultant university
32 Secondary school teacher trade school
33 Restaurant manager <NA>
34 Politician's assistant trade school
35 Illustrator high school
36 Community pharmacist high school
37 Solicitor <NA>
38 Local government officer high school
39 Osteopath <NA>
40 Medical illustrator high school
41 Surgeon trade school
42 Financial controller high school
43 Systems analyst high school
44 Sports therapist high school
45 Sound technician, broadcasting/film/video doctoral title
46 Amenity horticulturist high school
47 Firefighter <NA>
48 Copy high school
49 Sales promotion account executive trade school
50 Public relations officer trade school
51 Financial planner high school
52 Museum/gallery curator high school
53 Dealer high school
54 Higher education careers adviser high school
55 General practice doctor <NA>
56 Recruitment consultant <NA>
57 Training and development officer high school
58 Mudlogger high school
59 Pharmacist, hospital high school
60 Horticultural consultant high school
61 Ergonomist high school
62 Chartered accountant <NA>
63 Psychiatrist <NA>
64 Data processing manager high school
65 Broadcast journalist trade school
66 Administrator, charities/voluntary organisations high school
67 Speech and language therapist high school
68 Energy manager high school
69 Editor, commissioning high school
70 Advertising account executive high school
71 Surveyor, land/geomatics high school
72 Exercise physiologist high school
73 Risk manager trade school
74 Risk manager high school
75 Games developer high school
76 Illustrator university
77 Speech and language therapist <NA>
78 Gaffer high school
79 Heritage manager high school
80 Buyer, industrial high school
81 Psychiatrist university
82 Public house manager university
83 Production assistant, television <NA>
84 Wellsite geologist high school
85 Dealer high school
86 Manufacturing engineer high school
87 Engineer, electrical <NA>
88 Brewing technologist university
89 Social researcher trade school
90 Advertising account planner high school
91 Community pharmacist high school
92 Conservator, furniture <NA>
93 Designer, blown glass/stained glass <NA>
94 Scientist, product/process development high school
95 Dealer <NA>
96 Insurance underwriter high school
97 Curator high school
98 Armed forces logistics/support/administrative officer university
99 Chief Executive Officer <NA>
100 Special effects artist trade school
101 Quarry manager doctoral title
102 Therapist, sports high school
103 Chartered management accountant <NA>
104 Graphic designer trade school
105 Professor Emeritus high school
106 Runner, broadcasting/film/video high school
107 Forensic psychologist trade school
108 Engineer, electrical <NA>
109 Designer, furniture university
110 Engineer, manufacturing systems high school
111 Patent attorney high school
112 Research officer, trade union university
113 Museum/gallery conservator high school
114 Furniture conservator/restorer high school
115 Television floor manager university
116 Print production planner high school
117 Financial trader high school
118 Estate agent high school
119 Trading standards officer trade school
120 Housing manager/officer university
121 Community pharmacist high school
122 Sound technician, broadcasting/film/video high school
123 Catering manager high school
124 Nutritional therapist doctoral title
125 Surveyor, mining high school
126 Designer, industrial/product <NA>
127 Geographical information systems officer high school
128 Furniture designer university
129 Multimedia programmer high school
130 Pharmacist, community doctoral title
131 Curator <NA>
132 Event organiser trade school
133 Energy engineer <NA>
134 Armed forces operational officer <NA>
135 Geophysicist/field seismologist high school
136 Sales promotion account executive <NA>
137 Conservation officer, historic buildings high school
138 Engineer, electrical trade school
139 Statistician doctoral title
140 Sport and exercise psychologist <NA>
141 Pharmacist, community high school
142 Water engineer high school
143 Data scientist university
144 Commercial horticulturist high school
145 Horticulturist, commercial <NA>
146 Homeopath high school
147 Minerals surveyor trade school
148 Cartographer trade school
149 Programmer, multimedia <NA>
150 IT trainer trade school
151 Commercial/residential surveyor high school
152 Water engineer high school
153 Insurance broker university
154 Museum/gallery exhibitions officer <NA>
155 Ceramics designer high school
156 Camera operator high school
157 Paramedic university
158 Fitness centre manager <NA>
159 Immunologist high school
160 Chief Executive Officer high school
161 Purchasing manager <NA>
162 Pharmacist, hospital university
163 Physiotherapist high school
164 Market researcher high school
165 Marketing executive high school
166 Horticulturist, amenity university
167 Jewellery designer <NA>
168 Make high school
169 Child psychotherapist university
170 Interior and spatial designer high school
171 Environmental health practitioner high school
172 Health and safety inspector high school
173 Broadcast journalist <NA>
174 Health visitor high school
175 Dancer high school
176 Lexicographer high school
177 Psychiatric nurse high school
178 Newspaper journalist university
179 Research scientist (maths) high school
180 Restaurant manager, fast food trade school
181 Software engineer <NA>
182 Engineer, electrical high school
183 Advertising account planner high school
184 Ranger/warden <NA>
185 Scientist, clinical (histocompatibility and immunogenetics) <NA>
186 Health physicist university
187 Special effects artist university
188 Hospital pharmacist university
189 Medical sales representative <NA>
190 Technical sales engineer university
191 Engineer, manufacturing high school
192 Surveyor, commercial/residential trade school
193 Scientist, water quality <NA>
194 Museum/gallery conservator high school
195 Psychologist, forensic high school
196 Optician, dispensing high school
197 Translator <NA>
198 Secretary, company <NA>
199 Economist university
200 Marketing executive high school
pol_immigration pol_environment pol_redistribution pol_eu_integration
1 5 4 5 2
2 3 5 3 4
3 2 1 2 5
4 2 3 5 1
5 2 4 3 3
6 2 5 5 4
7 2 3 3 2
8 5 5 4 5
9 1 3 1 3
10 1 3 3 5
11 4 4 2 2
12 5 4 2 3
13 3 2 3 2
14 3 2 5 3
15 5 1 2 3
16 4 5 5 5
17 2 3 4 3
18 1 2 1 1
19 1 2 4 4
20 3 2 3 1
21 5 3 5 1
22 4 2 4 4
23 1 2 1 3
24 5 5 4 1
25 5 5 3 4
26 2 2 2 3
27 4 2 2 4
28 5 1 1 1
29 2 2 3 2
30 3 3 5 3
31 3 2 4 1
32 1 1 5 2
33 5 3 5 1
34 3 2 3 4
35 4 1 1 5
36 3 3 3 4
37 1 4 4 5
38 2 1 5 5
39 1 2 1 3
40 4 1 5 5
41 4 2 1 3
42 3 3 3 4
43 5 2 4 2
44 2 3 4 5
45 2 2 3 5
46 4 5 4 3
47 2 1 3 1
48 3 4 4 2
49 5 4 2 3
50 4 5 5 2
51 5 3 1 1
52 4 1 4 1
53 3 3 2 3
54 3 5 2 5
55 5 2 5 2
56 5 3 5 2
57 3 1 1 4
58 1 5 4 4
59 1 4 1 4
60 5 2 5 3
61 4 2 3 1
62 1 5 4 4
63 4 2 3 2
64 5 5 5 2
65 5 1 3 5
66 2 2 1 5
67 1 3 3 5
68 2 3 3 3
69 3 5 5 5
70 4 4 2 4
71 1 3 5 4
72 2 5 1 3
73 1 3 1 5
74 2 5 5 3
75 5 4 5 2
76 5 1 5 2
77 5 3 5 2
78 2 1 1 4
79 5 3 2 3
80 3 3 4 5
81 3 4 1 2
82 5 3 3 5
83 2 3 5 4
84 3 3 5 1
85 4 5 1 3
86 1 3 5 4
87 2 3 3 1
88 2 1 2 3
89 4 1 2 1
90 1 3 3 5
91 4 4 4 3
92 3 3 2 5
93 4 1 4 3
94 3 3 3 2
95 1 5 2 2
96 2 3 1 5
97 4 5 5 2
98 4 5 5 4
99 3 2 2 2
100 2 5 2 2
101 5 4 2 5
102 2 5 4 2
103 2 3 2 3
104 5 4 3 1
105 2 3 4 1
106 1 4 3 1
107 3 5 4 3
108 5 2 2 1
109 3 4 2 1
110 2 1 2 5
111 2 5 2 1
112 1 3 2 4
113 4 5 4 4
114 3 3 1 5
115 3 5 5 4
116 2 3 3 4
117 4 4 4 1
118 1 4 4 4
119 1 4 3 5
120 1 5 1 1
121 2 1 5 2
122 2 5 2 2
123 1 1 5 5
124 1 1 4 2
125 3 5 4 3
126 5 3 1 5
127 4 3 3 3
128 2 4 5 4
129 1 2 2 3
130 2 5 4 3
131 4 3 1 1
132 3 5 5 1
133 2 3 3 2
134 2 5 3 4
135 3 4 2 1
136 5 1 2 5
137 3 2 4 1
138 1 4 5 4
139 5 5 1 1
140 4 5 2 3
141 3 1 1 5
142 5 2 1 3
143 4 1 5 1
144 5 4 4 5
145 4 1 2 1
146 4 4 4 1
147 2 4 3 2
148 1 1 2 2
149 3 2 4 4
150 4 4 3 2
151 1 3 4 3
152 1 2 4 2
153 2 4 2 5
154 3 5 5 2
155 4 4 1 4
156 1 3 1 3
157 4 2 2 3
158 5 2 3 4
159 1 2 4 2
160 1 2 3 3
161 1 5 3 1
162 5 1 1 1
163 3 5 1 5
164 2 3 2 1
165 4 4 3 1
166 2 5 3 5
167 1 2 1 2
168 2 1 2 3
169 1 3 5 5
170 1 1 4 3
171 2 4 4 5
172 1 2 5 1
173 2 1 1 5
174 2 5 4 5
175 5 2 1 3
176 2 1 4 2
177 5 4 3 2
178 1 4 2 5
179 3 4 2 5
180 5 2 1 2
181 3 4 2 1
182 1 3 5 5
183 1 3 4 4
184 3 3 1 2
185 1 3 3 2
186 5 4 3 5
187 1 3 4 1
188 1 5 1 4
189 2 1 4 5
190 5 3 4 4
191 4 1 1 3
192 3 5 4 4
193 1 1 5 1
194 4 2 3 3
195 4 1 1 2
196 5 5 4 5
197 5 3 3 4
198 2 5 1 5
199 4 3 3 5
200 5 5 5 5
Next, I generalize rare religion categories using fct_lump_min() from forcats. Since religion is not a keyVar in the sdcObject, I apply this directly to the dataset.
data_nonpert <- data_nonpert %>%
  mutate(religion = fct_lump_min(as.factor(religion), min = 10, other_level = "Other"))

table(data_nonpert$religion)
  Catholicism         Islam          None Protestantism         Other
           56            13            75            50             6
fct_lump_min(religion, min = 10) merges every category with fewer than 10 observations into "Other". Groups such as “Judaism” or “Buddhism”, which have very few members in this dataset, are natural candidates for collapsing, because a person could be identified through the combination of a rare religion and other attributes.
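To make the lumping behaviour easier to see in isolation, here is a minimal sketch on a small made-up vector. The vector religion_toy and its counts are assumptions for illustration only and are not taken from the tutorial dataset.

```r
library(forcats)

# Hypothetical toy data (not the tutorial's dataset):
# two common categories and two rare ones
religion_toy <- factor(c(
  rep("Catholicism", 12),
  rep("None", 15),
  rep("Buddhism", 2),
  rep("Judaism", 1)
))

# Every level with fewer than `min` observations is merged into "Other"
lumped <- fct_lump_min(religion_toy, min = 10, other_level = "Other")

# Catholicism 12, None 15, Other 3 (Buddhism 2 + Judaism 1)
counts <- table(lumped)
print(counts)

# The merged group is not guaranteed to reach the threshold itself
other_below_min <- counts[["Other"]] < 10
print(other_below_min)
```

Note that the merged "Other" group can itself stay below the threshold, as it does here and in the table above, where it holds 6 observations. Whether that is acceptable depends on the remaining variables, so it may be worth checking the count after lumping.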
Finally, I also save the sdcObject.
saveRDS(sdc_nonpert, "../sdc_nonpert.rds")
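A saved .rds file can be read back with readRDS(). The sketch below shows the round trip with a stand-in object and a temporary file; the object obj and the temporary path are assumptions for illustration, and in the tutorial the object would be sdc_nonpert and the path the one used above.

```r
# Stand-in object; in the tutorial this would be the sdcObject sdc_nonpert
obj <- list(note = "stand-in for an sdcObject", k = 3)

# Temporary file used only for this illustration
path <- tempfile(fileext = ".rds")

# Write the object to disk and read it back
saveRDS(obj, path)
restored <- readRDS(path)

# TRUE if the restored object matches the original
round_trip_ok <- identical(obj, restored)
print(round_trip_ok)
```

Unlike save() and load(), saveRDS() stores a single object without its name, so the result of readRDS() has to be assigned to a variable.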
Learning Objectives
After completing this part of the tutorial, you will be able to choose an appropriate non-perturbative technique.
After completing this part of the tutorial, you will be able to apply simple non-perturbative techniques.
Exercises
Choosing and applying non-perturbative techniques: given a dataset with some demographics, anonymize age, gender, country of residence, e-mail addresses, etc., choosing an appropriate technique for each variable. Show an example solution (explaining each decision and emphasizing that there is no single right answer).
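As a starting point for a task like this, the following sketch applies three non-perturbative techniques to a small made-up data frame using base R only. The data frame toy, its column names, the age bands, and the choice of which countries count as rare are all assumptions for illustration; other choices can be equally defensible.

```r
# Hypothetical toy data; names and values are made up for illustration
toy <- data.frame(
  age     = c(23, 37, 45, 61, 58),
  gender  = c("female", "male", "non-binary", "female", "male"),
  country = c("Germany", "France", "Nepal", "Germany", "India"),
  email   = c("a@example.org", "b@example.org", "c@example.org",
              "d@example.org", "e@example.org"),
  stringsAsFactors = FALSE
)

# Deletion: e-mail addresses are direct identifiers, so the column is removed
toy$email <- NULL

# Global recoding: every age is replaced by a band (left-closed intervals)
toy$age <- cut(
  toy$age,
  breaks = c(18, 30, 40, 50, 60, Inf),
  labels = c("18-29", "30-39", "40-49", "50-59", "60+"),
  right  = FALSE
)

# Local recoding: only the rare countries are collapsed to their continent
rare_countries <- c("Nepal", "India")  # assumed to be rare in this toy data
toy$country[toy$country %in% rare_countries] <- "Asia"

print(toy)
```

Gender is left unchanged here, although a category with a single member (such as "non-binary" in this toy data) may still need suppression or recoding depending on the other variables. That decision is part of the trade-off the exercise asks about.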