#install.packages(c("ggplot2", "DescTools"))
library(ggplot2)
library(DescTools)
Ch. 3: Generalized Linear Models
Reading/working time: ~35 min.
In the previous chapters, we have learned how to conduct simulation-based power analyses for linear regressions with categorical and/or continuous predictor variables. In this chapter, we will turn to generalized linear models, which constitute a generalization of regular linear regressions. This class of statistical models additionally comprises more complex models such as logistic regression and Poisson regression. However, as covering all variants of generalized linear models would exceed the scope of this workshop, we will only learn how to conduct a simulation-based power analysis for a logistic regression. Recall that a logistic regression is suitable for designs in which you predict a dichotomous outcome with a set of (continuous or categorical) predictors.
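As a quick reminder of what such a model looks like in R, here is a minimal sketch using glm() with the binomial family. The data frame my_data and its variables group and outcome are made-up placeholders for illustration only; the actual analysis in this chapter uses the BtheB data introduced below.
#minimal sketch of a logistic regression in R (made-up data, for illustration only)
set.seed(1)
my_data <- data.frame(
  group   = factor(rep(c("A", "B"), each = 50)),  #one categorical predictor
  outcome = rbinom(100, size = 1, prob = 0.5)     #dichotomous outcome coded 0/1
)
fit <- glm(outcome ~ group, data = my_data, family = "binomial")
summary(fit)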
A power analysis for a simple logistic regression with one categorical predictor
In order to learn how a simulation-based power analysis for logistic regressions works, we build on the previous chapters by continuing to work with the BtheB data set. Please see the first chapter on linear models for more information about this data set.
Let’s get some data as a starting point
# load the data
data("BtheB", package = "HSAUR")
# show which variables this data set includes
str(BtheB)
'data.frame': 100 obs. of 8 variables:
$ drug : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
$ length : Factor w/ 2 levels "<6m",">6m": 2 2 1 2 2 1 1 2 1 2 ...
$ treatment: Factor w/ 2 levels "TAU","BtheB": 1 2 1 2 2 2 1 1 2 2 ...
$ bdi.pre : num 29 32 25 21 26 7 17 20 18 20 ...
$ bdi.2m : num 2 16 20 17 23 0 7 20 13 5 ...
$ bdi.4m : num 2 24 NA 16 NA 0 7 21 14 5 ...
$ bdi.6m : num NA 17 NA 10 NA 0 3 19 20 8 ...
$ bdi.8m : num NA 20 NA 9 NA 0 7 13 11 12 ...
This data set compares the effects of two interventions for depression, namely a so-called “Beat the Blues” intervention (BtheB) and a “treatment-as-usual” intervention (TAU). Note that in this chapter we compare two active treatment conditions with each other, whereas in the other chapters on linear models we compared an active treatment with a passive waiting group.
Unfortunately, this data set does not include any dichotomous variable we could use as an outcome measure. Therefore, we simply dichotomize one of the continuous outcome variables to create a categorical outcome artificially (note that dichotomizing continuous variables is generally not recommended, see Royston et al., 2006; we do it here purely for didactic purposes). More specifically, we will dichotomize the bdi.2m variable, which contains a follow-up depression measure (i.e., the Beck Depression Inventory score) two months after the intervention.
Let’s start by dichotomizing the bdi.2m variable and storing the result in a new variable which we label bdi.2m.dicho. Here, we take 20 as our cut-off value, as BDI scores of 20 or more indicate a moderate or severe depression, whereas BDI values lower than 20 reflect a minimal or mild depression (see the first chapter on linear models for more info). In our new bdi.2m.dicho variable, we code minimal/mild depression as “0” and moderate/severe depression as “1”.
#dichotomize dv
BtheB$bdi.2m.dicho <- ifelse(BtheB$bdi.2m < 20, 0, 1)
#show frequencies
table(BtheB$bdi.2m.dicho, BtheB$treatment)
     TAU BtheB
  0   21    37
  1   24    15
The frequency table shows that there were fewer patients with moderate/severe depression in the BtheB condition than in the TAU condition. That’s a first sign that the BtheB intervention might outperform the treatment-as-usual!
In this chapter, we focus on a very simple version of a logistic regression: a model in which the dichotomous outcome variable is predicted by one categorical predictor variable, that is, the treatment condition (BtheB vs. TAU). Let’s assume that we plan to set up a study in which we want to scrutinize whether or not the “Beat the Blues” intervention really outperforms the “treatment-as-usual” intervention with regard to the BDI scores two months after the end of the interventions. How many participants would we need for such a study to achieve 80% power?
Estimating the population parameters
As in the previous chapters, we first need to estimate all population parameters relevant for this study. More specifically, we will need three estimates:
- the proportion of participants in the BtheB condition
- the probability of a favorable outcome in the BtheB condition
- the probability of a favorable outcome in the TAU condition
The first parameter is pretty easy to estimate. In most cases, researchers will assign half of the participants to the treatment condition and the other half to a control condition. In this case, the probability of being in the BtheB condition would be 50%. Let’s store that in a variable called prop_btheb.
prop_btheb <- 0.5
The other two parameters are only slightly more complicated to estimate. The probability of a favorable outcome in each condition basically answers the following questions: What proportion of participants will not have a moderate/severe depression two months after completing the treatment-as-usual? And what proportion of participants will not have a moderate/severe depression two months after completing the Beat the Blues intervention?
These two probabilities can only be estimated with solid pilot data. In our case, we can estimate both probabilities from the BtheB data set. The following chunk does that, rounds the estimates to two decimals, and stores the estimates in two objects called prob_tau and prob_btheb.
prob_tau <- round(length(BtheB$treatment[BtheB$treatment == "TAU" & BtheB$bdi.2m.dicho == 0 & !is.na(BtheB$bdi.2m.dicho)]) / length(BtheB$treatment[BtheB$treatment == "TAU" & !is.na(BtheB$bdi.2m.dicho)]), 2)
prob_tau
[1] 0.47
prob_btheb <- round(length(BtheB$treatment[BtheB$treatment == "BtheB" & BtheB$bdi.2m.dicho == 0 & !is.na(BtheB$bdi.2m.dicho)]) / length(BtheB$treatment[BtheB$treatment == "BtheB" & !is.na(BtheB$bdi.2m.dicho)]), 2)
prob_btheb
[1] 0.71
In the BtheB data set, the probability of not having a moderate/severe depression in the BtheB condition was 0.71, and the same probability in the TAU condition was 0.47.
Now we have all we need to start simulating data. There are multiple ways to simulate this kind of data, but one very simple way is to apply the sample() function. Here, we use it twice: once to draw for the TAU condition and once for the BtheB condition (which differ in their probabilities of yielding a positive outcome, as we have seen before). For each condition, we draw whether or not the participant has a moderate/severe depression two months after the treatment, while coding “no moderate/severe depression” with 0 and “moderate/severe depression” with 1. Let’s try this by sampling 100 observations per condition.
set.seed(8526)
#sample from distribution in "TAU" condition
tau <- cbind(rep("TAU", 100), sample(x = c(0,1), replace = TRUE, prob = c(prob_tau, 1-prob_tau), size = 100))

#sample from distribution in "BtheB" condition
btheb <- cbind(rep("BtheB", 100), sample(x = c(0,1), replace = TRUE, prob = c(prob_btheb, 1-prob_btheb), size = 100))
We now have two objects (labeled tau and btheb), one per condition. In the following chunk, we combine them into one data frame, change the variable names, transform the treatment variable to a factor, transform the outcome variable to an integer variable, and reverse the factor levels – all of this is necessary to run our logistic regression on this data set later.
#combine data frames
simulated_data <- rbind(tau, btheb) |> as.data.frame()

#set variable names
colnames(simulated_data) <- c("treatment", "bdi.2m.dicho")

#change treatment variable to factor
simulated_data$treatment <- simulated_data$treatment |> as.factor()

#change outcome variable to integer
simulated_data$bdi.2m.dicho <- simulated_data$bdi.2m.dicho |> as.integer()

#reverse factor levels, this affects the coding of this factor in the regression analysis later
simulated_data$treatment <- factor(simulated_data$treatment, levels = rev(levels(simulated_data$treatment)))
With this first simulated data set, we can now perform a first logistic regression.
simulated_fit <- glm(bdi.2m.dicho ~ treatment, data = simulated_data, family = "binomial")
summary(simulated_fit)
Call:
glm(formula = bdi.2m.dicho ~ treatment, family = "binomial",
data = simulated_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3354 -0.8615 -0.8615 1.0273 1.5305
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.3640 0.2033 1.790 0.0734 .
treatmentBtheB -1.1641 0.2968 -3.922 8.78e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 275.26 on 199 degrees of freedom
Residual deviance: 259.19 on 198 degrees of freedom
AIC: 263.19
Number of Fisher Scoring iterations: 4
Here, we find a negative and significant regression weight of -1.16 for the treatment predictor, indicating that the BtheB treatment led to fewer cases of moderate/severe depression than the TAU treatment. But, actually, it is not our central interest to test whether this particular simulated data set results in a significant treatment effect. What we really want to know is: How many of a theoretically infinite number of simulations yield a significant p-value for this effect? Thus, as in the previous chapters, we now repeatedly simulate data sets of a certain size from the specified population and store the results of the focal test (here: the p-value of the regression coefficient) in a vector called p_values.
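Because logistic regression coefficients are on the log-odds scale, it can help to exponentiate them into odds ratios before interpreting them. This is an optional step applied to the single simulated data set from above:
#translate the log-odds coefficients into odds ratios
#a value below 1 for treatmentBtheB indicates lower odds of moderate/severe depression in the BtheB condition
exp(coef(simulated_fit))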
Let’s do the power analysis
#write a function to automate the data simulation process
simulation <- function(n, p0, p1, prop = .50){

  #sample from distribution in "TAU" condition
  tau <- cbind(rep("TAU", n*prop), sample(x = c(0,1), replace = TRUE, prob = c(p0, 1-p0), size = n*prop))

  #sample from distribution in "BtheB" condition
  btheb <- cbind(rep("BtheB", n*prop), sample(x = c(0,1), replace = TRUE, prob = c(p1, 1-p1), size = n*prop))

  #combine both data sets and do some preprocessing
  simulated_data <- rbind(tau, btheb) |> as.data.frame()
  colnames(simulated_data) <- c("treatment", "bdi.2m.dicho")
  simulated_data$treatment <- simulated_data$treatment |> as.factor()
  simulated_data$bdi.2m.dicho <- simulated_data$bdi.2m.dicho |> as.numeric()
  simulated_data$treatment <- factor(simulated_data$treatment, levels = rev(levels(simulated_data$treatment)))

  return(simulated_data)
}
#prepare empty vector to store the p-values
p_value <- NULL

#prepare empty data frame to store the results (i.e., the power per sample size)
results <- data.frame()

#write a function that simulates a data set, runs the regression, and returns the p-value of the treatment effect
sim <- function(n, p0, p1, prop = .50){

  #simulate data
  data <- simulation(n = n, p0 = p0, p1 = p1, prop = prop)

  #run regression
  simulated_fit <- glm(bdi.2m.dicho ~ treatment, data = data, family = "binomial")

  #store p-value
  p_value <- coef(summary(simulated_fit))[2,4]

  #return p-value
  return(p_value)
}
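Before running the full loop, it can be reassuring to call sim() once and check that it returns a single p-value between 0 and 1 (an optional check; the exact value will vary because the data are sampled randomly):
#quick check: one call should return a single p-value
sim(n = 200, p0 = prob_tau, p1 = prob_btheb)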
# set range of sample sizes
ns <- seq(from = 20, to = 500, by = 10)

# set number of iterations
iterations <- 1000

# perform power analysis
for(n in ns){
  p_values <- replicate(iterations, sim(n, p0 = prob_tau, p1 = prob_btheb, prop = prop_btheb))
  results <- rbind(results, data.frame(
    n = n,
    power = sum(p_values < .005)/iterations)
  )
}
Let’s visualize the results.
ggplot(results, aes(x = n, y = power)) +
  geom_point() +
  geom_line() +
  scale_y_continuous(n.breaks = 10) +
  scale_x_continuous(n.breaks = 20) +
  geom_hline(yintercept = 0.8, color = "red")
This plot shows that we need approximately 220 participants to achieve 80% power under the given assumptions. Note that this is the overall sample size, not the sample size per condition! That’s it – we have performed a simulation-based power analysis for a logistic regression!
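Instead of reading the required sample size off the plot, you can also extract it from the results data frame. Keep in mind that this is only an approximate readout because the power estimates themselves contain simulation noise:
#smallest simulated sample size with an estimated power of at least 80%
min(results$n[results$power >= .80])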
Verification with the pwr2ppl package
Luckily, there are also R packages that can perform these kinds of power analyses, for example the pwr2ppl package (Aberson, 2019). We can use this package to verify our results. Does the pwr2ppl package yield the same result? Note that you will need to install this package if you haven’t used it before.
#install.packages("devtools")
#devtools::install_github("chrisaberson/pwr2ppl")
pwr2ppl::LRcat(p0 = prob_tau, p1 = prob_btheb, prop = .50, alpha = .005, power = .80)
Sample Size = 221 for Odds Ratio = 2.761
That’s basically the same result, well done!
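As a side note, the odds ratio reported by pwr2ppl::LRcat() can be reproduced directly from our two probability estimates, which is a quick way to check that we passed the intended parameters:
#odds of a favorable outcome in each condition, and their ratio
#this should come out to roughly 2.76, matching the odds ratio reported above
(prob_btheb / (1 - prob_btheb)) / (prob_tau / (1 - prob_tau))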
Using a safeguard power approach
Of note, our effect size estimates (i.e., the estimates of the probabilities of not having a moderate/severe depression two months after the BtheB or the TAU treatment) were so far based on a pilot study. However, this pilot study might not have yielded precise estimates of these effect sizes. Thus, in order to account for the uncertainty in these effect size estimates, it has been suggested to perform a safeguard power analysis (see the first chapter on linear models for more info) instead of a power analysis based on the observed effect size. The main idea behind the safeguard power analysis is to compute a 60% confidence interval around the observed effect size and to take the lower bound of this confidence interval as the effect size estimate (see Perugini et al., 2014). Let’s try this here.
Let’s assume that we consider the probability of not having a moderate/severe depression after the TAU treatment to be reasonably accurate, but that we want to account for uncertainty in the estimate of the probability of not having a moderate/severe depression after the BtheB treatment. We then simply calculate the 60% confidence interval around this effect size. We can use the BinomCI function from the DescTools package to do this. We need to provide this function with the number of successes (here: 37, see the table above) and the number of observations (here: 52, see above).
DescTools::BinomCI(37, 52, conf.level = 0.60)
           est    lwr.ci   upr.ci
[1,] 0.7115385 0.6560993 0.761292
This gives us an estimate of 0.66 for the lower bound of the 60% confidence interval of the probability of not having a moderate/severe depression after the BtheB treatment, while the point estimate for this effect size was 0.71. We can now redo our power analysis from above with this new effect size estimate. I am copying the chunk from above, replacing the prob_btheb value with 0.66.
#prepare empty vector to store the p-values
p_value <- NULL

#prepare empty data frame to store the results (i.e., the power per sample size)
results <- data.frame()

# set range of sample sizes
ns <- seq(from = 20, to = 500, by = 10)

# set number of iterations
iterations <- 1000

# perform power analysis with the safeguard estimate for the BtheB condition
for(n in ns){
  p_values <- replicate(iterations, sim(n, p0 = prob_tau, p1 = 0.66, prop = prop_btheb))
  results <- rbind(results, data.frame(
    n = n,
    power = sum(p_values < .005)/iterations)
  )
}
#let's plot this
ggplot(results, aes(x = n, y = power)) +
  geom_point() +
  geom_line() +
  scale_y_continuous(n.breaks = 10) +
  scale_x_continuous(n.breaks = 20) +
  geom_hline(yintercept = 0.8, color = "red")
Here, we need a total sample size of approximately 360 participants in order to ensure 80% power with our safeguard estimate.
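As before, we can cross-check the simulation result with the pwr2ppl package by plugging in the safeguard estimate of 0.66 instead of the observed 0.71. The resulting sample size should be in the same ballpark as our simulation-based estimate:
#verify the safeguard power analysis with pwr2ppl
pwr2ppl::LRcat(p0 = prob_tau, p1 = 0.66, prop = .50, alpha = .005, power = .80)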
References
Aberson, C. L. (2019). Applied power analysis for the behavioral sciences. Routledge.
Perugini, M., Gallucci, M., & Costantini, G. (2014). Safeguard power as a protection against imprecise power estimates. Perspectives on Psychological Science, 9(3), 319-332. https://doi.org/10.1177/1745691614528519
Royston, P., Altman, D. G., & Sauerbrei, W. (2006). Dichotomizing continuous predictors in multiple regression: A bad idea. Statistics in Medicine, 25(1), 127–141. https://doi.org/10.1002/sim.2331