#install.packages(c("future.apply", "ggplot2", "lavaan", "MPsychoR"), dependencies = TRUE)
require(future.apply)
require(ggplot2)
require(lavaan)
require(MPsychoR)
Bonus: Fit indices in SEM
Reading/working time: ~15 min.
In this chapter, we will turn to simulation-based power analysis for fit indices in the context of SEM. We will build on the model introduced in Chapter 5 (Structural Equation Modelling (SEM)); it is therefore recommended to read that chapter first. The following chunk specifies this model, in which (latent) generalized prejudice is predicted by (latent) openness to experience (as conceptualized in the Big Five personality traits). We fit this model to the dataset by Bergh et al. (2016) in order to get a first impression of the model fit.
data("Bergh")
model_sem <- "generalized_prejudice =~ EP + DP + SP + HP
              openness =~ O1 + O2 + O3
              generalized_prejudice ~ openness"

#fit the SEM to the pilot data set
fit <- sem(model_sem, data = Bergh)
summary(fit, fit.measures = TRUE)
lavaan 0.6.15 ended normally after 40 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 15
Number of observations 861
Model Test User Model:
Test statistic 40.574
Degrees of freedom 13
P-value (Chi-square) 0.000
Model Test Baseline Model:
Test statistic 2395.448
Degrees of freedom 21
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.988
Tucker-Lewis Index (TLI) 0.981
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -4740.818
Loglikelihood unrestricted model (H1) -4720.530
Akaike (AIC) 9511.635
Bayesian (BIC) 9583.007
Sample-size adjusted Bayesian (SABIC) 9535.371
Root Mean Square Error of Approximation:
RMSEA 0.050
90 Percent confidence interval - lower 0.033
90 Percent confidence interval - upper 0.067
P-value H_0: RMSEA <= 0.050 0.482
P-value H_0: RMSEA >= 0.080 0.002
Standardized Root Mean Square Residual:
SRMR 0.038
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
Latent Variables:
Estimate Std.Err z-value P(>|z|)
generalized_prejudice =~
EP 1.000
DP 0.710 0.041 17.383 0.000
SP 0.907 0.052 17.337 0.000
HP 1.042 0.112 9.267 0.000
openness =~
O1 1.000
O2 0.932 0.035 26.255 0.000
O3 1.143 0.040 28.923 0.000
Regressions:
Estimate Std.Err z-value P(>|z|)
generalized_prejudice ~
openness -0.766 0.056 -13.641 0.000
Variances:
Estimate Std.Err z-value P(>|z|)
.EP 0.219 0.016 13.403 0.000
.DP 0.141 0.009 14.972 0.000
.SP 0.233 0.015 15.079 0.000
.HP 2.124 0.106 19.965 0.000
.O1 0.073 0.005 14.423 0.000
.O2 0.079 0.005 15.805 0.000
.O3 0.052 0.005 9.823 0.000
.generlzd_prjdc 0.191 0.018 10.362 0.000
openness 0.161 0.011 14.210 0.000
There are many different fit indices displayed in this output, for example the Comparative Fit Index (CFI), the Root Mean Square Error of Approximation (RMSEA), and the Standardized Root Mean Square Residual (SRMR). We cannot go into the details of interpreting these fit indices here, but it is important to know that many of them are not very sensitive to sample size. Running a classical power analysis for these fit indices is therefore not really meaningful. Instead of analyzing how one of these indices varies as a function of sample size, however, we can optimize the precision of one of them. For example, the lavaan output above displays the 90% confidence interval for the RMSEA index. We could plan for a sample size that ensures that the confidence interval around the RMSEA estimate has a certain maximum width, that is, that the RMSEA estimate is sufficiently precise. This is what we will learn to do in this chapter.
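To make the idea of precision concrete, the width of the interval is simply the difference between its two bounds. A minimal arithmetic sketch, using the RMSEA bounds reported in the lavaan output above:

```r
# 90% CI bounds for RMSEA, taken from the lavaan output above
rmsea_ci_lower <- 0.033
rmsea_ci_upper <- 0.067

# precision = width of the interval
rmsea_ci_width <- rmsea_ci_upper - rmsea_ci_lower
rmsea_ci_width   # 0.034, comfortably below a target maximum width of .10
```

With n = 861, the interval is already very narrow; the question the rest of this chapter answers is how small a sample may get before the interval becomes wider than our target.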
As a starting point, we again define the population parameters (i.e., the means, variances, and covariances of all measured variables). We use the study by Bergh et al. (2016) to estimate these parameters (just as we did in Chapter 5).
attach(Bergh)
#store means
means_vector <- c(mean(EP), mean(SP), mean(HP), mean(DP), mean(O1), mean(O2), mean(O3)) |> round(2)

#store covariances
cov_mat <- cov(cbind(EP, SP, HP, DP, O1, O2, O3)) |> round(2)
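The same pattern can be sanity-checked on toy data (illustrative stand-ins, not the Bergh variables): the means vector should have one entry per variable, and the covariance matrix should be square and symmetric.

```r
set.seed(1)
# toy stand-ins for three measured variables
toy <- data.frame(x = rnorm(200), y = rnorm(200), z = rnorm(200))

toy_means <- sapply(toy, mean) |> round(2)   # one mean per variable
toy_cov   <- cov(toy) |> round(2)            # symmetric covariance matrix

length(toy_means)      # 3
isSymmetric(toy_cov)   # TRUE
```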
With these parameters, we can simulate data using the rmvnorm function from the Rfast package. The only difference to the simulation described in Chapter 5 is that here, we do not calculate and store the p-value of the regression coefficient; rather, we compute the width of the RMSEA confidence interval and store it in a vector. We then count the number of simulations that yield a confidence interval with a maximum width of, say, .10. The next chunk shows how this is done.
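If Rfast is not available, the multivariate normal draws can be sketched in base R via a Cholesky factor of the covariance matrix. This is a minimal illustration of what rmvnorm does, not the package's actual implementation:

```r
# base-R sketch of multivariate normal simulation
simulate_mvn <- function(n, mu, sigma) {
  z <- matrix(rnorm(n * length(mu)), nrow = n)   # independent standard normals
  sweep(z %*% chol(sigma), 2, mu, "+")           # impose covariance, shift to means
}

set.seed(123)
sim <- simulate_mvn(n = 10000, mu = c(0, 5), sigma = diag(2))
colMeans(sim)   # close to 0 and 5
```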
set.seed(9875234)

#test ns between 50 and 200
ns <- seq(50, 200, by = 10)

#prepare empty data frame to store results
result <- data.frame()

#set number of iterations
iterations <- 1000

#write function
sim_sem <- function(n, model, mu, sigma) {
  simulated_data <- Rfast::rmvnorm(n = n, mu = mu, sigma = sigma) |> as.data.frame()
  fit_sem_simulated <- sem(model, data = simulated_data)
  rmsea_ci_width <- as.numeric(fitMeasures(fit_sem_simulated)["rmsea.ci.upper"] - fitMeasures(fit_sem_simulated)["rmsea.ci.lower"])
  return(rmsea_ci_width)
}

#replicate function with varying ns
for (n in ns) {
  rmsea_ci_width <- future_replicate(iterations, sim_sem(n = n, model = model_sem, mu = means_vector, sigma = cov_mat), future.seed = TRUE)
  result <- rbind(result, data.frame(
    n = n,
    power = sum(rmsea_ci_width < .1) / iterations
  ))
}
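Because ns only covers multiples of 10, a target sample size between two grid points can be read off by linear interpolation of the power curve. A sketch with hypothetical power values (replace power_curve with your simulated result):

```r
# hypothetical power values around the 80% target, for illustration only
power_curve <- data.frame(n = c(150, 160, 170), power = c(0.74, 0.79, 0.83))

# interpolate the n at which power crosses .80
approx(x = power_curve$power, y = power_curve$n, xout = 0.80)$y   # 162.5
```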
Let’s plot this.
ggplot(result, aes(x = n, y = power)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(n.breaks = 18, limits = c(30, 200)) +
  scale_y_continuous(n.breaks = 10, limits = c(0, 1)) +
  geom_hline(yintercept = 0.8, color = "red")
This analysis suggests that approx. 168 participants are needed to obtain a 90% confidence interval around the RMSEA estimate that is no wider than .10 in 80% of the cases.