Bonus: Fit indices in SEM

Author

Moritz Fischer

Reading/working time: ~15 min.

#install.packages(c("future.apply", "ggplot2", "lavaan", "MPsychoR"), dependencies = TRUE)

require(future.apply)
require(ggplot2)
require(lavaan)
require(MPsychoR)

In this chapter, we will turn to simulation-based power analysis for fit indices in the context of SEM. We will build on the model we have introduced in Chapter 5 (Structural Equation Modelling (SEM)), it is therefore recommendable to read this chapter first. The following chunk specifies this model, in which (latent) generalized prejudice is predicted by (latent) openness to experience (as conceptualized in the big five personality traits). We fit this model to the dataset by Bergh et al. (2016) in order to get a first impression of the model fit.

data("Bergh")

model_sem <- "generalized_prejudice =~ EP + DP + SP + HP
              openness =~ O1 + O2 + O3
              generalized_prejudice ~ openness"

#fit the SEM to the pilot data set
fit <- sem(model_sem, data = Bergh)

summary(fit, fit.measures = TRUE)
lavaan 0.6.15 ended normally after 40 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        15

  Number of observations                           861

Model Test User Model:
                                                      
  Test statistic                                40.574
  Degrees of freedom                                13
  P-value (Chi-square)                           0.000

Model Test Baseline Model:

  Test statistic                              2395.448
  Degrees of freedom                                21
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.988
  Tucker-Lewis Index (TLI)                       0.981

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -4740.818
  Loglikelihood unrestricted model (H1)      -4720.530
                                                      
  Akaike (AIC)                                9511.635
  Bayesian (BIC)                              9583.007
  Sample-size adjusted Bayesian (SABIC)       9535.371

Root Mean Square Error of Approximation:

  RMSEA                                          0.050
  90 Percent confidence interval - lower         0.033
  90 Percent confidence interval - upper         0.067
  P-value H_0: RMSEA <= 0.050                    0.482
  P-value H_0: RMSEA >= 0.080                    0.002

Standardized Root Mean Square Residual:

  SRMR                                           0.038

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Latent Variables:
                           Estimate  Std.Err  z-value  P(>|z|)
  generalized_prejudice =~                                    
    EP                        1.000                           
    DP                        0.710    0.041   17.383    0.000
    SP                        0.907    0.052   17.337    0.000
    HP                        1.042    0.112    9.267    0.000
  openness =~                                                 
    O1                        1.000                           
    O2                        0.932    0.035   26.255    0.000
    O3                        1.143    0.040   28.923    0.000

Regressions:
                          Estimate  Std.Err  z-value  P(>|z|)
  generalized_prejudice ~                                    
    openness                -0.766    0.056  -13.641    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .EP                0.219    0.016   13.403    0.000
   .DP                0.141    0.009   14.972    0.000
   .SP                0.233    0.015   15.079    0.000
   .HP                2.124    0.106   19.965    0.000
   .O1                0.073    0.005   14.423    0.000
   .O2                0.079    0.005   15.805    0.000
   .O3                0.052    0.005    9.823    0.000
   .generlzd_prjdc    0.191    0.018   10.362    0.000
    openness          0.161    0.011   14.210    0.000

There are many different fit indices displayed in this output, for example the Comparative Fit Index (CFI), the Root Mean Square Error of Approximation (RMSEA) and the Standardized Root Mean Square Residual (SRMR). We can not go into the details of the interpretations of the fit indices here, but it is important to know that many of these indices are not very sensitive to sample size. Therefore, running a power analysis for these fit indices is not really meaningful. But instead of analyzing how one of these indices varies as a function of sample size, we can optimize the precision of one of these indices. For example, the lavaan output from above displays the 90% confidence interval for the RMSEA index. We could, for example, plan to find a certain sample size that ensures that the confidence interval around the RMSEA estimate has a certain maximum size, that is, the RMSEA estimate is sufficiently precise. This what we will learn to do in this chapter.

As a starting point, we again define the population parameters (i.e., the means, variances, and co-variances of all measured variables). We use the study by Bergh et al. (2016) to estimate these parameters (just as we did in Chapter 5).

attach(Bergh)

#store means
means_vector <- c(mean(EP), mean(SP), mean(HP), mean(DP), mean(O1), mean(O2), mean(O3)) |> round(2)

#store covariances
cov_mat <- cov(cbind(EP, SP, HP, DP, O1, O2, O3)) |> round(2)

With these parameters, we can simulate data using the rmvnorm function from the Rfast package. The only difference to the simulation described in Chapter 5 is that here, we do not calculate and store the p-value of the regression coefficient, but rather, we compute the width of the RMSEA confidence interval and store it in a vector. We then count the number of simulations that yield a confidence interval with a maximum size of, say, .10. The next chunk shows how this is done.

set.seed(9875234)

#test ns between 50 and 200
ns <- seq(50, 200, by=10) 

#prepare empty vector to store results
result <- data.frame()

#set number of iterations
iterations <- 1000

#write function
sim_sem <- function(n, model, mu, sigma) {
  

  simulated_data <- Rfast::rmvnorm(n = n, mu = mu, sigma = sigma) |> as.data.frame()
  fit_sem_simulated <- sem(model_sem, data = simulated_data)
  rmsea_ci_width <- as.numeric(fitMeasures(fit_sem_simulated)["rmsea.ci.upper"] - fitMeasures(fit_sem_simulated)["rmsea.ci.lower"])
  return(rmsea_ci_width)
  
    }


#replicate function with varying ns
for (n in ns) {  
  
rmsea_ci_width <- future_replicate(iterations, sim_sem(n = n, model = model_sem, mu = means_vector, sigma = cov_mat), future.seed=TRUE)  
result <- rbind(result, data.frame(
    n = n,
    power = sum(rmsea_ci_width < .1)/iterations)
  )

}

Let’s plot this.

ggplot(result, aes(x=n, y=power)) + geom_point() + geom_line() + scale_x_continuous(n.breaks = 18, limits = c(30,200)) + scale_y_continuous(n.breaks = 10, limits = c(0,1)) + geom_hline(yintercept= 0.8, color = "red")

This analysis suggests that approx. 168 participants are needed to obtain a 90%-confidence interval around the RMSEA coefficient that is not larger than .10 in 80% of the cases.