Forming a scale can be superior to relying on individual metrics. Scales tend to provide more reliable results, capturing underlying constructs that are not as well measured by any single item.

Here is an example of creating a scale using Exploratory Factor Analysis (EFA) on data I gathered while working on a federally funded project called SnowPacs. First, let's load the data.
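This walkthrough leans on a few packages: openxlsx to read the spreadsheet, psych for the factor-analysis functions, ltm for Cronbach's alpha, and ggplot2 for the plots. Load them up front (assuming they're installed):

library(openxlsx)   # read.xlsx
library(psych)      # KMO, cortest.bartlett, fa.parallel, fa
library(ltm)        # cronbach.alpha
library(ggplot2)    # plotting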

df <- read.xlsx(xlsxFile = "G:/Users/Chris/Google Drive/SnowCap/Scale Construction/SnowPacs Likert Items Year1-5 Factor column rem.xlsx", sheet = 1, skipEmptyRows = FALSE)
head(df)
##   Year Q4_1 Q4_2 Q4_3 Q4_4 Q4_5 Q8_1 Q8_2 Q8_3 Q8_4 Q8_5 Q8_6 Q8_7 Q8_8 Q8_9
## 1    5    7    2    6    5    5    5    7    2    3    6    7    7    7    6
## 2    5    7    1    7    6    6    6    7    1    2    7    7    7    7    7
## 3    5    3    6    2    4    3    4    4    4    1    7    7    7    7    3
## 4    5    4    4    5    3    2    6    6    6    3    7    7    7    7    4
## 5    5    6    1    6    5    7    5    7    2    2    7    7    6    7    6
## 6    5    3    5    3    1    3    7    6    2    2    7    7    7    7    3
##   Q11_1 Q11_2 Q11_3 Q11_4 Q11_5 Q11_6 Q11_7 Q11_8 Q11_9 Q11_10 Q11_11 Q11_12
## 1     4     6     6     2     3     1     3     7     7      5      2      7
## 2     5     5     6     1     7     1     2     7     7      7      6      7
## 3     3     6     6     5     5     6     5     4     3      3      3      2
## 4     2     2     2     6     3     6     2     4     4      4      2      2
## 5     4     5     5     3     6     2     2     7     7      5      4      7
## 6     5     5     3     6     6     5     2     6     5      3      3      3
##   Q11_13 Q11_14 Q12_1 Q12_2 Q12_3 Q12_4 Q12_5 Q12_6 Q12_7 Q12_8 Q12_9 Q12_10
## 1      5      5     2     1     1     2     1     6     6     1     1      2
## 2      6      6     2     2     1     6     1     5     7     1     6      5
## 3      2      1     4     7     5     3     5     6     2     4     1      1
## 4      2      2     6     4     6     5     5     7     5     2     3      3
## 5      4      4     5     2     1     4     2     6     6     2     6      2
## 6      2      3     7    NA     3     3     5     7     5     5     5      5
##   Q12_11 Q12_12 Q12_13 Q12_14 Q12_15 Q12_16 Q12_17 Q12_18 Q12_19 Q14_1 Q14_2
## 1      7     NA     NA     NA     NA     NA     NA     NA     NA     6     5
## 2      7      6      7      6      6      6      6      7      6     7     7
## 3      6      6      4      4      3      3      5      4      5     4     6
## 4      3      3      3      4      4      6      7      7      4     1     6
## 5      5      5      6      5      6      6      7      5      6     7     7
## 6      5      6      6      3      3      1      7      3      3     5     7
##   Q14_3 Q14_4 Q17_1 Q17_2 Q17_3 Q17_4 Q17_5 Q17_6 Q17_7 Q17_8 Q17_9 Q17_10
## 1     4     6     7    NA     2     3     2    NA     7     7     1      6
## 2     5     6     6     7     2     2     1     6     7     7     1      2
## 3     4     4     3     3     5     3     4     4     5     6     7      3
## 4     2     3     2     6     6     2     7     3     5     5     5      5
## 5     6     6     7     6     5     2     2     5     6     7     4      5
## 6     5     5     2     5     5     5     5     2     7     6     5      3
##   Q17_11
## 1      6
## 2      7
## 3      6
## 4      3
## 5      4
## 6      3

This is survey data retrieved from Qualtrics. We have administered the survey annually to project team members over the past five years. Although you can't see the wording of the items we asked our respondents, that doesn't matter much here, because we are choosing our items statistically. We have over 60 items to pick from, so let's get started!
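As a quick sanity check on the size of the candidate pool (counting the columns shown above):

ncol(df) - 1   # 62 Likert items, once we exclude the Year column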

First, we want to perform a Kaiser-Meyer-Olkin (KMO) test to determine how well suited our data is for factor analysis. The KMO statistic, also called the measure of sampling adequacy (MSA), is reported for the correlation matrix as a whole and for each item individually.

datamatrix <- cor(df, use="complete.obs")
KMO(datamatrix)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = datamatrix)
## Overall MSA =  0.39
## MSA for each item = 
##   Year   Q4_1   Q4_2   Q4_3   Q4_4   Q4_5   Q8_1   Q8_2   Q8_3   Q8_4   Q8_5 
##   0.28   0.67   0.32   0.71   0.34   0.25   0.07   0.14   0.37   0.28   0.13 
##   Q8_6   Q8_7   Q8_8   Q8_9  Q11_1  Q11_2  Q11_3  Q11_4  Q11_5  Q11_6  Q11_7 
##   0.19   0.09   0.20   0.61   0.42   0.40   0.56   0.55   0.22   0.40   0.09 
##  Q11_8  Q11_9 Q11_10 Q11_11 Q11_12 Q11_13 Q11_14  Q12_1  Q12_2  Q12_3  Q12_4 
##   0.65   0.46   0.45   0.64   0.55   0.48   0.70   0.15   0.69   0.46   0.14 
##  Q12_5  Q12_6  Q12_7  Q12_8  Q12_9 Q12_10 Q12_11 Q12_12 Q12_13 Q12_14 Q12_15 
##   0.36   0.08   0.33   0.24   0.41   0.10   0.57   0.76   0.36   0.40   0.32 
## Q12_16 Q12_17 Q12_18 Q12_19  Q14_1  Q14_2  Q14_3  Q14_4  Q17_1  Q17_2  Q17_3 
##   0.75   0.19   0.26   0.58   0.84   0.26   0.42   0.71   0.59   0.80   0.11 
##  Q17_4  Q17_5  Q17_6  Q17_7  Q17_8  Q17_9 Q17_10 Q17_11 
##   0.18   0.22   0.30   0.48   0.65   0.31   0.45   0.56

Kaiser (1974) gave us a rubric to put these scores into context: MSA values in the .90s are "marvelous", the .80s "meritorious", the .70s "middling", the .60s "mediocre", the .50s "miserable", and anything below .50 "unacceptable".
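To see where our items land on that rubric, we can bin the per-item MSA values. The kmo_label helper below is just an illustrative convenience I'm defining here, not part of the psych package:

# Map MSA values onto Kaiser's (1974) descriptive labels
kmo_label <- function(msa) {
  cut(msa,
      breaks = c(0, 0.5, 0.6, 0.7, 0.8, 0.9, 1),
      labels = c("unacceptable", "miserable", "mediocre",
                 "middling", "meritorious", "marvelous"))
}
table(kmo_label(KMO(datamatrix)$MSAi))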

So at the moment our data is pretty poor with regard to scale construction. I'm going to remove anything with an MSA of less than .40, then re-run the analysis and see if there's any improvement.

df1 <- df[, KMO(datamatrix)$MSAi > 0.40]   # keep only the items with MSA > .40 (this also drops Year)
datamatrix <- cor(df1, use = "complete.obs")
KMO(datamatrix)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = datamatrix)
## Overall MSA =  0.88
## MSA for each item = 
##   Q4_1   Q4_3   Q8_9  Q11_1  Q11_2  Q11_3  Q11_4  Q11_8  Q11_9 Q11_10 Q11_11 
##   0.88   0.83   0.90   0.89   0.88   0.94   0.93   0.89   0.91   0.92   0.88 
## Q11_12 Q11_13 Q11_14  Q12_2  Q12_3  Q12_9 Q12_11 Q12_12 Q12_14 Q12_16 Q12_19 
##   0.93   0.91   0.93   0.85   0.87   0.79   0.87   0.85   0.86   0.82   0.84 
##  Q14_1  Q14_3  Q14_4  Q17_1  Q17_2  Q17_7  Q17_8 Q17_10 Q17_11 
##   0.90   0.85   0.83   0.89   0.89   0.86   0.78   0.62   0.87

This looks much better than it did before. Most of our items are above .80, and only one (Q17_10, at .62) falls below .78. We should also run Bartlett's test of sphericity, which gives us another indication of whether our data is suitable for factor analysis.

cortest.bartlett(df1)
## R was not square, finding R from data
## $chisq
## [1] 2189.882
## 
## $p.value
## [1] 5.744438e-221
## 
## $df
## [1] 465

Our test was significant, meaning the correlation matrix is not an identity matrix and our data is suitable for FA. If we run a parallel analysis with a scree plot on this data, it should suggest how many factors to extract.

fa.parallel(df1, fa='both')

## Parallel analysis suggests that the number of factors =  3  and the number of components =  3

According to the parallel analysis, three factors should be sufficient.
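If you want to see the numbers behind that recommendation, fa.parallel returns them invisibly. A quick sketch, assuming psych's return elements $nfact, $fa.values (observed eigenvalues), and $fa.sim (mean simulated eigenvalues):

pa <- fa.parallel(df1, fa = "both", plot = FALSE)
pa$nfact   # suggested number of factors
# A factor is retained while its observed eigenvalue exceeds the simulated one
cbind(observed = pa$fa.values[1:5], simulated = pa$fa.sim[1:5])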

For this analysis, I'm going to use maximum likelihood estimation as the factoring method and varimax (an orthogonal rotation) for the rotation. These are also the defaults for the 'factanal' function we will use in a moment.
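Varimax assumes the factors are uncorrelated. If you suspect yours aren't, psych makes it easy to compare an oblique rotation; a minimal sketch (oblimin requires the GPArotation package):

# Same model with an oblique rotation, for comparison
fa.obl <- fa(r = na.omit(df1), nfactors = 3, fm = "ml", rotate = "oblimin")
fa.obl$Phi   # factor intercorrelations; large values would argue against varimax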

fa.pm <- fa(r=na.omit(df1), nfactors = 3, fm="ml", max.iter=500, rotate="varimax")
print(fa.pm)
## Factor Analysis using method =  ml
## Call: fa(r = na.omit(df1), nfactors = 3, rotate = "varimax", max.iter = 500, 
##     fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##          ML1   ML2   ML3   h2   u2 com
## Q4_1    0.82  0.20 -0.05 0.71 0.29 1.1
## Q4_3    0.90  0.19  0.00 0.84 0.16 1.1
## Q8_9    0.75 -0.01  0.34 0.68 0.32 1.4
## Q11_1   0.45  0.66  0.07 0.65 0.35 1.8
## Q11_2   0.16  0.69  0.25 0.56 0.44 1.4
## Q11_3   0.35  0.67  0.34 0.69 0.31 2.1
## Q11_4  -0.59 -0.47 -0.13 0.58 0.42 2.0
## Q11_8   0.67  0.18  0.48 0.71 0.29 2.0
## Q11_9   0.74  0.28  0.31 0.72 0.28 1.7
## Q11_10  0.69  0.43  0.23 0.72 0.28 1.9
## Q11_11  0.31  0.67  0.26 0.60 0.40 1.7
## Q11_12  0.59  0.40  0.43 0.69 0.31 2.7
## Q11_13  0.52  0.74  0.21 0.85 0.15 2.0
## Q11_14  0.69  0.23  0.18 0.55 0.45 1.4
## Q12_2  -0.54 -0.40 -0.11 0.47 0.53 1.9
## Q12_3  -0.53 -0.20 -0.29 0.41 0.59 1.9
## Q12_9   0.29 -0.02  0.50 0.34 0.66 1.6
## Q12_11  0.30  0.24  0.52 0.42 0.58 2.1
## Q12_12  0.13  0.35  0.49 0.38 0.62 2.0
## Q12_14  0.26  0.81  0.14 0.74 0.26 1.3
## Q12_16  0.24  0.80  0.01 0.70 0.30 1.2
## Q12_19 -0.04  0.60  0.13 0.38 0.62 1.1
## Q14_1   0.44  0.43  0.39 0.53 0.47 3.0
## Q14_3   0.20  0.36  0.77 0.76 0.24 1.6
## Q14_4   0.16  0.39  0.69 0.65 0.35 1.7
## Q17_1   0.55  0.53  0.32 0.68 0.32 2.6
## Q17_2   0.49  0.57  0.02 0.56 0.44 2.0
## Q17_7   0.36  0.14  0.66 0.59 0.41 1.7
## Q17_8  -0.03  0.20  0.68 0.51 0.49 1.2
## Q17_10  0.04  0.14 -0.43 0.21 0.79 1.2
## Q17_11  0.25  0.42  0.52 0.52 0.48 2.4
## 
##                        ML1  ML2  ML3
## SS loadings           7.28 6.55 4.61
## Proportion Var        0.23 0.21 0.15
## Cumulative Var        0.23 0.45 0.59
## Proportion Explained  0.39 0.35 0.25
## Cumulative Proportion 0.39 0.75 1.00
## 
## Mean item complexity =  1.8
## Test of the hypothesis that 3 factors are sufficient.
## 
## The degrees of freedom for the null model are  465  and the objective function was  32.6 with Chi Square of  1950.47
## The degrees of freedom for the model are 375  and the objective function was  10.28 
## 
## The root mean square of the residuals (RMSR) is  0.06 
## The df corrected root mean square of the residuals is  0.06 
## 
## The harmonic number of observations is  72 with the empirical chi square  215.25  with prob <  1 
## The total number of observations was  72  with Likelihood Chi Square =  594.72  with prob <  3.4e-12 
## 
## Tucker Lewis Index of factoring reliability =  0.808
## RMSEA index =  0.089  and the 90 % confidence intervals are  0.077 0.104
## BIC =  -1009.03
## Fit based upon off diagonal values = 0.98
## Measures of factor score adequacy             
##                                                    ML1  ML2  ML3
## Correlation of (regression) scores with factors   0.96 0.95 0.94
## Multiple R square of scores with factors          0.93 0.91 0.88
## Minimum correlation of possible factor scores     0.85 0.82 0.77

Our root mean square residual is .06, suggesting a good fit. Our RMSEA (.089) is a little higher than I would like, and our TLI (.808) falls short of the conventional .90 cutoff, but on balance three factors look sufficient.

fit <- factanal(na.omit(df1), factors=3, rotation="varimax")
print(fit, digits=2, cutoff=0.65, sort=TRUE)
## 
## Call:
## factanal(x = na.omit(df1), factors = 3, rotation = "varimax")
## 
## Uniquenesses:
##   Q4_1   Q4_3   Q8_9  Q11_1  Q11_2  Q11_3  Q11_4  Q11_8  Q11_9 Q11_10 Q11_11 
##   0.29   0.16   0.32   0.35   0.44   0.31   0.42   0.29   0.28   0.28   0.40 
## Q11_12 Q11_13 Q11_14  Q12_2  Q12_3  Q12_9 Q12_11 Q12_12 Q12_14 Q12_16 Q12_19 
##   0.31   0.15   0.45   0.53   0.59   0.66   0.58   0.62   0.26   0.30   0.62 
##  Q14_1  Q14_3  Q14_4  Q17_1  Q17_2  Q17_7  Q17_8 Q17_10 Q17_11 
##   0.47   0.24   0.35   0.32   0.44   0.41   0.49   0.79   0.48 
## 
## Loadings:
##        Factor1 Factor2 Factor3
## Q4_1    0.82                  
## Q4_3    0.90                  
## Q8_9    0.75                  
## Q11_4                         
## Q11_8   0.67                  
## Q11_9   0.74                  
## Q11_10  0.69                  
## Q11_12                        
## Q11_14  0.69                  
## Q12_2                         
## Q12_3                         
## Q17_1                         
## Q11_1           0.66          
## Q11_2           0.69          
## Q11_3           0.67          
## Q11_11          0.67          
## Q11_13          0.74          
## Q12_14          0.81          
## Q12_16          0.80          
## Q12_19                        
## Q17_2                         
## Q12_9                         
## Q12_11                        
## Q14_3                   0.77  
## Q14_4                   0.69  
## Q17_7                   0.66  
## Q17_8                   0.68  
## Q17_11                        
## Q12_12                        
## Q14_1                         
## Q17_10                        
## 
##                Factor1 Factor2 Factor3
## SS loadings       7.28    6.55    4.61
## Proportion Var    0.23    0.21    0.15
## Cumulative Var    0.23    0.45    0.59
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 594.72 on 375 degrees of freedom.
## The p-value is 3.37e-12

Here are our factor loadings again, this time printed with a cutoff of .65 so the smaller loadings are suppressed and the structure is easier to read. For this example, I'm going to focus on the first two factors because they contain more items and explain more of the variance.

At this point, one should look at the specific items in question and figure out what common theme the items loading on each factor share. For our data, the Factor 1 items appear to be related to Team Communication and Cohesion, while the Factor 2 items center on Project Goals and Outputs.

Let’s check our alphas to make sure our scales demonstrate internal consistency.

# Factor 1 items: Q11_1, Q11_2, Q11_3, Q11_11, Q11_13, Q12_14, Q12_16
factor1 <- df[c(16, 17, 18, 26, 28, 43, 45)]
# Factor 2 items: Q4_1, Q4_3, Q8_9, Q11_8, Q11_9, Q11_10, Q11_14
factor2 <- df[c(2, 4, 15, 23, 24, 25, 29)]
cronbach.alpha(factor1, na.rm = TRUE)
## 
## Cronbach's alpha for the 'factor1' data-set
## 
## Items: 7
## Sample units: 81
## alpha: 0.934
cronbach.alpha(factor2, na.rm = TRUE)
## 
## Cronbach's alpha for the 'factor2' data-set
## 
## Items: 7
## Sample units: 81
## alpha: 0.93

Our alphas for the two factors are both about .93, which is considered 'excellent' by convention. We have 7 items for each factor, so let's compute factor scores for each participant.

# Unit-weighted scale scores: each participant's mean response across the scale items
df$Factor1 <- rowMeans(df[, c(16, 17, 18, 26, 28, 43, 45)])
df$Factor2 <- rowMeans(df[, c(2, 4, 15, 23, 24, 25, 29)])
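These are unit-weighted scale scores: simple, transparent, and easy to explain to stakeholders. As an alternative, the fa object we fit earlier already holds regression-based factor scores for the complete cases; a quick look, assuming psych's $scores element:

head(fa.pm$scores)   # regression-based factor scores from fa(), complete cases only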

Now that we have factor scores for each participant, we can plot them against our Year variable to observe any overall trends in the data. Plus, everyone loves a nice-looking plot!

As an aside, these colors are colorblind-friendly (they come from the viridis palette)! Note that Year is wrapped in factor() so the discrete fill scale works.

f1plot <- ggplot(df, aes(x = factor(Year), y = Factor1, fill = factor(Year))) + 
  geom_bar(position = "dodge", stat = "summary", fun = "mean")   # bar height = yearly mean

f1plot <- f1plot + scale_fill_manual(values = c("#fde725", "#5ec962", "#21918c", "#3b528b", "#440154")) +
  coord_cartesian(ylim = c(3.75, 4.5))   # zoom in without dropping data from the bars

f1plot <- f1plot + stat_summary(aes(label = round(after_stat(y), 2)), fun = mean, geom = "text", size = 5, vjust = -0.5)

f1plot <- f1plot + ggtitle("Factor 1 by Year\nTeam Cohesion and Communication, α = .934") +
  theme(plot.title = element_text(size = 22, hjust = 0.5), legend.position = "none") +
  ylab("Factor 1 Scores") + xlab("Year")

f1plot

And here is what the final product looks like as a .png file.
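(To write the plot out as a .png yourself, ggsave does the job; the filename and dimensions below are just examples.)

ggsave("factor1_by_year.png", plot = f1plot, width = 8, height = 6, dpi = 300)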

Similarly, we can plot Factor 2 against our Year variable as well.

f2plot <- ggplot(df, aes(x = factor(Year), y = Factor2, fill = factor(Year))) + 
  geom_bar(position = "dodge", stat = "summary", fun = "mean")

f2plot <- f2plot + scale_fill_manual(values = c("#fde725", "#5ec962", "#21918c", "#3b528b", "#440154")) +
  coord_cartesian(ylim = c(4.5, 5.75))

f2plot <- f2plot + stat_summary(aes(label = round(after_stat(y), 2)), fun = mean, geom = "text", size = 5, vjust = -0.5)

f2plot <- f2plot + ggtitle("Factor 2 by Year\nProject Goals and Outputs, α = .930") +
  theme(plot.title = element_text(size = 22, hjust = 0.5), legend.position = "none") +
  ylab("Factor 2 Scores") + xlab("Year")

f2plot

And again, here is the final image in .png form.