Multi-Facet Rasch Models with R

In this post, we’ll explore the Multi-Facet Rasch Model, understand how it works, and demonstrate how to fit this model using R. We’ll also visualize the results to better interpret them.

readr
tidyr
ggplot2
dplyr
tam
cowplot
Author
Affiliations

Ali Emre Karagül

Published

October 17, 2024

Mailchimp Subscription Modal

Introduction

While a basic Rasch model focuses on item difficulty and person ability, the Multi-Facet Rasch Model (MFRM) allows us to incorporate additional factors, or facets, such as:

  • Person Ability (e.g., the skill level of test-takers),

  • Item Difficulty (e.g., how hard the test items are),

  • Rater Severity (e.g., how lenient or strict raters are),

  • Task or Stimulus Differences (e.g., variation in tasks given).

MFRM is said to be IRT version of generalizability theory and it is particularly useful when assessments involve subjective judgments, like in essay grading or performance evaluation, where raters’ subjectivity can introduce bias.

1. Understand the data

For MFRM analysis, we are going to use a dataset of essay scores scored on an analytical rubric. There are four domains of the rubric: Content, Prompt Adherence, Language, and Narrativity. Let’s load the data and see the head of them. You can download the data for your own use from here.

data <- read_csv("MFRM_data.csv", show_col_types = FALSE)
head(data)
# A tibble: 6 × 6
    Con ProAd  Lang   Nar rater essayId
  <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
1     4     4     3     3     1       1
2     2     2     4     3     1       2
3     4     4     4     4     1       3
4     3     2     4     4     1       4
5     4     4     4     4     1       5
6     1     1     1     2     1       6

Let’s see the structure and summary of the data too.

str(data)
spc_tbl_ [5,400 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Con    : num [1:5400] 4 2 4 3 4 1 1 2 4 3 ...
 $ ProAd  : num [1:5400] 4 2 4 2 4 1 1 2 4 3 ...
 $ Lang   : num [1:5400] 3 4 4 4 4 1 1 4 4 4 ...
 $ Nar    : num [1:5400] 3 3 4 4 4 2 1 4 4 2 ...
 $ rater  : num [1:5400] 1 1 1 1 1 1 1 1 1 1 ...
 $ essayId: num [1:5400] 1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, "spec")=
  .. cols(
  ..   Con = col_double(),
  ..   ProAd = col_double(),
  ..   Lang = col_double(),
  ..   Nar = col_double(),
  ..   rater = col_double(),
  ..   essayId = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
summary(data)
      Con            ProAd            Lang            Nar            rater  
 Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :1  
 1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1  
 Median :2.000   Median :2.000   Median :2.000   Median :2.000   Median :2  
 Mean   :1.878   Mean   :1.784   Mean   :2.062   Mean   :1.958   Mean   :2  
 3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:3  
 Max.   :4.000   Max.   :4.000   Max.   :4.000   Max.   :4.000   Max.   :3  
    essayId      
 Min.   :   1.0  
 1st Qu.: 450.8  
 Median : 900.5  
 Mean   : 900.5  
 3rd Qu.:1350.2  
 Max.   :1800.0  

There are 1800 rows of data. Each domain is scored between 0 and 4. Perfect as zero must exist in the ordinal data for MFRM. The data set consists of scores from three raters on four domains, so we need to account for the three key facets: person ability (the essays), item difficulty (the domains), and rater severity (the three raters).

2. Fit the model

Now that we’ve explored our data set, it’s time to fit the Multi-Facet Rasch Model (MFRM). To do this, we’ll use the TAM package in R, which provides functions for fitting various Rasch models, including MFRM. The formulaA provided into the mfr function decides on the model. For PCM, we define interaction between item and step along with the rater facet:

facets <- data[, "rater", drop=FALSE]  # define facet (rater)
pid <- data$essayId  # define person identifier
resp <- data[, -c(5:6)]  # item response data
formulaA <- ~item*step + rater   # formula for PCM


mod <- TAM::tam.mml.mfr(resp=resp, facets=facets, formulaA=formulaA, pid=data$essayId)

3. Model diagnostics

Now that we’ve fitted our Multi-Facet Rasch Model (MFRM) with PCM, let’s take a closer look at the results and explore some diagnostics.

fit_stats <- TAM::tam.fit(mod)
Item fit calculation based on 15 simulations
|**********|
|----------|
print(fit_stats)
$itemfit
     parameter    Outfit   Outfit_t      Outfit_p  Outfit_pholm     Infit
1          Con 2.0591522  39.361636  0.000000e+00  0.000000e+00 2.1979266
2        ProAd 1.5301160  21.894545 2.928160e-106 2.635344e-105 1.6156500
3         Lang 2.2535201  47.334138  0.000000e+00  0.000000e+00 2.3403618
4          Nar 1.8429272  33.147291 6.196043e-241 6.815647e-240 1.9492346
5        step1 0.9494058  -3.647927  2.643645e-04  5.287290e-04 0.9369712
6        step2 0.4917571 -52.934061  0.000000e+00  0.000000e+00 0.4102024
7        step3 0.2640096 -96.962995  0.000000e+00  0.000000e+00 0.1816552
8        step4 0.9177011  -6.474657  9.502735e-11  3.801094e-10 0.5978796
9       rater1 0.2665086 -83.397247  0.000000e+00  0.000000e+00 0.2045924
10      rater2 0.2126247 -96.365947  0.000000e+00  0.000000e+00 0.1305165
11      rater3 0.3471616 -70.514947  0.000000e+00  0.000000e+00 0.1985980
12   Con:step1 0.8398256  -8.668278  4.387050e-18  2.193525e-17 0.8568111
13 ProAd:step1 0.6746766 -17.811336  5.771558e-71  4.617246e-70 0.7091680
14  Lang:step1 0.5180443 -29.210127 1.442217e-187 1.442217e-186 0.5202205
15   Nar:step1 0.6858151 -16.202919  4.808891e-59  3.366224e-58 0.7063609
16   Con:step2 0.5296627 -33.996303 2.526439e-253 3.031727e-252 0.4427496
17 ProAd:step2 0.4289933 -42.461731  0.000000e+00  0.000000e+00 0.3497186
18  Lang:step2 0.1969640 -75.560345  0.000000e+00  0.000000e+00 0.1640074
19   Nar:step2 0.3698636 -46.072478  0.000000e+00  0.000000e+00 0.3082262
20   Con:step3 0.2804214 -66.678683  0.000000e+00  0.000000e+00 0.2265406
21 ProAd:step3 0.2855422 -62.035576  0.000000e+00  0.000000e+00 0.2089630
22  Lang:step3 0.1888063 -83.889912  0.000000e+00  0.000000e+00 0.1150833
23   Nar:step3 0.2510021 -64.460639  0.000000e+00  0.000000e+00 0.1836688
24   Con:step4 0.9436611  -3.155082  1.604530e-03  1.604530e-03 0.5763329
25 ProAd:step4 0.8958462  -5.556046  2.759535e-08  8.278605e-08 0.5200940
26  Lang:step4 0.4643230 -36.676871 1.707308e-294 2.219500e-293 0.2799968
27   Nar:step4 0.7677798 -12.393932  2.818774e-35  1.691264e-34 0.4609557
       Infit_t       Infit_p   Infit_pholm
1    43.404732  0.000000e+00  0.000000e+00
2    24.924232 4.064309e-137 2.032154e-136
3    49.864878  0.000000e+00  0.000000e+00
4    36.540091 2.561982e-292 2.818181e-291
5    -4.563129  5.039684e-06  5.039684e-06
6   -64.563804  0.000000e+00  0.000000e+00
7  -117.289175  0.000000e+00  0.000000e+00
8   -36.103138 2.024933e-285 2.024933e-284
9   -96.095647  0.000000e+00  0.000000e+00
10 -117.784465  0.000000e+00  0.000000e+00
11  -98.836404  0.000000e+00  0.000000e+00
12   -7.698737  1.374172e-14  2.748343e-14
13  -15.680881  2.043984e-55  8.175938e-55
14  -29.043052 1.883337e-185 1.318336e-184
15  -15.006236  6.683443e-51  2.005033e-50
16  -42.353408  0.000000e+00  0.000000e+00
17  -51.040009  0.000000e+00  0.000000e+00
18  -81.785151  0.000000e+00  0.000000e+00
19  -52.982695  0.000000e+00  0.000000e+00
20  -75.326822  0.000000e+00  0.000000e+00
21  -73.806223  0.000000e+00  0.000000e+00
22 -101.040972  0.000000e+00  0.000000e+00
23  -75.311205  0.000000e+00  0.000000e+00
24  -27.699201 7.138532e-169 4.283119e-168
25  -30.243052 6.438903e-201 5.151122e-200
26  -56.207040  0.000000e+00  0.000000e+00
27  -33.454655 2.202305e-245 1.982075e-244

$time
[1] "2024-12-05 13:54:56 +03" "2024-12-05 13:54:56 +03"

$CALL
TAM::tam.fit(tamobj = mod)

attr(,"class")
[1] "tam.fit"
  • Infit and Outfit Values:

    • Infit and Outfit near 1: Indicates that the item fits well with the model.

    • Infit/Outfit significantly >1: Indicates that the item is underfitting, meaning there is more variability in the responses than the model expects (perhaps caused by noise or misfitting responses).

    • Infit/Outfit significantly <1: Indicates that the item is overfitting, meaning the responses are too predictable, and there’s less variability than expected by the model (possibly due to redundancy or lack of challenge).

    Let’s break down a few examples from the output:

    • Content (Con):

      • Outfit MNSQ: 2.05, Infit MNSQ: 2.18

      • These values are well above 1, indicating underfit. The item “Con” might be too noisy or not behaving consistently with the model.

    • Prompt Adherence (ProAd):

      • Outfit MNSQ: 1.54, Infit MNSQ: 1.63

      • These values are higher than 1 but still in the acceptable range, meaning there’s some noise, but it’s not excessive.

    • Language (Lang):

      • Outfit MNSQ: 2.26, Infit MNSQ: 2.34

      • These values suggest significant underfit, similar to “Con”, indicating that responses to this domain might be less consistent or more unpredictable than the model expects.

    • Steps (step1 to step4):

      • Some steps, such as step1, have Infit and Outfit values closer to 1 (e.g., Outfit MNSQ: 0.95, Infit MNSQ: 0.94). These are acceptable and suggest that step1 is fitting well.

      • However, step2, step3, and step4 show extremely low values, especially for Outfit (e.g., step3 has an Outfit MNSQ of 0.26), indicating overfit, meaning these categories are too predictable and might not differentiate well between respondents.

  • t-statistics and p-values:

    • The t-statistics (Outfit_t, Infit_t) are standardized fit statistics that test whether the Infit/Outfit values significantly differ from 1. Large positive or negative t-values indicate significant deviation from expected values.

    • The p-values (Outfit_p, Infit_p) show whether these deviations are statistically significant. Nearly all p-values are extremely low (close to 0), suggesting that most of the items are statistically misfitting according to the model.

  • Rater Severity:

    • Rater1, Rater2, Rater3 all have very low Infit and Outfit values (e.g., Rater1 Outfit: 0.26), which suggest that these raters may be overfitting. This could mean that they are scoring in a highly predictable way, possibly being too strict or lenient consistently.
reliability <- mod$EAP.rel
reliability
[1] 0.9716057

The final WLE Reliability = 0.97 is an excellent reliability score, meaning that the person ability estimates are very consistent. WLE reliability, similar to other reliability coefficients like Cronbach’s alpha, indicates the precision of the estimates:

A 0.97 reliability means that 97% of the variance in the person ability estimates is due to true differences in ability rather than measurement error.

persons.mod <- TAM::tam.wle(mod)
Iteration in WLE/MLE estimation  1   | Maximal change  2.8113 
Iteration in WLE/MLE estimation  2   | Maximal change  2.8413 
Iteration in WLE/MLE estimation  3   | Maximal change  2.5741 
Iteration in WLE/MLE estimation  4   | Maximal change  2.1203 
Iteration in WLE/MLE estimation  5   | Maximal change  2.3278 
Iteration in WLE/MLE estimation  6   | Maximal change  0.4228 
Iteration in WLE/MLE estimation  7   | Maximal change  0.0637 
Iteration in WLE/MLE estimation  8   | Maximal change  0.0071 
Iteration in WLE/MLE estimation  9   | Maximal change  8e-04 
Iteration in WLE/MLE estimation  10   | Maximal change  1e-04 
----
 WLE Reliability= 0.97 
ggplot(data.frame(theta = persons.mod$theta), aes(x = theta)) +
  geom_histogram(binwidth = 0.2, fill = "steelblue", color = "black") +
  labs(title = "Distribution of Person Ability Estimates", x = "Ability (Theta)", y = "Frequency")

Also see the theta distributions in the chart. They do not look nice as this is a study run on simulated data.

thr <- TAM::tam.threshold(mod)
thr
                    Cat1       Cat2       Cat3     Cat4
Con-rater1    -8.3634338 -4.8121033 -1.2173767 2.406464
Con-rater2    -4.3898621 -0.8385315  2.7560120 6.380035
Con-rater3    -0.5052795  3.0462341  6.6957092       NA
ProAd-rater1  -8.5340881 -4.4764709 -0.6612854 3.438995
ProAd-rater2  -4.5606995 -0.5028992  3.3122864 7.412567
ProAd-rater3  -0.6759338  3.3818665  7.2305603       NA
Lang-rater1  -10.0357361 -5.8611145 -1.8104553 2.433380
Lang-rater2   -6.0621643 -1.8877258  2.1631165 6.406952
Lang-rater3   -2.1775818  1.9970398  6.0768127       NA
Nar-rater1    -9.4873352 -5.3533630 -1.3221130 2.769928
Nar-rater2    -5.5137634 -1.3799744  2.6514587 6.743317
Nar-rater3    -1.6289978  2.5047913  6.5700989       NA

Ordered thresholds are crucial to ensure that the categories are functioning properly. For example, for Con-rater1, the thresholds are: -8.36, -4.81, -1.21, and 2.41

These thresholds are in increasing order, which indicates that the rating scale is working as intended for Con-rater1—the higher categories represent more difficult levels to achieve.

Raters 1, 2, and 3 Comparison:

  • Rater Differences: There are noticeable differences between raters in their thresholds. For example:

    • Con-rater1 has very negative thresholds, starting at -8.36 for Cat1, while Con-rater3 starts much higher, with thresholds beginning at -0.50.

    • This suggests that Rater1 is much stricter or uses a harsher scale, while Rater3 is more lenient, with easier transitions between categories. For instance, it is harder for essays to move from a “1” to a “2” under Rater1’s scoring compared to Rater3.

    • There are some NAs which actually I have no idea about. Peobably these inconsistancies occur due to the simulated data.

4. Visualizing the Results

To make the interpretation more intuitive, we can visualize the item difficulty and rater severity using a dot plot for the difficulty estimates, which can help us compare how each domain and rater behaves. Here’s how we can generate these plots in R.

facet_params<-mod[["xsi.facets"]][["parameter"]]
domain_params<-facet_params[1:4]

f1 <- ggplot(data = persons.mod, aes(x = theta))+
  geom_dotplot(binwidth = .1, stackdir = "down") + 
  theme_bw()  +
  scale_y_continuous(name = "", breaks = NULL) +
  scale_x_continuous(breaks=seq(-6, 6, .6), limits=c(-6, 6), 
                     position =  "top") + 
  theme(axis.title.y = element_blank(), 
        axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  labs(title = "Persons") + 
  coord_flip()

f2 <- mod$xsi.facets %>%
  filter(str_starts(parameter, "rater")) %>%
  ggplot(aes(x = xsi)) +
  geom_text(aes(y = 0, label = parameter), nudge_y = 0.05, size = 3) +
  theme_bw() +
  scale_y_continuous(name = "", breaks = NULL) +
  scale_x_continuous(breaks = seq(-6, 6, .5), limits = c(-6, 6), position = "top") + 
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.x = element_blank()) +
  labs(title = "Raters") + 
  coord_flip()

f3 <- mod$xsi.facets %>%
  filter(parameter %in% domain_params) %>%
  ggplot(aes(x = xsi)) +
  geom_text(aes(y = 0, label = parameter), nudge_y = 0.05, size = 3) + 
    theme_bw()  +
    scale_y_continuous(name = "", breaks = NULL) +
    scale_x_continuous(breaks=seq(-2, 2, .2), limits=c(-2, 2), 
                       position =  "top") + 
    theme(axis.title.y=element_blank(),
          axis.text.y=element_blank(),
          axis.ticks.y=element_blank(),
          axis.title.x= element_blank())+
    labs(title = "Domain") + 
    coord_flip()

plot_grid(f1, f2, f3, nrow = 1, rel_widths = c(0.7, .15, .15))

This final chart is developed as an alternative to wrightmap. Each facet can be seen easily on it. There are four grids. The first is the person thetas. We have seen this above. The second is the rater facet. The strictness of the raters are very distinctive. Actually the only real data here is rater 2 and the others were simulated using it to be stricter and lenient. So we exactly see what the data is about. The third grid is about the domains/item difficulty. Pompt Adherence is the most difficult domain. Content, Narrative and Language follows it respectively.

5. Conclusion

In this post, we explored the Multi-Facet Rasch Model (MFRM) using simulated essay scores rated by multiple raters across four different domains: Content, Prompt Adherence, Language, and Narrativity. The model helped us account for the varying levels of item difficulty and the potential differences in rater severity. By fitting the MFRM and examining key model outputs—like Infit/Outfit statistics and thresholds—we identified areas where raters were either more lenient or more severe, and items that displayed more noise or predictability than expected.

The high WLE reliability of 0.97 indicates that the model provides consistent and accurate estimates of person abilities. However, the rater-specific thresholds revealed some important differences in how each rater scored the essays, with certain raters being significantly stricter or more lenient. This highlights the importance of accounting for rater bias in assessments that rely on subjective judgments, such as essay scoring.

Going forward, addressing these rater differences and ensuring well-functioning rating categories can further refine the assessment process. By doing so, we can ensure that the scores are fairer and more representative of true essay quality, free from the influence of individual rater biases. Overall, the MFRM proves to be a valuable tool in maintaining the validity and reliability of assessments involving subjective judgments.