
Psychological Test and Assessment Modeling



2019-2

Human rater monitoring with automated scoring engines
Hyo Jeong Shin, Edward Wolfe & Mark Wilson
PDF of the full article

Item Response Theory (IRT) analysis of the SKT Short Cognitive Performance Test
Mark Stemmler, Ferdinand Keller & Raphaela Fasan
PDF of the full article

LOO and WAIC as model selection methods for polytomous items
Yong Luo
PDF of the full article

Estimation of the parameters of the reduced RUM model by simulated annealing
Youn Seon Lim, Fritz Drasgow & Justin Kern
PDF of the full article

NRM-based scoring methods for situational judgment tests
Hongwen Guo, Jiyun Zu & Patrick C. Kyllonen
PDF of the full article

Relationship between item characteristics and detection of Differential Item Functioning under the MIMIC model
Daniella A. Rebouças & Ying Cheng
PDF of the full article
 


Human rater monitoring with automated scoring engines
Hyo Jeong Shin, Edward Wolfe & Mark Wilson

Abstract

This case study applies mixed-effects ordered probit models to use scores from automated scoring engines (AE) to monitor and provide diagnostic feedback to human raters under training. Using data from an experimental rater training study, we illustrate a statistical approach for analyzing three types of model-based rater effects – severity, accuracy, and centrality – for each rater. Each rater effect is related to model parameters and compared for cases in which (a) the AE is treated as the gold standard and (b) a human expert (HE) is treated as the gold standard. Results showed that the AE and HE scoring approaches agreed perfectly (100%) in detecting severity; agreement was somewhat lower for centrality (93.1%) and considerably lower for accuracy (66.4%). As a targeted case study, this examination concludes with practical implications and cautions for AE-based rater monitoring.

Keywords: automated scoring, human scoring, rater effects, rater monitoring
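The abstract does not reproduce the authors' full mixed-effects specification, but the building block it rests on – ordered probit category probabilities – can be sketched as follows. All parameter values are illustrative, and subtracting a rater severity term from the linear predictor is one simple way to encode the kind of rater effect the paper analyzes:

```python
import math

def phi(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ordered_probit_probs(eta, thresholds):
    """Category probabilities P(Y = k) under an ordered probit model.

    eta        -- linear predictor (e.g., examinee proficiency minus
                  rater severity in a mixed-effects specification)
    thresholds -- increasing cut points tau_1 < ... < tau_{K-1}
    """
    cuts = [phi(t - eta) for t in thresholds]
    probs = []
    prev = 0.0
    for c in cuts:
        probs.append(c - prev)
        prev = c
    probs.append(1.0 - prev)
    return probs

# A more severe rater (larger severity) shifts mass toward low categories.
lenient = ordered_probit_probs(eta=0.5 - 0.0, thresholds=[-1.0, 0.0, 1.0])
severe = ordered_probit_probs(eta=0.5 - 1.0, thresholds=[-1.0, 0.0, 1.0])
```

Comparing the two probability vectors shows how a severity shift moves probability mass toward lower score categories, which is the kind of pattern a rater-monitoring model can flag.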


Hyo Jeong Shin, PhD
University of California at Berkeley
Educational Testing Service
660 Rosedale Road, 13-E
Princeton, NJ 08541, USA

 




LOO and WAIC as model selection methods for polytomous items
Yong Luo

Abstract

The Watanabe-Akaike information criterion (WAIC; Watanabe, 2010) and leave-one-out cross-validation (LOO) are two fully Bayesian model selection methods that have been shown to outperform traditional information-criterion-based methods such as AIC, BIC, and DIC in the context of dichotomous IRT model selection. In this paper, we investigated whether the superior performance of WAIC and LOO generalizes to polytomous IRT model selection. Specifically, we conducted a simulation study comparing the statistical power of WAIC and LOO with that of AIC, BIC, AICc, SABIC, and DIC in selecting the optimal model among a group of polytomous IRT models. We also used a real data set to demonstrate the use of LOO and WAIC for polytomous IRT model selection. The findings suggest that while all seven methods have excellent statistical power (greater than 0.93) to identify the true polytomous IRT model, WAIC and LOO have slightly lower power than DIC, whose performance is in turn marginally inferior to that of AIC, BIC, AICc, and SABIC.

Keywords: polytomous IRT, Bayesian, MCMC, model comparison
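The abstract gives no formulas, but WAIC can be computed directly from a matrix of pointwise posterior log-likelihoods. The sketch below follows Watanabe's definition with the variance-based effective-parameter penalty; the data shapes are illustrative, and LOO is omitted because in practice it is estimated via Pareto-smoothed importance sampling rather than a few lines of code:

```python
import math
from statistics import variance

def waic(log_lik):
    """WAIC on the deviance scale from an S x N matrix of pointwise
    log-likelihoods (S posterior draws, N observations)."""
    n_obs = len(log_lik[0])
    lppd = 0.0    # log pointwise predictive density
    p_waic = 0.0  # effective number of parameters
    for i in range(n_obs):
        col = [row[i] for row in log_lik]
        m = max(col)  # log-sum-exp for numerical stability
        lppd += m + math.log(sum(math.exp(c - m) for c in col) / len(col))
        p_waic += variance(col)
    return -2.0 * (lppd - p_waic)
```

With identical draws the penalty vanishes, so `waic([[-1.0, -2.0], [-1.0, -2.0]])` reduces to -2 times the summed log-likelihood, i.e. 6.0; in model selection, the candidate with the smallest WAIC is preferred.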


Yong Luo, PhD
National Center for Assessment
West Palm Neighborhood
King Khalid Road
Riyadh, 11534, Saudi Arabia

 


Estimation of the parameters of the reduced RUM model by simulated annealing
Youn Seon Lim, Fritz Drasgow & Justin Kern

Abstract

In this study, a simulation-based method for computing joint maximum likelihood estimates of the parameters of the reduced reparameterized unified model is proposed. The central theme of the approach is to reduce the complexity of models to focus on their most critical elements. In particular, an approach analogous to joint maximum likelihood estimation is taken, and the latent attribute vectors are regarded as structural parameters rather than parameters to be removed by integration. With this approach, the joint distribution of the latent attributes does not have to be specified, which reduces the number of parameters in the model.

Keywords: cognitive diagnosis model, reduced reparameterized unified model, Markov Chain Monte Carlo, joint maximum likelihood, simulated annealing
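Simulated annealing itself is a generic stochastic optimizer; the sketch below shows the standard accept/reject scheme with a geometric cooling schedule applied to a toy likelihood, not the authors' reduced-RUM implementation. All tuning constants and the example objective are illustrative:

```python
import math
import random

def simulated_annealing(neg_log_lik, init, step=0.5,
                        t0=1.0, cooling=0.95, iters=2000, seed=1):
    """Generic simulated annealing minimizer for a scalar parameter.

    neg_log_lik -- objective to minimize (negative log-likelihood)
    init        -- starting parameter value
    """
    rng = random.Random(seed)
    x, fx = init, neg_log_lik(init)
    best, fbest = x, fx
    t = t0
    for _ in range(iters):
        cand = x + rng.gauss(0.0, step)
        fc = neg_log_lik(cand)
        # Accept downhill moves always; uphill moves with Boltzmann probability
        if fc < fx or rng.random() < math.exp(-(fc - fx) / t):
            x, fx = cand, fc
            if fx < fbest:
                best, fbest = x, fx
        t *= cooling  # geometric cooling schedule
    return best

# Toy example: recover the mean of a normal sample by maximizing its
# likelihood (equivalently, minimizing the sum of squared deviations)
data = [1.8, 2.1, 2.4, 1.9, 2.2]
nll = lambda mu: sum((y - mu) ** 2 for y in data)
est = simulated_annealing(nll, init=0.0)
```

The early high-temperature phase lets the chain escape local optima; as the temperature cools, the search settles near the maximum likelihood solution (here, the sample mean).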


Youn Seon Lim
Department of Science Education
Donald and Barbara Zucker School of Medicine at Hofstra/Northwell
500 Hofstra University
Hempstead, NY 11549, USA

 


NRM-based scoring methods for situational judgment tests
Hongwen Guo, Jiyun Zu & Patrick C. Kyllonen

Abstract

Situational judgment tests (SJTs) show useful levels of validity as predictors of job performance. However, scoring SJTs is challenging. We propose nominal response model (NRM)-based scoring methods for SJTs. Using real data from an SJT, we illustrate how to set up the NRM-based scoring rules and their rationales, how to examine dimensionality and reliability, and how to evaluate item, measurement, and score invariance across subgroups at different time points. We also compare the NRM-based scores with other commonly used scoring approaches in terms of their relationships with relevant external variables for the studied SJT.

Keywords: NRM, SJT, scoring rules, reliability, validity
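Under the NRM, category probabilities are a softmax of category-specific slopes and intercepts. The sketch below is a minimal illustration (function names and the expectation-based scoring rule are this sketch's own, not necessarily the scoring rules developed in the paper):

```python
import math

def nrm_probs(theta, slopes, intercepts):
    """Nominal response model category probabilities:
    P(Y = k | theta) proportional to exp(a_k * theta + c_k)."""
    z = [a * theta + c for a, c in zip(slopes, intercepts)]
    m = max(z)  # stabilize the softmax
    ez = [math.exp(v - m) for v in z]
    s = sum(ez)
    return [v / s for v in ez]

def expected_score(theta, slopes, intercepts, category_scores):
    """One plausible NRM-based scoring rule: weight each category's
    assigned score by its model-implied probability."""
    probs = nrm_probs(theta, slopes, intercepts)
    return sum(p * s for p, s in zip(probs, category_scores))
```

Because the NRM orders categories empirically through the estimated slopes rather than assuming an a priori order, scores built this way can use information that conventional keyed scoring discards.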


Hongwen Guo
Educational Testing Service
660 Rosedale Rd MS 12-T
Princeton NJ 08541, USA

 


Relationship between item characteristics and detection of Differential Item Functioning under the MIMIC model
Daniella A. Rebouças & Ying Cheng

Abstract

Differential item functioning (DIF) occurs when individuals with the same true latent ability or psychological trait, but from different demographic populations, have different chances of endorsing an item category. The ability to identify such items depends on many factors, including the sample size of each demographic group, the average true latent trait score in each group, the chosen DIF assessment method, the magnitude of the DIF effect, and the quality of the anchor set. An anchor is a group of items free of DIF that establishes a common metric between groups. If the anchor is contaminated, that is, if it contains a DIF item, the common metric is inappropriate. The current literature rarely addresses the relationship between item parameters, anchor selection, and subsequent DIF detection. In this two-part study, we show that the power of DIF detection is high when the anchor contains highly discriminating items. Additionally, DIF items with large discrimination and moderate difficulty are generally detected with high power when a correctly specified anchor is used, given a fixed DIF effect size. Implications for anchor selection and DIF effect size research are discussed.

Keywords: differential item functioning, anchor, item difficulty, item discrimination, effect size
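The MIMIC approach tests for a direct effect of group membership on an item response after conditioning on the latent trait. The toy sketch below illustrates uniform DIF in a 2PL-style item; the parameter values are illustrative, and a logistic link is used here for simplicity where the MIMIC model itself is typically formulated with a probit link:

```python
import math

def p_endorse(theta, a, b, group, dif_effect=0.0):
    """2PL-style endorsement probability with a possible uniform DIF
    shift for the focal group (group = 1), analogous to the direct
    group effect tested in a MIMIC model."""
    logit = a * (theta - b) + dif_effect * group
    return 1.0 / (1.0 + math.exp(-logit))

# Same true latent trait, but the focal group is disadvantaged on this item
ref = p_endorse(theta=0.0, a=1.2, b=0.0, group=0, dif_effect=-0.5)
foc = p_endorse(theta=0.0, a=1.2, b=0.0, group=1, dif_effect=-0.5)
```

A DIF-free anchor is what allows the two groups' trait scales to be linked so that a nonzero direct group effect like `dif_effect` can be attributed to the item rather than to a true trait difference.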


Ying Cheng
University of Notre Dame
390 Corbett Family Hall
Notre Dame, IN, 46556, USA

 



Psychological Test and Assessment Modeling
Volume 61 · 2019 · Issue 2

Pabst, 2019
ISSN 2190-0493 (Print)
ISSN 2190-0507 (Internet)





