CONTENTS
Identifying Aberrant Responses in Intelligent Tutoring Systems: An Application of Anomaly Detection Methods
Guher Gorgun & Okan Bulut
A Machine Learning Approach for Detecting Item Compromise and Preknowledge in Computerized Adaptive Testing
Yiqin Pan, Sandip Sinharay, Oren Livne & James A. Wollack
Data Augmentation in Machine Learning for Cheating Detection in Large-Scale Assessment: An Illustration with the Blending Ensemble Learning Algorithm
Todd Zhou & Hong Jiao
Considerations in Using XGBoost Models with SHAP Credit Assignment to Calculate Student Growth Percentiles
Steven Tang & Zhen Li
Automated Scoring of Constructed-Response Items Using Artificial Neural Networks in International Large-scale Assessment
Ji Yoon Jung, Lillian Tyack & Matthias von Davier
Mapping Between Hidden States and Features to Validate Automated Essay Scoring Using DeBERTa Models
Christopher Michael Ormerod
Identifying Aberrant Responses in Intelligent Tutoring Systems: An Application of Anomaly Detection Methods
Guher Gorgun & Okan Bulut
Abstract
Examinees’ unexpected response behaviors during an assessment may lead to aberrant responses that degrade data quality. Because aberrant responses may jeopardize the validity of inferences made from assessment results, they should be addressed in order to model students’ learning and progress more accurately. Although the detection of aberrant responses is widely studied in non-interactive, low-stakes assessments, exploring aberrant responses in interactive assessment environments such as intelligent tutoring systems (ITS) is a relatively new avenue of research. Furthermore, current aberrant response detection methods are not feasible in the ITS context because of the extreme sparsity of response data. In this study, we employed six unsupervised anomaly detection methods (Gaussian Mixture Model, Bayesian Gaussian Mixture Model, Isolation Forest, Mahalanobis Distance, Local Outlier Factor, and Elliptic Envelope) to identify aberrant responses in an ITS environment. We compared the results of these methods with each other and explored their association with students’ affective states. We found that the anomaly detection methods flagged similar responses as aberrant, although Local Outlier Factor yielded very different results. Mahalanobis Distance appeared to be a conservative approach to detecting aberrant responses, whereas Isolation Forest and Gaussian Mixture emerged as more liberal. Overall, unsupervised anomaly detection methods provide a viable option for identifying aberrant responses in ITS. We recommend that researchers and practitioners consider using multiple anomaly detection methods to identify aberrant responses more accurately.
Keywords: aberrant responding, unsupervised anomaly detection, intelligent tutoring system, response time, hint use
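The six detectors named in the abstract are all available in standard Python libraries. The sketch below is an editorial illustration rather than code from the article: it applies each detector to a toy matrix of response-level features, and the feature set, the two mixture components, and the 5% flagging rate are assumptions made for the example only.

import numpy as np
from scipy.stats import chi2
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.mixture import BayesianGaussianMixture, GaussianMixture
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Toy feature matrix: one row per response, columns standing in for features
# such as log response time, correctness, and number of hints used.
X = rng.normal(size=(1000, 3))

flags = {}

# Isolation Forest, Local Outlier Factor, and Elliptic Envelope return -1
# for observations they flag and 1 otherwise.
flags["isolation forest"] = IsolationForest(contamination=0.05, random_state=0).fit_predict(X) == -1
flags["local outlier factor"] = LocalOutlierFactor(contamination=0.05).fit_predict(X) == -1
flags["elliptic envelope"] = EllipticEnvelope(contamination=0.05).fit_predict(X) == -1

# Mixture models: flag responses in the lowest 5% of the fitted log-density.
for name, model in [("gaussian mixture", GaussianMixture(n_components=2, random_state=0)),
                    ("bayesian gaussian mixture", BayesianGaussianMixture(n_components=2, random_state=0))]:
    log_density = model.fit(X).score_samples(X)
    flags[name] = log_density < np.quantile(log_density, 0.05)

# Mahalanobis distance: flag responses beyond the chi-square critical value.
centered = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
flags["mahalanobis distance"] = d2 > chi2.ppf(0.95, df=X.shape[1])

for name, flagged in flags.items():
    print(f"{name}: {flagged.sum()} of {len(X)} responses flagged")

Comparing flags across detectors, as the authors recommend, then amounts to intersecting or cross-tabulating these boolean vectors.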
Guher Gorgun
Measurement, Evaluation, and Data Science
Faculty of Education
University of Alberta
6-110 Education Centre North
11210 87 Ave, NW,
Edmonton, AB, T6G 2G5, Canada
email: gorgun@ualberta.ca
A Machine Learning Approach for Detecting Item Compromise and Preknowledge in Computerized Adaptive Testing
Yiqin Pan, Sandip Sinharay, Oren Livne & James A. Wollack
Abstract
Item compromise and preknowledge have become common concerns in educational testing. We propose a machine learning approach to simultaneously detect compromised items and examinees with item preknowledge in computerized adaptive testing. The suggested approach provides a confidence score representing the confidence that the detection result truly corresponds to item preknowledge, and it draws on ideas from ensemble learning, conducting multiple detections independently on subsets of the data and then combining the results. Each detection first classifies a set of responses as aberrant using a self-training algorithm and a support vector machine, and then identifies suspicious examinees and items based on the classification result. The confidence score is adapted, using the autoencoder algorithm, from the confidence score that Pan and Wollack (2022) suggested for non-adaptive tests. Simulation studies demonstrate that the proposed approach performs well in detecting item preknowledge and that the confidence score can provide helpful information for practitioners.
Keywords: test security, item preknowledge, machine learning, computerized adaptive testing, support vector machine, autoencoder
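To make the self-training step concrete, the following sketch (not taken from the article) combines scikit-learn's SelfTrainingClassifier with a probabilistic support vector machine and then aggregates response-level flags into per-examinee suspicion scores. The features, the seed labels, and the 0.8 confidence threshold are illustrative assumptions; the article's CAT-specific features, repeated subset detections, and autoencoder-based confidence score are not reproduced here.

import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Toy response-level features (e.g., scored response, standardized response time);
# the article's CAT-specific feature set is richer than this.
X = rng.normal(size=(500, 4))

# A small seed of labels: 1 = aberrant, 0 = normal, -1 = unlabeled.
y = np.full(500, -1)
y[:20] = 1    # hypothetical responses assumed aberrant
y[20:60] = 0  # hypothetical responses assumed normal

# Self-training wraps a probabilistic SVM and iteratively labels the remaining
# responses it is most confident about (0.8 is an illustrative threshold).
clf = SelfTrainingClassifier(SVC(probability=True, random_state=1), threshold=0.8)
clf.fit(X, y)
aberrant = clf.predict(X) == 1

# Aggregate response-level flags into a suspicion score per examinee
# (a toy examinee index here; the same idea applies to items).
examinee = rng.integers(0, 50, size=500)
counts = np.bincount(examinee, minlength=50)
suspicion = np.bincount(examinee, weights=aberrant.astype(float), minlength=50) / np.maximum(counts, 1)
print("most suspicious examinees:", np.argsort(suspicion)[::-1][:5])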
Yiqin Pan
Research and Evaluation Methodology
College of Education
University of Florida
1215 Norman Hall
Gainesville, FL 32611, USA.
ypan@coe.ufl.edu
Data Augmentation in Machine Learning for Cheating Detection in Large-Scale Assessment: An Illustration with the Blending Ensemble Learning Algorithm
Todd Zhou & Hong Jiao
Abstract
Machine learning methods have been explored for cheating detection in large-scale assessment in recent years. Most of these studies analyzed item responses and response time data. Although a few studies investigated data augmentation in the feature space, data augmentation in machine learning for cheating detection remains far from thoroughly investigated. This study explored data augmentation of the feature space for blending ensemble learning at the meta-model level for cheating detection. Four anomaly detection techniques assigned outlier scores that augmented the meta-model’s input data, in addition to the most informative features from the original dataset identified by four feature selection methods. The performance of the meta-model with data augmentation was compared with that of each base model and of the meta-model without data augmentation. Based on the evaluation criteria, the best-performing meta-model with data augmentation was identified. In general, data augmentation in blending ensemble learning for cheating detection greatly improved the accuracy of cheating detection compared with the alternative approaches.
Keywords: data augmentation, cheating detection, blending ensemble learning, anomaly detection algorithm
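The following sketch illustrates the general blending-with-augmentation idea described in the abstract, using two base models and two anomaly detectors instead of the article's four of each; the model choices, the 50/50 blending split, and the simulated labels are assumptions for illustration only.

import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)
# Simulated examinee-level features (e.g., response and response-time summaries)
# and a binary cheating label; the article works with real assessment data.
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=2000) > 1.5).astype(int)

# Blending: base models are fit on one split, and the meta-model is fit on
# their predictions for a separate blending split.
X_base, X_blend, y_base, y_blend = train_test_split(X, y, test_size=0.5, random_state=2)

base_models = [RandomForestClassifier(random_state=2), LogisticRegression(max_iter=1000)]
base_preds = np.column_stack([m.fit(X_base, y_base).predict_proba(X_blend)[:, 1]
                              for m in base_models])

# Data augmentation: outlier scores from anomaly detectors become extra
# meta-features (two detectors here; the article uses four).
iso = IsolationForest(random_state=2).fit(X_base)
lof = LocalOutlierFactor(novelty=True).fit(X_base)
outlier_scores = np.column_stack([-iso.score_samples(X_blend), -lof.score_samples(X_blend)])

# Meta-model input: base-model predictions plus outlier scores (the article also
# adds the most informative original features chosen by feature selection).
meta_X = np.hstack([base_preds, outlier_scores])
meta_model = LogisticRegression(max_iter=1000).fit(meta_X, y_blend)
print("meta-model training accuracy:", round(meta_model.score(meta_X, y_blend), 3))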
Hong Jiao
Measurement, Statistics and Evaluation
Department of Human Development and Quantitative Methodology
1230C Benjamin Building
University of Maryland
College Park, MD 20742, USA.
hjiao@umd.edu
Considerations in Using XGBoost Models with SHAP Credit Assignment to Calculate Student Growth Percentiles
Steven Tang & Zhen Li
Abstract
The wealth of student data collected in education makes machine learning a promising option for gaining further insight into predicting important outcomes in a student’s education, as machine learning approaches can handle larger and more varied data sources. As a prominent machine learning approach, Gradient Boosted Models (GBMs) have been shown to be a potential alternative to the commonly used quantile-regression (QR) based procedure for estimating student growth percentiles (SGPs). This study discusses aspects of using GBMs to compute growth percentiles by 1) illustrating the effects of different hyperparameters on model fit, 2) comparing GBM and QR-based SGP agreement across different sets of predictors, 3) using an interpretability method, SHAP (SHapley Additive exPlanations), to show the impact of each predictor on the predictions of the GBM model, and 4) analyzing the effect of sample size on GBM prediction accuracy. The dataset in this study comes from mathematics tests for grades 3 to 8 across four years of a state summative assessment.
Keywords: Gradient boosted models, student growth percentiles, SHAP
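One way to realize GBM-based growth percentiles is to fit one quantile model per percentile and locate each student's observed score among the predicted conditional quantiles. The sketch below is an editorial illustration of that idea, not the authors' procedure: it uses XGBoost's quantile objective (which assumes a recent XGBoost release) and then applies SHAP credit assignment to one of the fitted models. The simulated scores, the coarse 5-point percentile grid, and the hyperparameter values are assumptions.

import numpy as np
import shap
import xgboost as xgb

rng = np.random.default_rng(3)
# Simulated data: three prior-year scale scores as predictors, current-year score as target.
n = 5000
priors = rng.normal(500, 50, size=(n, 3))
current = 0.6 * priors[:, -1] + 200 + rng.normal(0, 30, size=n)

# Fit one quantile model per percentile of interest (a coarse grid here;
# operational SGPs are usually defined over all 99 percentiles).
quantiles = [q / 100 for q in range(5, 100, 5)]
models = {q: xgb.XGBRegressor(objective="reg:quantileerror", quantile_alpha=q,
                              n_estimators=200, max_depth=3, learning_rate=0.1,
                              random_state=3).fit(priors, current)
          for q in quantiles}

# A student's growth percentile is approximated by the share of predicted
# conditional quantiles that fall below the observed current score.
preds = np.column_stack([models[q].predict(priors) for q in quantiles])
sgp = (preds < current[:, None]).mean(axis=1) * 100
print("first five approximate SGPs:", sgp[:5].round(0))

# SHAP credit assignment for the median model shows how much each prior score
# contributes to the predicted conditional quantile.
shap_values = shap.TreeExplainer(models[0.5]).shap_values(priors[:100])
print("mean |SHAP| per predictor:", np.abs(shap_values).mean(axis=0).round(2))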
Steven Tang
eMetric
211 N Loop 1604 E, Suite 170, San Antonio
TX 78232, USA.
steven@emetric.net
Automated Scoring of Constructed-Response Items Using Artificial Neural Networks in International Large-scale Assessment
Ji Yoon Jung, Lillian Tyack and Matthias von Davier
Abstract
Although constructed-response items have proven effective in assessing students’ higher-order cognitive skills, their wider use has been limited in international large-scale assessments (ILSAs) due to the resource-intensive nature of, and the challenges associated with, human scoring. This study presents automated scoring based on artificial neural networks (ANNs) as feasible support for, or an alternative to, human scoring. We examined the comparability of human and automated scoring for short constructed-response items from TIMSS 2019. The results showed that human and automated scores were highly correlated on average (r = 0.91). Moreover, this study found that a novel approach of adopting expected scores generated from item response theory (IRT) can be useful for quality control. The ANN-based automated scoring provided equally high or even improved agreement when it was trained on data weighted or filtered based on the IRT-based scores. This study argues that automated scoring has great potential to enable resource-efficient and consistent scoring in place of human scoring and, consequently, to facilitate the greater use of constructed-response items in ILSAs.
Keywords: International large-scale assessment, eTIMSS, constructed-response items, automated scoring, artificial neural networks, natural language processing
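The sketch below illustrates the filtering variant of the IRT-based quality control described in the abstract with a deliberately simple ANN (a small multilayer perceptron on TF-IDF features). The toy responses, the hypothetical IRT expected scores, and the one-point disagreement cutoff are assumptions; the article's actual network architecture and multilingual TIMSS data are not reproduced here.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(4)
# Toy short constructed responses with human scores (0, 1, 2); real training data
# would be human-scored TIMSS responses.
texts = ["the plant needs sunlight to grow", "it grows because of water",
         "photosynthesis uses light water and carbon dioxide", "i dont know",
         "light and water make the plant grow", "because it is green"] * 50
human = np.array([1, 1, 2, 0, 2, 0] * 50)

# Hypothetical IRT-based expected scores; responses whose human score disagrees
# strongly with the IRT expectation are filtered out before training.
irt_expected = human + rng.normal(0, 0.4, size=human.size)
keep = np.abs(human - irt_expected) < 1.0

X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(np.asarray(texts)[keep])
y = human[keep]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=4, stratify=y)

# A small feed-forward network stands in for the article's ANN architecture.
ann = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=4).fit(X_tr, y_tr)
qwk = cohen_kappa_score(y_te, ann.predict(X_te), weights="quadratic")
print("quadratic weighted kappa on the toy hold-out set:", round(qwk, 3))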
Ji Yoon Jung
TIMSS & PIRLS International Study Center at Boston College
140 Commonwealth Ave.
Chestnut Hill, MA 02467, USA.
jiyoon.jung@bc.edu
Mapping Between Hidden States and Features to Validate Automated Essay Scoring Using DeBERTa Models
Christopher Michael Ormerod
Abstract
We introduce a regression-based framework to explore the dependence that global features have on score predictions from pretrained transformer-based language models used for Automated Essay Scoring (AES). We demonstrate that neural networks use approximations of rubric-relevant global features to determine a score prediction. By considering linear models on the hidden states, we can approximate global features and measure their importance to score predictions. This study uses DeBERTa models trained on overall scores and trait-level scores to demonstrate this framework, with a specific focus on convention errors, that is, errors in the use of language encompassing spelling, grammar, and punctuation. This introduces a new form of explainability and provides evidence of validity for language-model-based AES.
Keywords: Automated Essay Scoring, Transformer, DeBERTa Model, Language Models, Explainability, Overall Essay Scores, Trait-Level Essay Scores
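The core move of the framework, fitting a linear model from a transformer's hidden states to a rubric-relevant global feature, can be sketched as follows. The checkpoint microsoft/deberta-v3-small, the mean-pooling step, the hand-counted convention errors, and the ridge probe are illustrative assumptions rather than the article's fine-tuned models and feature annotations.

import numpy as np
import torch
from sklearn.linear_model import Ridge
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint for illustration; the article probes DeBERTa models that were
# fine-tuned on overall and trait-level essay scores.
name = "microsoft/deberta-v3-small"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

essays = ["The weather effect my mood alot.",
          "Reading broadens the mind considerably.",
          "He dont like going too school on mondays.",
          "Careful editing improves any essay."]
# Hypothetical global feature: a per-essay count of convention (spelling, grammar,
# punctuation) errors, which in practice comes from an external annotation tool.
convention_errors = np.array([3, 0, 4, 0])

with torch.no_grad():
    enc = tokenizer(essays, padding=True, truncation=True, return_tensors="pt")
    out = model(**enc)
    # Mean-pool the final hidden states into one vector per essay.
    mask = enc["attention_mask"].unsqueeze(-1)
    hidden = (out.last_hidden_state * mask).sum(1) / mask.sum(1)

# Linear probe: how well do the hidden states approximate the global feature?
probe = Ridge(alpha=1.0).fit(hidden.numpy(), convention_errors)
print("probe R^2 on the toy sample:", round(probe.score(hidden.numpy(), convention_errors), 3))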
Christopher Michael Ormerod
Cambium Assessment, Inc.
1000 Thomas Jefferson St. NW
Washington, D.C. 20007, USA.
christopher.ormerod@cambiumassessment.com
Psychological Test and Assessment Modeling
Volume 64 · 2022 · Issue 4
Pabst, 2022
ISSN 2190-0493 (Print)
ISSN 2190-0507 (Internet)