


 



Psychological Test and Assessment Modeling

 

Published under Creative Commons: CC-BY-NC Licence


2023-1

CONTENTS

Model Selection for Latent Dirichlet Allocation In Assessment Data
Constanza Mardones-Segovia, Jordan M. Wheeler, Hye-Jeong Choi, Shiyu Wang, Allan S. Cohen
.PDF of the full article


Predicting Oral Reading Fluency Scores by Between-Word Silence Times Using Natural Language Processing and Random Forest Algorithm
Yusuf Kara, Akihito Kamata, Emrah Emre Ozkeskin, Xin Qiao and Joseph F. T. Nese
.PDF of the full article


Automated Distractor Generation for Fill-in-the-Blank Items Using a Prompt-Based Learning Approach
Jiyun Zu, Ikkyu Choi and Jiangang Hao
.PDF of the full article


Detecting Atypical Test-Taking Behavior with Behavior Prediction Using LSTM
Steven Tang, Siju Samuel and Zhen Li
.PDF of the full article


Detection of AI-generated Essays in Writing Assessments
Duanli Yan, Michael Fauss, Jiangang Hao and Wenju Cui
.PDF of the full article


Predicting Problem-Solving Proficiency with Multiclass Hierarchical Classification on Process Data: A Machine Learning Approach
Qiwei He, Qingzhou Shi, Elizabeth L. Tighe
.PDF of the full article


Machine Learning and Deep Learning in Assessment
Hong Jiao, Qiwei He, Lihua Yao
.PDF of the full article


 


Model Selection for Latent Dirichlet Allocation In Assessment Data 
Constanza Mardones-Segovia, Jordan M. Wheeler, Hye-Jeong Choi, Shiyu Wang, Allan S. Cohen 

Abstract
Latent Dirichlet Allocation (LDA) is a probabilistic topic model that has been used as a tool to detect the latent thematic structure in a body of text. In the context of classroom testing, LDA has been used to detect the latent themes in examinees’ responses to constructed-response (CR) items. A growing body of evidence indicates that the latent themes detected by LDA reflect the kinds of reasoning examinees use in their responses to CR items. The use of the information from a model such as LDA requires that the model fit the data. To this end, a number of different model selection indices have been used with LDA to determine the best-fitting model. There does not yet appear to be clear evidence, however, as to which of these indices is most accurate under conditions common with measurement data. In this study, we evaluated the performance of several model selection indices, including similarity measures and perplexity computed with 5-fold cross-validation. Their performance was compared across two commonly used algorithms for estimating the LDA model, Gibbs sampling and variational expectation-maximization. Data were simulated with different numbers of topics, documents, average lengths of answers, and numbers of unique words typical of practical measurement conditions. Results suggested that the average cosine similarity and perplexity with 5-fold cross-validation were the most accurate for model selection over the conditions simulated in this study.

Keywords: latent Dirichlet allocation, model selection, perplexity, k-fold cross-validation, similarity measures, Gibbs sampling, variational expectation maximization
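
As a concrete illustration of the selection procedure described in the abstract (a minimal sketch, not the authors' implementation), the following Python snippet compares candidate numbers of topics using perplexity under 5-fold cross-validation and the average pairwise cosine similarity between topic-word distributions. It uses scikit-learn's variational-EM estimator of LDA; the toy corpus of constructed responses and all settings are placeholders.

# Sketch: choosing the number of LDA topics with 5-fold cross-validated
# perplexity and the average pairwise cosine similarity of topic-word vectors.
# Illustrative only; the corpus below is a toy stand-in for CR responses.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import KFold

responses = [
    "the force equals mass times acceleration",
    "acceleration depends on net force and mass",
    "plants use sunlight to make food",
    "photosynthesis turns light into chemical energy",
] * 25  # toy constructed-response corpus

X = CountVectorizer(stop_words="english").fit_transform(responses)

for k in (2, 3, 4, 5):
    # 5-fold cross-validated perplexity (lower is better)
    perps = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        lda = LatentDirichletAllocation(n_components=k, random_state=0)
        lda.fit(X[train_idx])
        perps.append(lda.perplexity(X[test_idx]))

    # average cosine similarity between topic-word vectors (lower = more distinct topics)
    topics = LatentDirichletAllocation(n_components=k, random_state=0).fit(X).components_
    sim = cosine_similarity(topics)
    avg_sim = sim[np.triu_indices(k, k=1)].mean()

    print(f"k={k}: mean perplexity={np.mean(perps):.1f}, avg cosine similarity={avg_sim:.3f}")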


Constanza Mardones-Segovia 
125P Aderhold Hall 
Department of Educational Psychology 
Mary Frances Early
The University of Georgia
Athens, GA 30602
E-mail: cam04214@uga.edu


 

Predicting Oral Reading Fluency Scores by Between-Word Silence Times Using Natural Language Processing and Random Forest Algorithm 
Yusuf Kara, Akihito Kamata, Emrah Emre Ozkeskin, Xin Qiao, and Joseph F. T. Nese

Abstract
The measurement of oral reading fluency (ORF) is an important part of screening assessments for identifying students at risk of poor reading outcomes. ORF is a complex construct that involves speed, accuracy, and coherent reading abilities including prosody. This study aimed at using between-word-level silence times collected through a computer-based reading assessment system to predict words read correctly per minute (WCPM) scores of young readers as the measure of their ORF levels. Natural language processing (NLP) was utilized to analyze reading passages to inform the locations of syntactically dependent words, namely, meaningful word chunks. Then, silence times before and after the NLP-informed word chunks were used to predict WCPM scores via a random forest algorithm. The results revealed that students’ average relative silence times before and after specific word chunks were good predictors of WCPM scores. Also, the model was able to explain more than half of the variation in WCPM scores by using the derived silence times and students’ grade levels as predictors.

Keywords: oral reading fluency, machine learning, natural language processing, random forests
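
To make the prediction setup concrete, here is a minimal sketch (not the authors' pipeline) of regressing WCPM on silence-time features with a random forest. The feature names, the synthetic data, and the grade-level predictor are placeholders standing in for the NLP-derived silence times described above.

# Sketch: predicting WCPM from between-word silence-time features with a
# random forest. Data and feature names are hypothetical placeholders.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300
features = pd.DataFrame({
    "mean_silence_before_chunk": rng.gamma(2.0, 0.15, n),  # seconds
    "mean_silence_after_chunk": rng.gamma(2.0, 0.12, n),   # seconds
    "grade_level": rng.integers(2, 5, n),
})
# synthetic target: longer silences lower WCPM, higher grade raises it
wcpm = (140
        - 80 * features["mean_silence_before_chunk"]
        - 60 * features["mean_silence_after_chunk"]
        + 15 * features["grade_level"]
        + rng.normal(0, 8, n))

model = RandomForestRegressor(n_estimators=500, random_state=0)
r2 = cross_val_score(model, features, wcpm, cv=5, scoring="r2")
print("cross-validated R^2:", r2.mean().round(2))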


Yusuf Kara
Center on Research and Evaluation
Southern Methodist University
6116 N. Central Expressway, Suite 400
Dallas, TX 75206
E-mail: ykara@smu.edu


 

Automated Distractor Generation for Fill-in-the-Blank Items Using a Prompt-Based Learning Approach 
Jiyun Zu, Ikkyu Choi, Jiangang Hao 

Abstract
There are heavy demands for large and continuous supplies of new items in language testing. Automated item generation (AIG), in which computerized algorithms are used to create test items, can potentially increase the efficiency of new item development to serve this demand. A challenge for multiple-choice items is to write effective distractors, that is, options that are incorrect yet attractive (Haladyna, 2004). We propose a prompt-based learning approach (Liu et al., 2021) for automatically generating distractors for one of the most common language-assessment item types, fill-in-the-blank vocabulary items. The proposed method treats distractor generation as a natural language generation task and utilizes a transformer-based, pretrained language model (Radford et al., 2019) fine-tuned to ensure appropriate and useful output. The fine-tuning process adopted a prompt-based learning approach, which has been found to be particularly effective in small-sample scenarios (Gao et al., 2021). We illustrate this approach on a specific item type from a standardized English language proficiency assessment. Specifically, we study the effects of different prompts and demonstrate the effectiveness of the proposed prompt-based learning approach by comparing features of generated distractors with those from a rule-based approach.

Keywords: Automated distractor generation, automated item generation, natural language processing, deep learning language models, prompt-based learning
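
For readers who want a feel for the generation step, the sketch below uses an off-the-shelf GPT-2 from the Hugging Face transformers library to propose distractor candidates for a hypothetical fill-in-the-blank stem. The prompt template and filtering rules are assumptions for illustration, and the prompt-based fine-tuning that the paper relies on is omitted.

# Sketch: prompting a pretrained language model for distractor candidates.
# The stem, key, and prompt template are hypothetical; no fine-tuning here.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

stem = "The committee decided to ____ the meeting until next week."
key = "postpone"
prompt = f"Fill in the blank with a single word: {stem}\nAnswer:"

outputs = generator(prompt, max_new_tokens=3, num_return_sequences=10,
                    do_sample=True, top_k=50)

candidates = set()
for out in outputs:
    completion = out["generated_text"][len(prompt):].strip().split()
    if completion:
        word = completion[0].strip(".,;:").lower()
        if word.isalpha() and word != key:   # drop the key and non-words
            candidates.add(word)

print("candidate distractors:", sorted(candidates))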


Jiyun Zu
Educational Testing Service
660 Rosedale Road
Princeton, NJ 08540
E-mail: jzu@ets.org


 


Detecting Atypical Test-Taking Behavior with Behavior Prediction Using LSTM 
Steven Tang, Siju Samuel, Zhen Li 

Abstract
A clickstream is a precise log of every user action taken in a software application. Clickstreams recorded during test-taking experiences can be analyzed for behavior patterns. In this paper, we introduce a statistic, the Model Agreement Index (MAI), that quantifies how typical or atypical the behaviors in an examinee’s clickstream are relative to a sequence model of behavior; this model is trained to emulate student behaviors using a Long Short-Term Memory (LSTM) network. MAI is intended to be used as a simple statistic to detect instances of atypical user behaviors so that further analysis can be conducted to identify whether the atypical behaviors need to be mitigated in the future. One of the empirical results from this study is that certain examinees with low MAI scores were floundering on the opening and closing of certain tool widgets. This floundering wasted time for the examinees, and the discovery of this phenomenon can enable an improvement in the test user interface, demonstrating a good use for the proposed methodology. The study details the processes needed to train the LSTM along with a comparison between the LSTM and a “most common next action” baseline model. Additionally, correlations of MAI with other indicators, such as answer changing, are explored. The use of MAI to identify test user interface issues is demonstrated. Real data from a statewide testing program are used in this study.

Keywords: test security, atypical behavior detection, clickstream analysis, behavior modeling, scalable detection methods
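
As a rough sketch of the modeling idea (not the paper's exact MAI definition), the PyTorch code below trains a small LSTM to predict the next action in a clickstream and scores a sequence by the share of its actions that match the model's top prediction. The action vocabulary and the toy sequences are made up.

# Sketch: LSTM next-action model over clickstream tokens plus a simple
# agreement index; an illustrative reading of MAI, not the paper's formula.
import torch
import torch.nn as nn

ACTIONS = ["open_item", "select_option", "change_answer",
           "open_tool", "close_tool", "next_item"]
vocab = {a: i for i, a in enumerate(ACTIONS)}

class NextActionLSTM(nn.Module):
    def __init__(self, n_actions, emb=16, hidden=32):
        super().__init__()
        self.emb = nn.Embedding(n_actions, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_actions)

    def forward(self, x):                      # x: (batch, seq_len)
        h, _ = self.lstm(self.emb(x))
        return self.out(h)                     # next-action logits per step

def agreement_index(model, seq):
    """Share of actions correctly predicted from their preceding context."""
    ids = torch.tensor([[vocab[a] for a in seq]])
    with torch.no_grad():
        logits = model(ids[:, :-1])            # predict positions 1..n-1
    pred = logits.argmax(dim=-1)
    return (pred == ids[:, 1:]).float().mean().item()

# toy training data: a typical sequence repeats a common answering pattern
typical = ["open_item", "select_option", "next_item"] * 5
model = NextActionLSTM(len(ACTIONS))
opt = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
ids = torch.tensor([[vocab[a] for a in typical]])
for _ in range(200):
    logits = model(ids[:, :-1])
    loss = loss_fn(logits.reshape(-1, len(ACTIONS)), ids[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()

atypical = ["open_item", "open_tool", "close_tool",
            "open_tool", "close_tool", "next_item"]
print("typical agreement score:", round(agreement_index(model, typical), 2))
print("atypical agreement score:", round(agreement_index(model, atypical), 2))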


E-mail: steven@emetric.net


 


Detection of AI-generated Essays in Writing Assessments 
Duanli Yan, Michael Fauss, Jiangang Hao, Wenju Cui 

Abstract
The recent advance in AI technology has led to tremendous progress in automated text generation. Powerful language models, such as GPT-3 and ChatGPT from OpenAI and BARD from Google, can generate high-quality essays when provided with a simple prompt. This paper shows how AI-generated essays are similar to or different from human-written essays based on a set of typical prompts for a sample from a large-scale assessment. We also introduce two classifiers that can detect AI-generated essays with a high accuracy of over 95%. The goal of this study is to encourage researchers to think about and develop methodologies that address these issues and ensure the quality of writing assessments.

Keywords: AI technology, GPT-3/ChatGPT, Writing assessment, Test security
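
As a baseline illustration of the detection task (not the paper's two classifiers), the snippet below trains a TF-IDF plus logistic-regression detector on toy human-written and AI-generated essays and reports cross-validated accuracy; the example essays are invented placeholders.

# Sketch: a simple detector separating human-written from AI-generated essays.
# Essays below are toy stand-ins; real training data would come from an assessment.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

human_essays = ["I think the town should build the new park because kids need somewhere to play"] * 20
ai_essays = ["The proposed park presents numerous benefits for the community, including recreation"] * 20

texts = human_essays + ai_essays
labels = [0] * len(human_essays) + [1] * len(ai_essays)   # 1 = AI-generated

detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                         LogisticRegression(max_iter=1000))
acc = cross_val_score(detector, texts, labels, cv=5, scoring="accuracy")
print("cross-validated accuracy:", acc.mean().round(2))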


Duanli Yan
Educational Testing Service
Princeton, NJ 08541
E-mail: dyan@ets.org


 

Predicting Problem-Solving Proficiency with Multiclass Hierarchical Classification on Process Data: A Machine Learning Approach 
Qiwei He, Qingzhou Shi, Elizabeth L. Tighe 

Abstract
Increased use of computer-based assessments has facilitated data collection processes that capture both response product data (i.e., correct and incorrect) and response process data (e.g., time-stamped action sequences). Evidence suggests a strong relationship between respondents’ correct/incorrect responses and their problem-solving proficiency scores. However, few studies have reported the predictability of fine-grained process information on respondents’ problem-solving proficiency levels and the degree of granularity needed for accurate prediction. This study uses process data from interactive problem-solving items in the Programme for the International Assessment of Adult Competencies (PIAAC) to predict proficiency levels with hierarchical classification methods. Specifically, we extracted aggregate-level process variables and item-specific sequences of problem-solving strategies. Two machine learning methods, random forest and support vector machine, were examined in combination with two multiclass classification approaches (i.e., flat classification and hierarchical classification). Using seven problem-solving items from the U.S. PIAAC process data sample, we found that the hierarchical approach, with either machine learning method, performed moderately better than the flat approach in proficiency level prediction. This study demonstrates the feasibility of using process variables to classify respondents by problem-solving proficiency levels and thus supports the development of tailored instructions for adults at different levels.

Keywords: multiclass hierarchical classification, flat classification, machine learning, process data, problem-solving proficiency, PIAAC
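
To illustrate the contrast between the two approaches (a sketch with synthetic data, not the study's PIAAC variables), the code below fits one random forest over all proficiency levels (flat) and compares it with a coarse low/high classifier followed by a level classifier within each branch (hierarchical). The feature set and the two-stage level structure are assumptions for illustration.

# Sketch: flat vs. hierarchical multiclass classification of proficiency levels
# from synthetic process-data features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
X = np.column_stack([
    rng.normal(0, 1, n),          # e.g., standardized total time on item
    rng.integers(1, 30, n),       # e.g., number of actions
    rng.uniform(0, 1, n),         # e.g., share of efficient strategy sequences
])
# synthetic proficiency levels 0..3 loosely driven by the features
levels = np.clip((2 * X[:, 2] + 0.05 * X[:, 1] + rng.normal(0, 0.4, n)).round(), 0, 3).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, levels, random_state=0)

# flat approach: one classifier over all four levels
flat = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("flat accuracy:", round(accuracy_score(y_te, flat.predict(X_te)), 2))

# hierarchical approach: coarse split (levels 0-1 vs 2-3), then a classifier per branch
coarse = RandomForestClassifier(random_state=0).fit(X_tr, (y_tr >= 2).astype(int))
low_clf = RandomForestClassifier(random_state=0).fit(X_tr[y_tr < 2], y_tr[y_tr < 2])
high_clf = RandomForestClassifier(random_state=0).fit(X_tr[y_tr >= 2], y_tr[y_tr >= 2])

is_high = coarse.predict(X_te).astype(bool)
pred = np.empty_like(y_te)
if (~is_high).any():
    pred[~is_high] = low_clf.predict(X_te[~is_high])
if is_high.any():
    pred[is_high] = high_clf.predict(X_te[is_high])
print("hierarchical accuracy:", round(accuracy_score(y_te, pred), 2))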


Qiwei He
Educational Testing Service 
660 Rosedale Road
Princeton, NJ 08541
E-mail: qhe@ets.org


 

Machine Learning and Deep Learning in Assessment 
Hong Jiao, Qiwei He, Lihua Yao

 



Psychological Test and Assessment Modeling
Volume 65 · 2023 · Issue 1

Pabst, 2023
ISSN 2190-0493 (Print)
ISSN 2190-0507 (Internet)





