Analysis of Differential Item Functioning in PROMIS® Pediatric and Adult Measures between Adolescents and Young Adults with Special Health Care Needs
Dan V. Blalock, Li Lin, Mian Wang, David Thissen, Darren A. DeWalt, I-Chan Huang & Bryce B. Reeve

Detecting Differential Item Functioning of Polytomous Items in Small Samples: Comparison of MIMIC with a Pure Anchor and MIMIC-Interaction Methods
Gavin T. L. Brown, Maryam Alqassab, Okan Bulut & Jiaying Xiao

Rater-Mediated Listening Assessment: A Facets Modeling Approach to the Analysis of Raters’ Severity and Accuracy When Scoring Responses to Short-Answer Questions
Thomas Eckes

Construction of Psychometrically Sound Written University Exams
Andreas Frey, Christian Spoden & Sebastian Born

Using Diagnostic Classification Models to Obtain Subskill Information and Explore its Relationship with Total Scores: the Case of the Michigan English Test
Ren Liu

Take your time: Invariance of time-on-task in problem solving tasks across expertise levels
Matthias Stadler, Anika Radkowitsch, Ralf Schmidmaier, Martin Fischer & Frank Fischer

Analysis of Differential Item Functioning in PROMIS® Pediatric and Adult Measures between Adolescents and Young Adults with Special Health Care Needs

Abstract:
Purpose: Many research studies seek to assess health outcomes among patients across the adolescent-adult age groups. This age group distinction often leads to independent scale creation and validation in self-report measures, such as the Patient-Reported Outcomes Measurement Information System® (PROMIS®) health-related quality of life (HRQOL) measures. Research studies would benefit from the ability to use a single measure across these age groups.

Method: This study is a secondary data analysis of adolescents (age 14-17) and young adults (age 18-20) with special healthcare needs (n = 874). Participants completed short forms of both PROMIS pediatric and adult measures of physical functioning, pain, fatigue, depression, social health, anxiety, and anger. Differential item functioning (DIF) across age groups was examined using Wald tests for graded response model (GRM) item parameters.

Results: No DIF across age group was observed for any item in any of the pediatric or adult short form measures.

Conclusion: These results support the flexible use of pediatric and adult PROMIS HRQOL scales for adolescents and young adults age 14-20.

Keywords: PROMIS, Item Response Theory, Differential Item Functioning, Health-Related Quality of Life, Pediatric

Dan V. Blalock, PhD
Durham Center of Innovation to Accelerate
Discovery and Practice Transformation (ADAPT)
Center for Health Services Research in
Primary Care
Durham Veterans Affairs Health Care System
411 West Chapel Hill Street, Suite 600
Durham, NC 27705
daniel.blalock@duke.edu

Detecting Differential Item Functioning of Polytomous Items in Small Samples: Comparison of MIMIC with a Pure Anchor and MIMIC-Interaction Methods

Abstract:
Differential item functioning (DIF) may result from either item bias or a real difference, depending on whether the source of DIF is construct-irrelevant or construct-relevant. It is relatively more challenging to conduct DIF studies when the sample size is small (i.e., < 200), items follow polytomous scoring (e.g., Likert scales) instead of dichotomous scoring, and psychological grouping variables are used instead of demographic grouping variables (e.g., gender). However, the multiple indicators-multiple causes (MIMIC) approach can be a promising solution to address the aforementioned challenges in DIF studies. This study aims to investigate the performance of two MIMIC methods, namely MIMIC with a pure anchor (MIMIC-PA) and MIMIC-interaction methods, for DIF detection in the Student Conceptions of Assessment inventory based on a psychological grouping variable derived from students’ self-efficacy and subject interest. The results show that MIMIC-PA identified five mathematics and eight reading items with large DIF in the four factors. MIMIC-interaction showed that no items had uniform DIF, while four items had non-uniform DIF. Items with statistically significant DIF were aligned with the known effects of self-efficacy and subject interest on academic achievement, supporting the claim that observed DIF reflects item impact rather than bias. The study's implications for practice and directions for future research with the MIMIC approach are discussed.

Keywords: Differential item functioning, polytomous scales, small sample, MIMIC

Professor Gavin T. L. Brown, PhD
School of Learning, Development and
Professional Practice
Faculty of Education
The University of Auckland
Private Bag 92019
Auckland, 1142
New Zealand
gt.brown@auckland.ac.nz

Rater-Mediated Listening Assessment: A Facets Modeling Approach to the Analysis of Raters’ Severity and Accuracy When Scoring Responses to Short-Answer Questions

Abstract:
Short-answer questions are a popular item format in listening tests. Examinees listen to spoken input and demonstrate comprehension by responding to questions about the information contained in the input. Usually, human raters or markers score examinee responses as correct or incorrect following a scoring guide. Considering this procedure an instance of the more general class of rater-mediated language assessment, the present research adopted a many-facet Rasch measurement approach to provide a detailed look at the psychometric quality of the listening scores. Nine operational raters and one expert rater scored responses of 200 examinees to 15 short-answer questions included in the listening section of a standardized language test. The findings revealed that (a) raters differed significantly in their severity measures, albeit to a lesser extent than typically observed in writing or speaking assessments, (b) raters did not show evidence of differential severity across short-answer questions, (c) raters evidenced an overall high level of scoring accuracy, but also showed non-negligible differences in their accuracy measures, and (d) raters did not show evidence of differential accuracy across short-answer questions. Implications for the validity and fairness of using short-answer questions in listening tests as well as for rater training and monitoring purposes are discussed.

Keywords: rater-mediated assessment, listening assessment, short-answer questions, facets models, rater severity, rater accuracy

Thomas Eckes, PhD
TestDaF Institute
University of Bochum
Universitätsstr. 134
44799 Bochum
Germany
thomas.eckes@gast.de

Construction of Psychometrically Sound Written University Exams

Abstract:
Written university exams typically used at German-speaking universities often do not represent the learning objectives of the respective course appropriately. Moreover, they do not allow for criterion-referenced inferences regarding the degree to which the learning objectives have been met, and they are statistically unconnected across different test cycles. To overcome these shortcomings, we propose applying a combination of established methods from the fields of educational measurement and psychometrics to written university exams. The key elements of the proposed procedure are (a) the definition of the content domain of interest in relation to the learning objectives of the course, (b) the specification of an assessment framework, (c) the operationalization of the assessment framework with test items, (d) the standardized administration of the exam, (e) the scaling of gathered responses with item response theory models, and (f) the setting of grade levels with standard-setting procedures. Empirical results obtained from six test cycles of a real university exam at the end of an introductory course on research methods in education show that this procedure can successfully be applied in a typical university setting. It was possible to constitute a reliable and valid scale and maintain it across the six test cycles based on a common item nonequivalent group design. The comparison of the observed student competence distributions across the six years gave interesting insights that can be used to optimize the course.

Keywords: item response theory, higher education, testing, measurement

Andreas Frey, PhD
Goethe University Frankfurt
Theodor-W.-Adorno-Platz 6
60323 Frankfurt
Germany
frey@psych.uni-frankfurt.de

Using Diagnostic Classification Models to Obtain Subskill Information and Explore its Relationship with Total Scores: the Case of the Michigan English Test

Abstract:
Subskills are often identified to develop items on a test. Investigating the relationship between examinees’ overall scores and their performance on subskills is often of interest in educational and psychological tests. The purpose of this study is to explore subskill information on the Michigan English Test (MET) using the diagnostic classification model framework. Through subskill identification, model fitting and selection, an appropriate diagnostic classification model was chosen for answering three research questions regarding, namely, the subskill mastery sequence, the relationship between subskill mastery and overall scores, and the relationship between subskill mastery and the Common European Framework of Reference (CEFR) levels. Findings from this study provide additional validity evidence for the interpretation and use of the MET scores. They could also be used by content experts to understand more about the subskills, and by the MET item/test development professionals for item revision and/or form assembly.

Keywords: diagnostic classification model, subskill mastery, attribute hierarchy, language assessment, Michigan English Test, CEFR levels

Prof. Ren Liu, PhD
Quantitative Methods, Measurement, and Statistics
University of California, Merced
Merced, CA 95343
rliu45@ucmerced.edu

Take your time: Invariance of time-on-task in problem solving tasks across expertise levels

Abstract:
Computer-based tasks provide a vast amount of behavioral data that can be analyzed in addition to the indicators of final performance. One of the most commonly investigated indicators is time-on-task (ToT), which is understood as the time from task onset to task completion. Studies often assume a unidimensional measurement model with one latent ToT variable that is sufficient to capture all response time covariance across items. However, behavioral indicators such as ToT are seldom submitted to the same psychometric rigor as more traditional indicators. In this brief report, we provide first results on the invariance of ToT in problem-solving tasks across different levels of expertise. A total of 98 medical students and physicians participated in the study quasi-experimentally grouped into three conditions (low, intermediate, and high) based on their prior knowledge. All participants solved five medical diagnostic problem-solving tasks in a simulation-based learning environment. While the overall ToT seems to decrease with level of expertise, the general pattern across tasks seems to be similar for all three groups. The results indicate strong measurement invariance of ToT across different levels of expertise and support interpreting group differences on a latent ToT factor.

Keywords: Time-on-task, Behavior, Invariance, Expertise, Assessment

Matthias Stadler, PhD
Leopoldstr. 13
80802 München
Germany
Matthias.Stadler@lmu.de

Psychological Test and Assessment Modeling
Volume 62 · 2020 · Issue 4
Pabst, 2020
ISSN 2190-0493 (Print)
ISSN 2190-0507 (Internet)
