**Psychological Test and Assessment Modeling, Volume 55, 2013 (1)**

A systematic review of the methodology for person fit research in Item Response Theory: Lessons about generalizability of inferences from the design of simulation studies

*André A. Rupp*

Robustness and power of the parametric t test and the nonparametric Wilcoxon test under non-independence of observations

*Wolfgang Wiedermann & Alexander von Eye*

The position effect in tests with a time limit: the consideration of interruption and working speed

*Karl Schweizer & Xuezhu Ren*

**Special topic:**

Current issues in educational and psychological measurement: Design, calibration, and adaptive testing - Part II

Guest editors: Ulf Kröhne & Andreas Frey

Guest editorial

Current issues in educational and psychological measurement: Design, calibration, and adaptive testing - Part II

*Ulf Kröhne & Andreas Frey*

Effect of item order on item calibration and item bank construction for computer adaptive tests

*Otto B. Walter & Matthias Rose*

Too hard, too easy, or just right? The relationship between effort or boredom and ability-difficulty fit

*Regine Asseburg & Andreas Frey*

The sequential probability ratio test for multidimensional adaptive testing with between-item multidimensionality

*Nicki-Nils Seitz & Andreas Frey*

**A systematic review of the methodology for person fit research in Item Response Theory: Lessons about generalizability of inferences from the design of simulation studies**

*André A. Rupp*

*Abstract*

This paper is a systematic review of the methodology for person fit research targeted specifically at methodologists in training. I analyze the ways in which researchers in the area of person fit have conducted simulation studies for parametric and nonparametric unidimensional IRT models since the seminal review paper by Meijer and Sijtsma (2001). I specifically review how researchers have operationalized different types of aberrant responding for particular testing conditions in order to compare these simulation design characteristics with features of the real-life testing situations for which person fit analyses are officially reported. I discuss the alignment between the theoretical and practical work and the implications for future simulation work and guidelines for best practice.

*Key words:* person fit, systematic review, aberrant responding, item response theory, simulation study, generalizability, experimental design

*André A. Rupp*

Associate Professor

HDQM Department

EDMS Program

University of Maryland

1230-A Benjamin Building, College Park

MD 20742, USA

*ruppandr@umd.edu*

**Robustness and power of the parametric t test and the nonparametric Wilcoxon test under non-independence of observations**

*Wolfgang Wiedermann & Alexander von Eye*

*Abstract*

Much previous work has dealt with the robustness of parametric significance tests against non-normality, heteroscedasticity, or a combination of both. The behavior of tests under violations of the independence assumption has received comparatively less attention. Therefore, in applications, researchers may overlook that the robustness and power properties of tests can vary with the sign and the magnitude of the correlation between samples. The common paired t test is known to be less powerful in cases of negative between-group correlations. In this case, Bortz and Schuster (2010) recommend the application of the nonparametric Wilcoxon test. Using Monte Carlo simulations, we analyzed the behavior of the t test and the Wilcoxon test for the one- and two-sample problem under various degrees of positive and negative correlation, population distributions, sample sizes, and true differences in location. It is shown that even minimal departures from independence heavily affect Type I error rates of the two-sample tests. In addition, results for the one-sample tests clearly suggest that the sign of the underlying correlation cannot be used as a basis to decide whether to use the t test or the Wilcoxon test. Both tests show a dramatic power loss when samples are negatively correlated. Finally, in these cases, the well-known power advantage of the Wilcoxon test diminishes when distributions are skewed and samples are small.

*Key words:* robustness, power, independence assumption, t test, Wilcoxon test

*Wolfgang Wiedermann, PhD*

University of Vienna

Unit of Research Methods

Liebiggasse 5

A-1010 Vienna, Austria

*wolfgang.wiedermann@univie.ac.at*
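
The kind of Monte Carlo check described in the abstract above can be illustrated with a minimal sketch. It is not the authors' simulation design: the sample size, number of replications, correlation values, normal populations, and all function names are assumptions chosen for illustration, and only the two-sample t test is covered. The sketch draws two samples with no true difference in location but with correlation `rho`, applies the pooled-variance t test as if the samples were independent, and records how often the null hypothesis is rejected.

```python
import math
import random

# Two-sided 5% critical value of Student's t with 98 degrees of freedom
# (n = 50 per group). Hard-coded so that the sketch needs only the stdlib.
T_CRIT_DF98 = 1.984


def correlated_pair(rho, rng):
    """Draw (x, y) from a standard bivariate normal with correlation rho."""
    x = rng.gauss(0.0, 1.0)
    e = rng.gauss(0.0, 1.0)
    y = rho * x + math.sqrt(1.0 - rho * rho) * e
    return x, y


def two_sample_t(xs, ys):
    """Pooled-variance t statistic, which assumes independent samples."""
    n, m = len(xs), len(ys)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / m
    ss_x = sum((v - mean_x) ** 2 for v in xs)
    ss_y = sum((v - mean_y) ** 2 for v in ys)
    pooled_var = (ss_x + ss_y) / (n + m - 2)
    return (mean_x - mean_y) / math.sqrt(pooled_var * (1.0 / n + 1.0 / m))


def type_one_error(rho, n=50, reps=2000, seed=1):
    """Share of replications in which a true null hypothesis is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        pairs = [correlated_pair(rho, rng) for _ in range(n)]
        xs = [p[0] for p in pairs]
        ys = [p[1] for p in pairs]
        if abs(two_sample_t(xs, ys)) > T_CRIT_DF98:
            rejections += 1
    return rejections / reps


# Empirical Type I error rates for negative, zero, and positive correlation.
rates = {rho: type_one_error(rho) for rho in (-0.5, 0.0, 0.5)}
for rho, rate in rates.items():
    print(f"rho = {rho:+.1f}: empirical Type I error = {rate:.3f}")
```

Under these assumptions the rejection rate should lie near the nominal 5% for uncorrelated samples, above it for negatively correlated samples, and below it for positively correlated samples, because the test's standard error ignores the covariance between the two samples.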

**The position effect in tests with a time limit: the consideration of interruption and working speed**

*Karl Schweizer & Xuezhu Ren*

*Abstract*

The position effect is a possible source of impairment of the structural validity of a test concerning model fit. In tests with a time limit, the situation is further complicated by the decreasing number of participants who complete the last few items of the test. Therefore, it is assumed that an appropriate representation of the position effect must additionally consider interruption due to the time limit and the effect of working speed. Interruption can be represented by the same latent variable as the position effect, whereas the contribution of working speed requires another one. Confirmatory factor models including a representation of the position effect as a linear, quadratic, or logarithmic increase were compared with models additionally considering interruption as a logistic decrease or simply as immediate interruption. Furthermore, there were models additionally considering working speed. In a sample of 305 participants, the investigation of probability-based covariances showed that modeling interruption and working speed substantially improved model fit. The best-fitting model was characterized by a linearly increasing representation of the position effect combined with a logistic decrease in the more difficult items and a contribution due to working speed.

*Key words:* position effect, confirmatory factor analysis, tau-equivalent model, method effect

*Karl Schweizer, PhD*

Department of Psychology

Goethe University Frankfurt

Mertonstr 17

60054 Frankfurt a. M., Germany

*K.Schweizer@psych.uni-frankfurt.de*

**Effect of item order on item calibration and item bank construction for computer adaptive tests**

*Otto B. Walter & Matthias Rose*

*Abstract*

Item banks are typically constructed from responses to items that are presented in one fixed order; therefore, order effects between subsequent items may violate the independence assumption. We investigated the effect of item order on item bank construction, item calibration, and ability estimation. Fifteen polytomous items similar to items used in a pilot version of a computer adaptive test for anxiety (Walter et al., 2005; Walter et al., 2007) were presented either in one fixed order or in an order randomly generated for each respondent. A total of n = 520 out-patients participated in the study. Item calibration (Generalized Partial Credit Model) yielded only small differences in slope and location parameters. Simulated test runs using either the full item bank or an adaptive algorithm produced very similar ability estimates (expected a posteriori estimation). These results indicate that item order had little impact on item calibration and ability estimation for this item set.

*Key words:* item response theory; computer adaptive testing; local independence; item bank construction

*Otto B. Walter, PhD*

Universität Bielefeld

Fakultät für Psychologie und Sportwissenschaft

AE Psychologische Methodenlehre und Qualitätssicherung

Postfach 100131

33501 Bielefeld, Germany

*otto.walter@charite.de*

**Too hard, too easy, or just right? The relationship between effort or boredom and ability-difficulty fit**

*Regine Asseburg & Andreas Frey*

*Abstract*

Usually, it is assumed that achievement tests measure maximum performance. However, test performance is not only associated with ability but also with motivational and emotional aspects of test-taking. These aspects are influenced by individual success probability, which in turn depends on the ratio of individual ability to item difficulty (ability-difficulty fit). The impact of ability-difficulty fit on test-taking motivation and emotion is unknown and rarely considered when interpreting test results.

N = 9,452 ninth-graders in Germany (PISA 2006) completed a mathematics test and a questionnaire on test-taking effort (motivation) and boredom/daydreaming (emotion). Overall, mean item difficulty exceeded individual ability. Ability-difficulty fit was positively and linearly related to effort and boredom/daydreaming.

The results suggest that low ability students may not show maximum performance in a sequential achievement test. Thus, test score interpretation for this subsample may be invalid. As a solution to this problem the application of computerized adaptive testing is discussed.

*Key words:* achievement test, test-taking, effort, boredom, performance

*Regine Asseburg, PhD*

Leibniz Institute for Science and Mathematics

Education at the University of Kiel (IPN)

Germany

*asseburg@ipn.uni-kiel.de*

**The sequential probability ratio test for multidimensional adaptive testing with between-item multidimensionality**

*Nicki-Nils Seitz & Andreas Frey*

*Abstract*

It is examined whether the unidimensional Sequential Probability Ratio Test (SPRT) can be productively combined with multidimensional adaptive testing (MAT). With a simulation study, it is investigated whether this combination results in more accurate simultaneous classifications on two or three dimensions compared to several instances of unidimensional adaptive testing (UCAT) in combination with SPRT. The number of cut scores and the correlation between the dimensions measured were varied. The average test length was mainly influenced by the number of cut scores (one, four) and the adaptive algorithm (MAT, UCAT). With MAT, a lower average test length was achieved in comparison to UCAT. It is concluded that MAT will result in a higher percentage of correct classifications than UCAT when more than two dimensions are measured.

*Key words:* classification, computerized adaptive testing, item response theory, multidimensional adaptive testing, sequential probability ratio test

*Nicki-Nils Seitz*

Institute of Educational Science

Department of Research Methods in Education

Friedrich-Schiller-University Jena

Am Planetarium 4

07737 Jena, Germany

*nicki-nils.seitz@uni-jena.de*
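
For readers unfamiliar with the unidimensional SPRT mentioned in the abstract above, the following minimal sketch shows how a single cut-score classification could proceed. It is not the procedure used in the article: the Rasch model, the indifference-region half-width `delta`, the error rates, and the function names are all assumptions made for illustration, and no adaptive item selection is included. After each response, the log-likelihood ratio of "ability just above the cut" against "ability just below the cut" is updated and compared with Wald's two decision bounds.

```python
import math


def rasch_p(theta, b):
    """Probability of a correct response under the Rasch model (assumed here)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def sprt_classify(responses, difficulties, cut, delta=0.5, alpha=0.05, beta=0.05):
    """Classify an examinee relative to one cut score with Wald's SPRT.

    Compares theta = cut + delta against theta = cut - delta and stops as
    soon as the accumulated log-likelihood ratio crosses a decision bound.
    Returns the decision and the number of items used.
    """
    upper = math.log((1.0 - beta) / alpha)   # crossing it classifies as above
    lower = math.log(beta / (1.0 - alpha))   # crossing it classifies as below
    llr = 0.0
    for used, (u, b) in enumerate(zip(responses, difficulties), start=1):
        p_hi = rasch_p(cut + delta, b)
        p_lo = rasch_p(cut - delta, b)
        if u == 1:
            llr += math.log(p_hi / p_lo)
        else:
            llr += math.log((1.0 - p_hi) / (1.0 - p_lo))
        if llr >= upper:
            return "above", used
        if llr <= lower:
            return "below", used
    return "undecided", len(responses)


# Hypothetical response patterns on ten items of difficulty 0, cut score 0.
items = [0.0] * 10
print(sprt_classify([1] * 10, items, cut=0.0))
print(sprt_classify([0] * 10, items, cut=0.0))
print(sprt_classify([1, 0] * 5, items, cut=0.0))
```

A consistent response pattern ends the test early, while a pattern that carries little information about the side of the cut score exhausts the item pool without a decision. This is why test length is the natural outcome measure when classification procedures are compared.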