A Proposed Index to Detect Relative Item Performance when the Focal Group Sample Size is Small

Hansen, Kari

Files

Hansen_umd_0117E_18469.pdf (3.8 MB)

Date

2017

Authors

Hansen, Kari

Advisor

Stapleton, Laura M
Jiao, Hong

DRUM DOI

https://doi.org/10.13016/M2MS3K334

Abstract

When developing educational assessments, ensuring that the test is fair to all groups of examinees is an essential part of the process. The primary statistical method for identifying potential bias in assessments is differential item functioning (DIF) analysis, where DIF refers to differences in performance on a specific test item between two groups, assuming that the ability distributions of the two groups overlap. However, this requirement is less likely to be met when the sample size for the focal group is small.

A new index, relative item performance (RI), is proposed to address the issue of small focal group sample sizes without requiring an overlap in the ability distributions. The index is calculated as the effect size of the difference in item difficulty estimates between the two groups. A simulation study was conducted to compare the proposed method with the Mantel-Haenszel test with score group widths and the Differential Item Pair Functioning method in terms of Type I error rates and power. The following factors were manipulated: the sample size of the focal group, the mean of the ability distribution, the amount of DIF, the number of items on the assessment, and the number of items that have different item difficulties.
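
For readers unfamiliar with the Mantel-Haenszel comparison method, the following is a minimal sketch of the standard Mantel-Haenszel common odds ratio and its ETS delta-scale transformation, not the dissertation's own code. The function name, the argument names, the "ref"/"focal" labels, and the reading of "score group width" as pooling adjacent total scores into strata of `width` points are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def mantel_haenszel_dif(item, total, group, width=1):
    """Sketch of the Mantel-Haenszel DIF statistic for one item.

    item  : 0/1 responses to the studied item
    total : matching variable (e.g. total test score) for each examinee
    group : "ref" or "focal" for each examinee (hypothetical labels)
    width : assumed score-group width; adjacent scores are pooled
            into strata of this many points
    Returns (alpha_MH, MH D-DIF), or None if the ratio is undefined.
    """
    # Per-stratum 2x2 counts:
    # [ref correct, ref incorrect, focal correct, focal incorrect]
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for y, s, g in zip(item, total, group):
        col = (0 if g == "ref" else 2) + (0 if y == 1 else 1)
        strata[s // width][col] += 1

    num = den = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        num += a * d / n  # reference correct x focal incorrect
        den += b * c / n  # reference incorrect x focal correct

    if num == 0 or den == 0:
        return None  # odds ratio undefined for these data

    alpha = num / den
    # ETS delta scale: negative values indicate the item is
    # relatively harder for the focal group.
    return alpha, -2.35 * math.log(alpha)
```

With a small focal group, many score strata contain few or no focal examinees, which is one reason wider score groups are considered in this setting.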

For all three methods, the main factors affecting the Type I error rates are the amount of item contamination, the size of the DIF, the ability mean of the focal group, and the item parameters. The sample size and the number of items were found to have no effect on the Type I error rates for any of the methods. Because the overall Type I error rate for the RI method is much lower than that of the MH1 and MH2 methods and is not controlled across the simulation factors, power was evaluated only for the MH1 and MH2 methods. The median power of these methods was .203 and .181, respectively. It is recommended that the MH1 and MH2 methods be used only when the sample size is larger than 100 and in conjunction with expert and cognitive review of the items on the assessment.

URI (handle)

http://hdl.handle.net/1903/20282

Collections

UMD Theses and Dissertations
Human Development & Quantitative Methodology Theses and Dissertations