A General Method for Estimating the Classification Reliability of Complex Decisions Based on Configural Combinations of Multiple Assessment Scores
This study presents a general method for estimating the classification reliability of complex decisions based on multiple scores from a single test administration. The proposed method consists of four steps that can be applied to a variety of measurement models and configural rules for combining test scores. Step 1: Fit a measurement model to the observed data. Step 2: Simulate replicate distributions of plausible observed scores based on the measurement model. Step 3: Construct contingency tables that show the congruence between true and replicate scores (for decision accuracy) and between two replicate scores (for decision consistency). Step 4: Calculate measures that characterize agreement in the contingency tables. Using a classical test theory model, a simulation study explores how classification accuracy and consistency are affected by increasing the number of tests, the strength of the relationship among tests, and the number of opportunities to pass. Next, the model is applied to actual data from the GED Testing Service to illustrate the utility of the method for informing practical decisions. Simulation results support the validity of the method for estimating classification reliability, and the method provides credible estimates of classification reliability for the GED Tests. Applying configural rules yields complex findings, and results for classification accuracy and classification consistency sometimes differ. Unexpected findings support the value of using the method to explore classification reliability as a means of improving decision rules.
Highlighted findings: 1) The compensatory rule (in which test scores are added) performs consistently well across almost all conditions; 2) Conjunctive and complementary rules frequently show opposite results; 3) Including more tests in the decision rule influences classification reliability differently depending on the rule; 4) Combining scores from highly related tests increases classification reliability; 5) Providing multiple opportunities to pass yields mixed results. Future studies are suggested to explore the use of other measurement models, varying levels of test reliability, the modeling of multiple attempts in which learning occurs between testings, and in-depth study of incorrectly classified examinees.
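The four-step procedure described in the abstract can be illustrated with a minimal sketch. It is not the study's implementation and rests on several assumptions: a classical test theory model with normally distributed true scores and errors, a known common reliability, synthetic data in place of real test scores, at least two tests, and proportion agreement and Cohen's kappa as the agreement measures (the abstract does not name specific measures). All function names are hypothetical. The compensatory cut is applied to the summed score, the conjunctive rule requires passing every test, and the complementary rule is taken here to mean passing at least one test.

```python
import numpy as np

def fit_ctt(observed, reliability):
    """Step 1 (assumed form): split observed variance into true and error parts."""
    mean = observed.mean(axis=0)
    obs_cov = np.cov(observed, rowvar=False)           # requires >= 2 tests
    err_var = (1.0 - reliability) * np.diag(obs_cov)   # CTT: error variance
    true_cov = obs_cov - np.diag(err_var)              # errors assumed uncorrelated
    return mean, true_cov, np.sqrt(err_var)

def simulate(mean, true_cov, err_sd, n, rng):
    """Step 2: draw true scores plus two independent replicate observed scores."""
    true = rng.multivariate_normal(mean, true_cov, size=n)
    rep1 = true + rng.normal(0.0, err_sd, size=true.shape)
    rep2 = true + rng.normal(0.0, err_sd, size=true.shape)
    return true, rep1, rep2

def decide(scores, rule, cut):
    """Apply a configural pass/fail rule to an (examinees x tests) array."""
    if rule == "compensatory":                         # scores are added
        return scores.sum(axis=1) >= cut * scores.shape[1]
    passed = scores >= cut
    if rule == "conjunctive":                          # must pass every test
        return passed.all(axis=1)
    if rule == "complementary":                        # assumed: pass any test
        return passed.any(axis=1)
    raise ValueError(f"unknown rule: {rule}")

def contingency(a, b):
    """Step 3: 2x2 table of fail/pass decisions (rows: a, columns: b)."""
    table = np.zeros((2, 2), dtype=int)
    np.add.at(table, (a.astype(int), b.astype(int)), 1)
    return table

def agreement(table):
    """Step 4: proportion agreement and Cohen's kappa (one choice of measures)."""
    p = table / table.sum()
    p0 = np.trace(p)
    pe = (p.sum(axis=1) * p.sum(axis=0)).sum()
    kappa = (p0 - pe) / (1.0 - pe) if pe < 1.0 else float("nan")
    return p0, kappa

# Synthetic "observed" data: 3 tests, reliability 0.9, true-score correlation 0.6.
rng = np.random.default_rng(0)
k, rho, r, n = 3, 0.9, 0.6, 5000
sigma_true = rho * (np.full((k, k), r) + (1.0 - r) * np.eye(k))
observed = (rng.multivariate_normal(np.zeros(k), sigma_true, size=n)
            + rng.normal(0.0, np.sqrt(1.0 - rho), size=(n, k)))

mean, true_cov, err_sd = fit_ctt(observed, rho)
true, rep1, rep2 = simulate(mean, true_cov, err_sd, n, rng)

results = {}
for rule in ("compensatory", "conjunctive", "complementary"):
    acc = agreement(contingency(decide(true, rule, 0.0), decide(rep1, rule, 0.0)))
    con = agreement(contingency(decide(rep1, rule, 0.0), decide(rep2, rule, 0.0)))
    results[rule] = (acc, con)
    print(f"{rule:13s} accuracy p0={acc[0]:.3f} kappa={acc[1]:.3f} | "
          f"consistency p0={con[0]:.3f} kappa={con[1]:.3f}")
```

Decision accuracy compares decisions made on simulated true scores with those made on one replicate, while decision consistency compares two replicates that share the same true scores. Changing `k`, `r`, or the rule passed to `decide` mirrors, in simplified form, the kinds of conditions the simulation study varies.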