Jerard F. Kehoe
In TESTING MEMO 4 several guidelines for writing multiple-choice items were described. This memo offers some suggestions for the further improvement of multiple-choice tests using statistical data provided by the Measurement and Research Services test scoring service. The basic idea that we can capitalize on is that the psychometric behavior of "bad" items is often different from that of "good" items. Unfortunately, the items need to be administered to students in order to identify the "good" and "bad" items. This fact underscores our point of view that tests can be improved by maintaining and developing a pool of "good" items from which future tests will be drawn in part or in whole. This is particularly true for those teachers who teach the same course more than once.
What Makes an Item Psychometrically Good?
In answering this question, it is desirable to restrict our discussion to tests which are written to cover a unified development of course material such that it is unlikely that a student would do well on one part of a test and poorly on another. If a test does span topics on which students could perform so differently, the comments which follow will apply only if the corresponding topics are tested separately. Separate testing would be preferred in any case, since otherwise a single score would be ambiguous in its reporting of students' achievement.
Once the instructor is satisfied that the test items meet the above criterion and that they are indeed appropriately written (see TESTING MEMO 4), what remains is to evaluate the extent to which they discriminate among students. The degree to which this goal is attained is the basic measure of item quality for almost all college-level multiple-choice tests (see TESTING MEMO 2). For each item, the primary indicator of its power to discriminate among students is the correlation coefficient reflecting the tendency of students selecting the correct answer to have high scores. This coefficient is reported under the right answer on the third row for each item on the item-analysis portion of the printout provided by Measurement and Research Services.
ITEM NO. | KEY | OMITTED | CHOICE 1 | CHOICE 2 | CHOICE 3 | CHOICE 4 | CHOICE 5
8 | 3 | 0 | 7 | 13 | 87* | 6 | 3
 | | 0.0 | 0.06 | 0.11 | 0.75* | 0.05 | 0.03
 | | 0.0 | -0.08 | -0.18 | 0.29* | -0.13 | -0.13
The coefficient for the right answer should be positive, indicating that students selecting this choice tend to have high scores. The coefficients for the wrong choices should be negative, which means that students selecting these choices tend to have low scores.
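As a rough illustration of how coefficients of this kind can be computed, the sketch below correlates a 0/1 indicator for each choice with students' total scores. The function names, the small data set in the test, and the convention of reporting 0.0 for a choice nobody selected are assumptions made for this example; the exact computation used by the scoring service is not described in this memo.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: report 0.0 when a correlation is undefined
    return sxy / sqrt(sxx * syy)


def option_correlations(choices, total_scores, options):
    """Correlate "selected this option" (1/0) with total test score.

    choices[i] is the option student i selected on one item;
    total_scores[i] is that student's total score on the whole test.
    """
    result = {}
    for option in options:
        indicator = [1 if c == option else 0 for c in choices]
        result[option] = pearson(indicator, total_scores)
    return result
```

For a well-behaved item, the value returned for the keyed option would be positive and the values for the distractors negative, as in the printout for item 8.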
The proportion of students answering an item correctly also affects its discrimination power. This point was discussed in detail in TESTING MEMO 2 and may be summarized by saying that items answered correctly (or incorrectly) by a large proportion of examinees (more than 85%) have markedly reduced power to discriminate. On a good test, most items will be answered correctly by 30% to 80% of examinees.
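The proportions on the second row of the sample printout can be reproduced from the counts on the first row. The sketch below does this for item 8 and applies the 85% guideline; the function names and the default cutoff argument are choices made for this illustration, not part of the scoring service's output.

```python
# Counts of students selecting each choice for item 8 in the sample printout.
counts = {"omitted": 0, 1: 7, 2: 13, 3: 87, 4: 6, 5: 3}
key = 3


def choice_proportions(counts):
    """Proportion of all examinees selecting each choice."""
    total = sum(counts.values())
    return {choice: n / total for choice, n in counts.items()}


def has_reduced_discrimination(p_correct, cutoff=0.85):
    """True if more than `cutoff` of examinees got the item right, or wrong."""
    return p_correct > cutoff or (1 - p_correct) > cutoff


proportions = choice_proportions(counts)
p_correct = proportions[key]
print(round(p_correct, 2))                    # 0.75
print(has_reduced_discrimination(p_correct))  # False
```

With 87 of 116 examinees choosing the keyed answer, item 8 falls inside the 30% to 80% range suggested for most items.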
A general indicator of test quality is the reliability estimate (KR-20) reported on the test scoring/analysis printout. It reflects the extent to which the test would yield the same ranking of examinees if readministered with no effect from the first administration; in other words, its accuracy or power of discrimination. Values as low as .5 are satisfactory for short tests (10 to 15 items), while tests with over 50 items should yield KR-20 values of .8 or higher (1.0 is the maximum). In any event, important decisions concerning individual students should not be based on a single test score when the corresponding KR-20 is less than .8. Unsatisfactorily low KR-20s are usually due to an excess of very easy (or hard) items, poorly written items that do not discriminate, or violation of the precondition that the items test a unified body of content.
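For readers who want to see what goes into this estimate, here is a minimal sketch of the standard Kuder-Richardson formula 20 for items scored right/wrong. It assumes the variance of total scores is computed with the number of students in the denominator; some programs divide by one less, which gives slightly different values, and the memo does not say which convention the scoring service follows. The function name is hypothetical.

```python
def kr20(responses):
    """KR-20 reliability for a 0/1 response matrix.

    responses: list of rows, one per student; each entry is 1 (correct)
    or 0 (incorrect). Formula: k/(k-1) * (1 - sum(p*q) / var(total)).
    """
    n_students = len(responses)
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("KR-20 needs at least two items")

    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_students
    # Assumption: population variance (divide by n, not n - 1).
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_students
    if var_total == 0:
        raise ValueError("total scores have zero variance")

    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_students  # proportion correct
        sum_pq += p * (1 - p)

    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)
```

The formula makes the causes of a low KR-20 visible: very easy or very hard items contribute little to the variance of total scores, and items that do not discriminate add item variance without adding to the spread of totals.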
Following are some suggested steps for improving the ability of items to discriminate.
Item Development
Since the test scoring service provides the pertinent information as a part of its usual output, it is a relatively simple task to keep a record of each item and its performance. A suggestion is to tape a copy of each item on a 5 x 7 card with the test content area briefly described on top. In addition, tape the corresponding line from the computer printout for that item each time it is used.
A few basic rules for item development follow.
1. Items that correlate less than .15 with total test score should probably be restructured. One's best guess is that such items do not measure the same skill or ability as does the test, on the whole. Generally, a test is better (i.e., more reliable) the more homogeneous the items. Just how to restructure the item depends largely on careful thinking at this level. Begin by applying the rules of stem and option construction discussed in TESTING MEMO 4. If there are any apparent violations, correct them on the same card. Otherwise, it's probably best to write a new item altogether after considering whether the content of the item is similar to the content objectives of the test.
2. Distractors that are not chosen by any examinees should be replaced or eliminated. They are not contributing to the test's ability to discriminate the good students from the poor students. One should not be concerned if each distractor is not chosen by the same number of examinees. Different kinds of mistakes may very well be made by different numbers of students. Also, the fact that most students miss an item does not imply that the item should be changed, although such items should be double-checked for their accuracy. One should be suspicious about the correctness of any item in which a single distractor is chosen more often than all other options, including the answer, and especially so if its correlation with the total score is positive (line 3 of item analysis printout).
3. Items that virtually everyone gets right are useless for discriminating among students and should be replaced by more difficult items. This recommendation is particularly true if you adopt the traditional attitude toward letter grade assignments that letter grades more or less fit a predetermined distribution. (See TESTING MEMOS 2 and 7 for further discussions of this point.)
By constructing, recording and adjusting items in this fashion, teachers can develop a pool of items for specific content areas with conveniently available resources.
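The three rules above can also be applied mechanically to the recorded printout lines. The following is a sketch only: the function name, the flag labels, and the use of 85% as the cutoff for "virtually everyone" are assumptions made for this example, and a flag is a prompt to look at the item, not a verdict.

```python
def review_item(counts, key, key_correlation,
                min_correlation=0.15, too_easy=0.85):
    """Return a list of flags for one item.

    counts: maps each choice to the number of students selecting it
            (omits excluded); key: the correct choice;
    key_correlation: correlation of the keyed choice with total score.
    """
    flags = []
    total = sum(counts.values())
    distractors = {c: n for c, n in counts.items() if c != key}

    # Rule 1: item correlates less than .15 with total test score.
    if key_correlation < min_correlation:
        flags.append("low_correlation")

    # Rule 2: distractors chosen by no examinee.
    if any(n == 0 for n in distractors.values()):
        flags.append("unused_distractors")

    # Rule 2 (continued): one distractor chosen more often than every
    # other option, including the keyed answer, so the key may be wrong.
    for choice, n in distractors.items():
        others = [m for c, m in counts.items() if c != choice]
        if all(n > m for m in others):
            flags.append("check_key")

    # Rule 3: nearly everyone answers correctly (cutoff is an assumption).
    if total > 0 and counts[key] / total > too_easy:
        flags.append("too_easy")

    return flags
```

Applied to item 8 from the sample printout (counts 7, 13, 87, 6, 3; key 3; correlation 0.29), the function returns no flags.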
Some Further Issues
The suggestions here focus on the development of tests which are homogeneous, that is, tests intended to measure a unified content area. Only for such tests is it reasonable to maximize item-test correlations or, equivalently, KR-20 (reliability), which is the objective of rule 1 above. How high an average item-test correlation can be achieved depends in part on the content area. It is generally acknowledged that well-constructed tests in vocabulary or mathematics are more homogeneous than well-constructed tests in social sciences. This circumstance suggests that particular content areas have optimal levels of homogeneity and that these vary from discipline to discipline. Perhaps psychologists should strive for lower test homogeneity than mathematicians because course content is less homogeneous.
A second issue involving test homogeneity is that of the precision of a student's obtained test score as an estimate of that student's "true" score on the skill tested. Precision (reliability) increases as the average item-test correlation increases, all else the same; and precision decreases as the number of items decreases, all else the same. These two relationships lead to an interesting paradox: often the precision of a test can be increased simply by discarding the items with low item-test correlations, even though the test becomes shorter. For example, a 30-item multiple-choice test recently administered by the author resulted in a reliability of .79, and discarding the seven items with item-test correlations below .20 yielded a 23-item test with a reliability of .88! That is, by dropping the worst items from the test, the students' obtained scores on the shorter version are judged to be more precise estimates than the same students' obtained scores on the longer version. The reader may question whether it is ethical to throw out poorly performing questions when some students may have answered them correctly based on their knowledge of course material. Our opinion is that this practice is completely justified. The purpose of testing is to determine each student's rank. Retaining psychometrically unsatisfactory questions is contrary to this goal and degrades the accuracy of the resulting ranking.
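The effect of discarding weak items can be tried out on any scored response matrix. The sketch below is one way to do it, under several assumptions: items are scored 0/1, the item-test correlation is computed against the total score on the full test (including the item itself), items are dropped in a single pass, and the tiny data set is invented for illustration and is not the author's 30-item test.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; 0.0 if either list is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / sqrt(sxx * syy)


def kr20(responses):
    """KR-20 for a 0/1 matrix (rows = students), population variance."""
    n, k = len(responses), len(responses[0])
    totals = [sum(row) for row in responses]
    mean = sum(totals) / n
    var_total = sum((t - mean) ** 2 for t in totals) / n
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


def drop_weak_items(responses, min_correlation=0.20):
    """Keep only items whose correlation with total score meets the cutoff."""
    totals = [sum(row) for row in responses]
    kept = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        if pearson(item, totals) >= min_correlation:
            kept.append(j)
    shortened = [[row[j] for j in kept] for row in responses]
    return kept, shortened


# Invented data: five students, four consistent items plus one (the last)
# that tends to be answered correctly by the lower-scoring students.
responses = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]

kept, shortened = drop_weak_items(responses)
before = kr20(responses)
after = kr20(shortened)
print(kept)              # [0, 1, 2, 3]
print(round(before, 2))  # 0.35
print(round(after, 2))   # 0.8
```

In this toy case the shorter test is the more reliable one, which is the same direction of change reported for the 30-item example, though the sizes of the numbers carry no meaning beyond the illustration.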