Robert B. Frary
Reliability is a measure of the stability of test scores. Suppose a test is administered to the same group of examinees on successive days with no intervening instruction in the area tested. If the test scores are highly stable (or reliable), each examinee will get the same or close to the same score on both administrations. The closer the pairs of scores are, the more stable or reliable they are over time.
Of course, the passage of time is not the only reason that test scores differ from one administration to another. Transitory examinee characteristics (e.g., feeling better or worse than usual) have their effect as do administration conditions (e.g., noise level, temperature, and ventilation). For multiple-choice tests, variation in luck when guessing can yield score differences from one administration to another. Reliability is affected by all such factors, and it is usually not possible to determine the relative contribution of each one separately.
A reliability coefficient is a Pearson product-moment correlation coefficient between two sets of scores as described above. Correlation coefficients range from -1 through 0 to +1, but negative values are not meaningful with respect to reliability. A coefficient of 0 means that there is no linear relationship between the two sets of scores. A coefficient of 1 would occur if each examinee got the same score on both administrations (or, more generally, if the two sets of scores were perfectly linearly related with a positive slope). A reliability coefficient (signified rxx) may be evaluated (roughly) as follows:
rxx = .90 or higher - High reliability. Suitable for making a decision about an examinee based on a single test score.
rxx = .80 to .89 - Good reliability. Suitable for use in evaluating individual examinees if averaged with a small number of other scores of similar reliability.
rxx = .60 to .79 - Low to moderate reliability. Suitable for evaluating individuals only if averaged with several other scores of similar reliability.
rxx = .40 to .59 - Doubtful reliability. Should be used only with caution in the evaluation of individual examinees. May be satisfactory for determination of average score differences between groups.
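To make the test-retest idea concrete, the following is a minimal sketch of how a reliability coefficient could be computed from two administrations and placed in the rough bands listed above. The function names and the scores for the six examinees are hypothetical and serve only as illustration; values falling between two listed bands (for example, .895) are assigned to the lower band here, which is an assumption of this sketch.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a score list has no variation")
    return sxy / sqrt(sxx * syy)

def describe_reliability(rxx):
    """Map a coefficient to the rough bands given in the text."""
    if rxx >= 0.90:
        return "high"
    if rxx >= 0.80:
        return "good"
    if rxx >= 0.60:
        return "low to moderate"
    if rxx >= 0.40:
        return "doubtful"
    return "below the listed bands"

# Hypothetical scores for six examinees on two successive days.
day1 = [42, 35, 47, 28, 39, 44]
day2 = [40, 37, 46, 30, 36, 45]

rxx = pearson_r(day1, day2)
print(round(rxx, 2), describe_reliability(rxx))
```

With so few examinees the result is only illustrative; a coefficient computed from six pairs of scores would itself be subject to considerable error.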
It is rarely feasible to administer a test twice to evaluate reliability. Nevertheless, the responses from a single administration of a test can be used to estimate reliability. These estimates are called internal consistency reliability coefficients. Measurement and Research score reports include one of these, the Kuder-Richardson formula 20 (KR20) coefficient. Computation of internal consistency reliability coefficients is based on assuming that the test is not speeded and that its content is homogeneous. These coefficients overestimate reliability if there is not sufficient time for nearly all examinees to finish. They underestimate reliability if the test content is not homogeneous. For example, suppose that a test contains questions covering two distinct areas of a course and that the number of correct answers for a student in one area is not a very good indicator of how well that student will do in the other area. In this case, rather than combine two disparate topics on the same test, it is better to separate the questions into two subtests generating separate scores and reliability coefficients.
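As an illustration of how a single administration can yield a reliability estimate, here is a minimal sketch of the standard KR20 formula, k/(k-1) times (1 minus the sum of p times q over items, divided by the variance of total scores), where p is the proportion answering an item correctly and q = 1 - p. The response matrix is hypothetical, and the sketch assumes the variance of total scores is computed with n (not n - 1) in the denominator; the score reports mentioned above may use a different convention.

```python
def kr20(responses):
    """KR20 for a list of examinee rows, each a list of 0/1 item scores."""
    n = len(responses)        # number of examinees
    k = len(responses[0])     # number of items
    if n < 2 or k < 2:
        raise ValueError("need at least 2 examinees and 2 items")

    # Variance of total scores (population form: divide by n).
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    if var_total == 0:
        raise ValueError("KR20 is undefined when all total scores are equal")

    # Sum over items of p * q, where p is the proportion correct.
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n
        sum_pq += p * (1 - p)

    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Hypothetical data: 6 examinees (rows) by 5 items (columns), 1 = correct.
responses = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
print(round(kr20(responses), 3))
```

A data set this small falls well below the sizes at which such an estimate can be trusted; it is used here only so the arithmetic can be followed by hand.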
All reliability estimates are subject to considerable error when there are small numbers of examinees or test items. If there are fewer than, say, 25 examinees or 10 items, the reliability estimate must be "taken with a grain of salt." This phenomenon is especially noticeable when there are several scrambled forms of the test, each administered to a relatively small number of examinees. Then the KR20 coefficients may fluctuate considerably from one form to another. In this case, the instructor may wish to have the responses unscrambled and evaluated as if they came from a single form. Measurement and Research will provide this analysis based on the instructor's unscrambling key.
In many cases a test will contain items for which the correlation between selection of the keyed answer and total score is very weak or negative. This outcome suggests that such items do not relate to the content measured by the other items. If items with this characteristic are dropped from the test, KR20 will usually improve. This phenomenon is discussed in more detail in Testing Memo 5.
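The effect of a poorly correlating item can be sketched as follows. The code correlates each item (scored 0/1) with the total score, flags items whose correlation is negative, and compares KR20 before and after dropping them. The data, the function names, and the choice of zero as the flagging threshold are all hypothetical; the KR20 variance again uses n in the denominator as an assumption.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation; raises if either list has no variation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant list")
    return sxy / sqrt(sxx * syy)

def kr20(responses):
    """KR20 from rows of 0/1 item scores (population variance of totals)."""
    n, k = len(responses), len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

def item_total_correlations(responses):
    """Correlation of each item's 0/1 scores with the total score."""
    totals = [sum(row) for row in responses]
    k = len(responses[0])
    return [pearson_r([row[i] for row in responses], totals) for i in range(k)]

def drop_items(responses, to_drop):
    """Return a copy of the response matrix without the given item indexes."""
    drop = set(to_drop)
    return [[x for i, x in enumerate(row) if i not in drop] for row in responses]

# Hypothetical data: 6 examinees by 5 items; the last item behaves oddly.
responses = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [1, 0, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
]

correlations = item_total_correlations(responses)
flagged = [i for i, r in enumerate(correlations) if r < 0]
kr20_before = kr20(responses)
kr20_after = kr20(drop_items(responses, flagged))
print(flagged, round(kr20_before, 2), round(kr20_after, 2))
```

In this made-up data set only the last item correlates negatively with the total score, and removing it raises KR20 substantially.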
For classroom testing, the most common cause of low reliability is test questions that are too easy. When all or nearly all of the questions are answered correctly by more than, say, 80% of the examinees, the resulting scores will be in a narrow range. For example, under such circumstances, most of the scores on a 50-question test will lie between 40 and 50. Then a small score fluctuation due to extraneous circumstances (such as those discussed above) will have a large relative effect on the class standing of the examinee. On the 50-item test just described, the range for As might be 47-50. Then a student with a headache, who would otherwise make an A, may miss one or two extra questions and make a B. At the same time, a B student who has just moderately good luck when guessing may make an A. These kinds of errors give rise to low reliability coefficients. If the test questions are harder, the scores will be more spread out and reliability will be higher. Assuming that the same numbers of each letter grade will be given, small errors in scores are then less likely to result in different grades. Testing Memo 2 discusses test difficulty in greater detail.
The standard error of measurement (SEM) is related to reliability--the lower the reliability, the larger the SEM. This statistic is included routinely in Measurement and Research score reports and reflects the extent to which an individual examinee's scores on many (hypothetical) administrations of a test would fluctuate about that examinee's average score over the many administrations. Of course, it is that average or "true" score that should be used to evaluate the examinee. If the assumptions for KR20 are met, then the odds are about 2 to 1 (or the probability is about 2/3) that the examinee's "true" score is contained within one SEM (above and below) of his or her actual score on the test. Recognition of how far the "true" score of each examinee might be from his or her actual score may suggest some liberality in determining the cut points between letter grades.
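The relationship between reliability and the SEM can be sketched with the standard formula, SEM = SD x sqrt(1 - rxx), where SD is the standard deviation of the test scores. The text does not state which formula the score reports use, so treating it as this one is an assumption; the function names and the numbers (SD of 5, reliability of .84, score of 40) are hypothetical.

```python
from math import sqrt

def standard_error_of_measurement(sd, rxx):
    """SEM from the score standard deviation and a reliability coefficient."""
    if not 0.0 <= rxx <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * sqrt(1.0 - rxx)

def one_sem_band(score, sd, rxx):
    """Range of one SEM below and above an examinee's actual score."""
    sem = standard_error_of_measurement(sd, rxx)
    return (score - sem, score + sem)

# Hypothetical test: score standard deviation 5, reliability .84.
sem = standard_error_of_measurement(sd=5.0, rxx=0.84)
low, high = one_sem_band(score=40, sd=5.0, rxx=0.84)
print(round(sem, 2), round(low, 2), round(high, 2))
```

Under these assumed values the SEM is about 2 points, so an actual score of 40 corresponds to a band of roughly 38 to 42 that would contain the "true" score about two times in three. If cut points between letter grades fall inside such a band, that is the situation in which some liberality may be warranted.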
Note that in all of the above discussion, only scores were described as being more or less reliable. Tests per se, that is the instruments themselves, cannot be described in this way. A test may yield highly reliable scores under one set of circumstances and scores of low reliability under another. Factors to be considered in this regard are administration conditions, appropriateness and difficulty of the test for the examinees, and examinee motivation and attitude. When these factors are unfavorable, the scores are likely to be less reliable than otherwise.