Reliability (psychometrics) - Estimation

Estimation

The goal of estimating reliability is to determine how much of the variability in test scores is due to errors in measurement and how much is due to variability in true scores.

Four practical strategies have been developed that provide workable methods of estimating test reliability.

1. Test-retest reliability method: directly assesses the degree to which test scores are consistent from one test administration to the next.

It involves:

  • Administering a test to a group of individuals
  • Re-administering the same test to the same group at some later time
  • Correlating the first set of scores with the second

The correlation between scores on the first test and the scores on the retest is used to estimate the reliability of the test using the Pearson product-moment correlation coefficient: see also item-total correlation.

2. Parallel-forms method:

The key to this method is the development of alternate test forms that are equivalent in terms of content, response processes and statistical characteristics. For example, alternate forms exist for several tests of general intelligence, and these tests are generally seen equivalent.

With the parallel test model it is possible to develop two forms of a test that are equivalent in the sense that a person’s true score on form A would be identical to their true score on form B. If both forms of the test were administered to a number of people, differences between scores on form A and form B may be due to errors in measurement only.

It involves:

  • Administering one form of the test to a group of individuals
  • At some later time, administering an alternate form of the same test to the same group of people
  • Correlating scores on form A with scores on form B

The correlation between scores on the two alternate forms is used to estimate the reliability of the test.

This method provides a partial solution to many of the problems inherent in the test-retest reliability method. For example, since the two forms of the test are different, carryover effect is less of a problem. Reactivity effects are also partially controlled; although taking the first test may change responses to the second test. However, it is reasonable to assume that the effect will not be as strong with alternate forms of the test as with two administrations of the same test.

However, this technique has its disadvantages:

  • It may very difficult to create several alternate forms of a test
  • It may also be difficult if not impossible to guarantee that two alternate forms of a test are parallel measures

3. Split-half method:

This method treats the two halves of a measure as alternate forms. It provides a simple solution to the problem that the parallel-forms method faces: the difficulty in developing alternate forms.

It involves:

  • Administering a test to a group of individuals
  • Splitting the test in half
  • Correlating scores on one half of the test with scores on the other half of the test

The correlation between these two split halves is used in estimating the reliability of the test. This halves reliability estimate is then stepped up to the full test length using the Spearman–Brown prediction formula.

There are several ways of splitting a test to estimate reliability. For example, a 40-item vocabulary test could be split into two subtests, the first one made up of items 1 through 20 and the second made up of items 21 through 40. However, the responses from the first half may be systematically different from responses in the second half due to an increase in item difficulty and fatigue.

In splitting a test, the two halves would need to be as similar as possible, both in terms of their content and in terms of the probable state of the respondent. The simplest method is to adopt an odd-even split, in which the odd-numbered items form one half of the test and the even-numbered items form the other. This arrangement guarantees that each half will contain an equal number of items from the beginning, middle, and end of the original test.

4. Internal consistency: assesses the consistency of results across items within a test. The most common internal consistency measure is Cronbach's alpha, which is usually interpreted as the mean of all possible split-half coefficients. Cronbach's alpha is a generalization of an earlier form of estimating internal consistency, Kuder–Richardson Formula 20. Although the most commonly used, there are some misconceptions regarding Cronbach's alpha.

These measures of reliability differ in their sensitivity to different sources of error and so need not be equal. Also, reliability is a property of the scores of a measure rather than the measure itself and are thus said to be sample dependent. Reliability estimates from one sample might differ from those of a second sample (beyond what might be expected due to sampling variations) if the second sample is drawn from a different population because the true variability is different in this second population. (This is true of measures of all types—yardsticks might measure houses well yet have poor reliability when used to measure the lengths of insects.)

Reliability may be improved by clarity of expression (for written assessments), lengthening the measure, and other informal means. However, formal psychometric analysis, called item analysis, is considered the most effective way to increase reliability. This analysis consists of computation of item difficulties and item discrimination indices, the latter index involving computation of correlations between the items and sum of the item scores of the entire test. If items that are too difficult, too easy, and/or have near-zero or negative discrimination are replaced with better items, the reliability of the measure will increase.

  • (where is the failure rate)

Read more about this topic:  Reliability (psychometrics)

Famous quotes containing the word estimation:

    No man ever stood lower in my estimation for having a patch in his clothes; yet I am sure that there is greater anxiety, commonly, to have fashionable, or at least clean and unpatched clothes, than to have a sound conscience.
    Henry David Thoreau (1817–1862)

    ... it would be impossible for women to stand in higher estimation than they do here. The deference that is paid to them at all times and in all places has often occasioned me as much surprise as pleasure.
    Frances Wright (1795–1852)

    A higher class, in the estimation and love of this city- building, market-going race of mankind, are the poets, who, from the intellectual kingdom, feed the thought and imagination with ideas and pictures which raise men out of the world of corn and money, and console them for the short-comings of the day, and the meanness of labor and traffic.
    Ralph Waldo Emerson (1803–1882)