Cross-validation (statistics) - Statistical Properties

Statistical Properties

Suppose we choose a measure of fit F, and use cross-validation to produce an estimate F* of the expected fit EF of a model to an independent data set drawn from the same population as the training data. If we imagine sampling multiple independent training sets following the same distribution, the resulting values for F* will vary. The statistical properties of F* result from this variation.

The cross-validation estimator F* is very nearly unbiased for EF. The reason that it is slightly biased is that the training set in cross-validation is slightly smaller than the actual data set (e.g. for LOOCV the training set size is n − 1 when there are n observed cases). In nearly all situations, the effect of this bias will be conservative in that the estimated fit will be slightly biased in the direction suggesting a poorer fit. In practice, this bias is rarely a concern.

The variance of F* can be large. For this reason, if two statistical procedures are compared based on the results of cross-validation, it is important to note that the procedure with the better estimated performance may not actually be the better of the two procedures (i.e. it may not have the better value of EF). Some progress has been made on constructing confidence intervals around cross-validation estimates, but this is considered a difficult problem.

Read more about this topic:  Cross-validation (statistics)

Famous quotes containing the word properties:

    The reason why men enter into society, is the preservation of their property; and the end why they choose and authorize a legislative, is, that there may be laws made, and rules set, as guards and fences to the properties of all the members of the society: to limit the power, and moderate the dominion, of every part and member of the society.
    John Locke (1632–1704)