Cross-validation (statistics) - Purpose of Cross Validation

Suppose we have a model with one or more unknown parameters, and a data set to which the model can be fit (the training data set). The fitting process optimizes the model parameters to make the model fit the training data as well as possible. If we then take an independent sample of validation data from the same population as the training data, it will generally turn out that the model does not fit the validation data as well as it fits the training data. This is called overfitting, and is particularly likely to happen when the size of the training data set is small, or when the number of parameters in the model is large. Cross-validation is a way to predict the fit of a model to a hypothetical validation set when an explicit validation set is not available.

Linear regression provides a simple illustration of overfitting. In linear regression we have real response values Y1, ..., Yn and p covariates X1, ..., Xp, each a vector of n values. We can use least squares to fit a hyperplane a + b1X1 + ... + bpXp to the data, and then assess the fit using the mean squared error (MSE)


\frac{1}{n} \sum_{i=1}^{n} (Y_i - a - b_1 X_{1i} - \cdots - b_p X_{pi})^2,

where Xji is the value of variable Xj corresponding to the ith response value Yi.
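
As a minimal sketch of this computation, the following fits the hyperplane by least squares with NumPy and evaluates the training MSE. The data are simulated purely for illustration, and the helper names fit_least_squares and mse are assumptions of this example, not part of any library. The array X has one row per observation, so X[i, j] holds the ith value of the (j+1)th covariate.

```python
import numpy as np

def fit_least_squares(X, y):
    """Return (a, b) minimizing the sum of squared residuals y - a - X @ b."""
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])  # intercept column, then covariates
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]

def mse(X, y, a, b):
    """Mean squared error of the hyperplane a + X @ b on the data (X, y)."""
    residuals = y - a - X @ b
    return np.mean(residuals ** 2)

# Simulated data (illustrative only): n observations of p covariates.
rng = np.random.default_rng(0)
n, p = 50, 3
true_a, true_b = 1.0, np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(n, p))
y = true_a + X @ true_b + rng.normal(size=n)

a, b = fit_least_squares(X, y)
train_mse = mse(X, y, a, b)
print(f"training MSE: {train_mse:.3f}")
```

Because least squares minimizes the training MSE, the fitted hyperplane always scores at least as well on the training data as the coefficients that actually generated the data, which is the first hint of the optimism discussed next.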

It can be shown under mild assumptions that the expected value of the MSE for the training set is (n − p − 1)/(n + p + 1) < 1 times the expected value of the MSE for the validation set (the expected value is taken over the distribution of training sets). Thus if we fit the model and compute the MSE on the training set, we will get an optimistically biased assessment of how well the model will fit an independent data set. This biased estimate is called the in-sample estimate of the fit, whereas the cross-validation estimate is an out-of-sample estimate.
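
The optimism of the training MSE can be checked by simulation. The sketch below makes several assumptions of its own: the linear model is correctly specified, the noise is independent Gaussian, and the validation responses are drawn at the same covariate values as the training responses. Under those conditions the ratio of average training MSE to average validation MSE should approach (n − p − 1)/(n + p + 1). The function name simulate_mse_ratio and all parameter values are hypothetical choices for this illustration.

```python
import numpy as np

def simulate_mse_ratio(n=30, p=4, sigma=1.0, reps=4000, seed=0):
    """Estimate E[training MSE] / E[validation MSE] for least squares.

    Assumes a correctly specified linear model and a validation set that
    reuses the training covariate values with fresh noise.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    design = np.column_stack([np.ones(n), X])  # intercept plus p covariates
    beta = rng.normal(size=p + 1)              # arbitrary true coefficients
    mean_response = design @ beta

    train_total = 0.0
    valid_total = 0.0
    for _ in range(reps):
        y_train = mean_response + sigma * rng.normal(size=n)
        y_valid = mean_response + sigma * rng.normal(size=n)
        coef, *_ = np.linalg.lstsq(design, y_train, rcond=None)
        fitted = design @ coef
        train_total += np.mean((y_train - fitted) ** 2)
        valid_total += np.mean((y_valid - fitted) ** 2)
    return train_total / valid_total

n, p = 30, 4
ratio = simulate_mse_ratio(n=n, p=p)
expected = (n - p - 1) / (n + p + 1)
print(f"simulated ratio: {ratio:.3f}, formula: {expected:.3f}")
```

Increasing p relative to n in this sketch pushes the ratio further below 1, which matches the earlier remark that overfitting is more likely with many parameters or few observations.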

Since in linear regression it is possible to directly compute the factor (n − p − 1)/(n + p + 1) by which the training MSE underestimates the validation MSE, cross-validation is of little practical use in that setting. However, in most other regression procedures (e.g. logistic regression), there is no simple formula to make this adjustment. Cross-validation is a generally applicable way to predict the performance of a model on a validation set using computation in place of mathematical analysis.
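
One common form of this computation is k-fold cross-validation, sketched below under stated assumptions. The function k_fold_mse and its fit/predict callables are hypothetical names for this example; any fitting procedure can be plugged in, and least squares is used here only because it keeps the example self-contained. The data are again simulated for illustration.

```python
import numpy as np

def k_fold_mse(X, y, fit, predict, k=5, seed=0):
    """Out-of-sample MSE estimate: each point is predicted by a model
    fitted without the fold that contains it."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)  # k disjoint index sets
    squared_errors = np.empty(n)
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        model = fit(X[train], y[train])
        errors = y[held_out] - predict(model, X[held_out])
        squared_errors[held_out] = errors ** 2
    return squared_errors.mean()

def fit_ols(X, y):
    """Least squares fit with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return coef[0] + X @ coef[1:]

# Simulated data (illustrative only).
rng = np.random.default_rng(1)
n, p = 40, 5
X = rng.normal(size=(n, p))
y = 0.5 + X @ rng.normal(size=p) + rng.normal(size=n)

coef = fit_ols(X, y)
in_sample_mse = np.mean((y - predict_ols(coef, X)) ** 2)
cv_mse = k_fold_mse(X, y, fit_ols, predict_ols, k=5)
print(f"in-sample MSE: {in_sample_mse:.3f}, 5-fold CV MSE: {cv_mse:.3f}")
```

Passing the fitting and prediction steps as callables reflects the point of the paragraph above: the procedure needs no formula specific to the model, only the ability to refit it on subsets of the data.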
