Cross-validation (statistics) - Purpose of Cross Validation

Suppose we have a model with one or more unknown parameters, and a data set to which the model can be fit (the training data set). The fitting process optimizes the model parameters to make the model fit the training data as well as possible. If we then take an independent sample of validation data from the same population as the training data, it will generally turn out that the model does not fit the validation data as well as it fits the training data. This is called overfitting, and is particularly likely to happen when the size of the training data set is small, or when the number of parameters in the model is large. Cross-validation is a way to predict the fit of a model to a hypothetical validation set when an explicit validation set is not available.

Linear regression provides a simple illustration of overfitting. In linear regression we have real response values Y1, ..., Yn and p covariates X1, ..., Xp, each a vector of n observations. We can use least squares to fit a hyperplane a + b1X1 + ... + bpXp to the data, and then assess the fit using the mean squared error (MSE)


\frac{1}{n} \sum_{i=1}^{n} (Y_i - a - b_1X_{1i} - \cdots - b_pX_{pi})^2,

where Xji is the value of variable Xj corresponding to the ith response value Yi.
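
As a minimal sketch of this computation, the Python below handles only the simplest case, p = 1, where the least-squares line has a closed form, and then evaluates the MSE on the data used for fitting. The function names `fit_line` and `mse` and the toy data are illustrative assumptions, not part of the original text.

```python
def fit_line(x, y):
    """Closed-form least-squares fit of y = a + b*x for one covariate (p = 1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def mse(a, b, x, y):
    """Mean squared error: sum over i of (y_i - a - b*x_i)^2, divided by n."""
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / len(x)


# Hypothetical toy data, roughly y = 1 + 2x with small deviations.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1]

a, b = fit_line(x, y)
training_mse = mse(a, b, x, y)
print(f"a = {a:.3f}, b = {b:.3f}, training MSE = {training_mse:.4f}")
```

Because the MSE here is computed on the same data that determined a and b, it is the in-sample figure.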

It can be shown under mild assumptions that the expected value of the MSE for the training set is (n − p − 1)/(n + p + 1) < 1 times the expected value of the MSE for the validation set (the expected value is taken over the distribution of training sets). Thus, if we fit the model and compute the MSE on the training set, we will get an optimistically biased assessment of how well the model will fit an independent data set. This biased estimate is called the in-sample estimate of the fit, whereas the cross-validation estimate is an out-of-sample estimate.
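
The optimism of the training MSE can be checked with a small simulation. The sketch below rests on assumptions that are not in the original text: a single covariate (p = 1), covariate values held fixed across repetitions, independent Gaussian noise of constant variance, and validation responses drawn at the same covariate values as the training responses. In that setting the ratio of the two averaged MSEs should come out close to (n − p − 1)/(n + p + 1). The names `fit_line`, `mse`, and `simulate`, and the true line 1 + 2x, are hypothetical.

```python
import random


def fit_line(x, y):
    """Closed-form least-squares fit of y = a + b*x for one covariate (p = 1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def mse(a, b, x, y):
    """Mean squared error of the line a + b*x on the pairs (x, y)."""
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / len(x)


def simulate(n=10, reps=5000, sigma=1.0, seed=0):
    """Average training and validation MSE over many simulated training sets."""
    rng = random.Random(seed)
    x = [float(i) for i in range(n)]   # fixed covariate values (assumption)
    true_a, true_b = 1.0, 2.0          # hypothetical true line
    train_total = 0.0
    val_total = 0.0
    for _ in range(reps):
        y_train = [true_a + true_b * xi + rng.gauss(0.0, sigma) for xi in x]
        y_val = [true_a + true_b * xi + rng.gauss(0.0, sigma) for xi in x]
        a, b = fit_line(x, y_train)            # fit on training data only
        train_total += mse(a, b, x, y_train)   # in-sample error
        val_total += mse(a, b, x, y_val)       # error on independent responses
    return train_total / reps, val_total / reps


n, p = 10, 1
train_mse, val_mse = simulate(n=n)
print(f"mean training MSE:   {train_mse:.3f}")
print(f"mean validation MSE: {val_mse:.3f}")
print(f"observed ratio:      {train_mse / val_mse:.3f}")
print(f"(n - p - 1)/(n + p + 1) = {(n - p - 1) / (n + p + 1):.3f}")
```

Averaging over many simulated training sets approximates the expectation over the distribution of training sets; a single run with few repetitions can deviate noticeably from the formula.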

Since in linear regression it is possible to directly compute the factor (n − p − 1)/(n + p + 1) by which the training MSE underestimates the validation MSE, cross-validation is not practically useful in that setting. However, in most other regression procedures (e.g. logistic regression), there is no simple formula to make this adjustment. Cross-validation is a generally applicable way to predict the performance of a model on a validation set using computation in place of mathematical analysis.
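
One common form of the computation is k-fold cross-validation: split the data into k parts, fit on k − 1 of them, measure the error on the part left out, and repeat so that every observation is held out once. The sketch below applies this to a one-covariate least-squares fit purely for illustration; the same loop works with any fitting procedure substituted for `fit_line`. The function names, the choice k = 5, and the simulated data are assumptions, not part of the original text.

```python
import random


def fit_line(x, y):
    """Closed-form least-squares fit of y = a + b*x for one covariate (p = 1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def mse(a, b, x, y):
    """Mean squared error of the line a + b*x on the pairs (x, y)."""
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / len(x)


def k_fold_mse(x, y, k=5, seed=0):
    """Estimate out-of-sample MSE by k-fold cross-validation."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)        # random assignment to folds
    folds = [indices[i::k] for i in range(k)]   # k disjoint index sets
    squared_error_total = 0.0
    for held_out in folds:
        held = set(held_out)
        x_train = [x[i] for i in indices if i not in held]
        y_train = [y[i] for i in indices if i not in held]
        a, b = fit_line(x_train, y_train)       # fit without the held-out fold
        squared_error_total += sum((y[i] - a - b * x[i]) ** 2 for i in held_out)
    return squared_error_total / len(x)


# Hypothetical data: y = 1 + 2x plus Gaussian noise.
rng = random.Random(1)
x = [float(i) for i in range(20)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

a, b = fit_line(x, y)
in_sample = mse(a, b, x, y)
out_of_sample = k_fold_mse(x, y, k=5)
print(f"in-sample MSE:        {in_sample:.3f}")
print(f"5-fold CV MSE:        {out_of_sample:.3f}")
```

Each held-out observation is predicted by a model that never saw it, which is what makes the result an out-of-sample estimate.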
