Cross-validation (statistics) - Limitations and Misuse

Limitations and Misuse

Cross-validation only yields meaningful results if the validation set and training set are drawn from the same population. In many applications of predictive modeling, the structure of the system being studied evolves over time. This can introduce systematic differences between the training and validation sets. For example, if a model for predicting stock values is trained on data for a certain five-year period, it is unrealistic to treat the subsequent five-year period as a draw from the same population. As another example, suppose a model is developed to predict an individual's risk for being diagnosed with a particular disease within the next year. If the model is trained using data from a study involving only a specific population group (e.g. young people or males), but is then applied to the general population, the cross-validation results from the training set could differ greatly from the actual predictive performance.

If carried out properly, and if the validation set and training set are from the same population, cross-validation is nearly unbiased. However there are many ways that cross-validation can be misused. If it is misused and a true validation study is subsequently performed, the prediction errors in the true validation are likely to be much worse than would be expected based on the results of cross-validation.

These are some ways that cross-validation can be misused:

  • By using cross-validation to assess several models, and only stating the results for the model with the best results.
  • By performing an initial analysis to identify the most informative features using the entire data set – if feature selection or model tuning is required by the modeling procedure, this must be repeated on every training set. If cross-validation is used to decide which features to use, an inner cross-validation to carry out the feature selection on every training set must be performed.
  • By allowing some of the training data to also be included in the test set – this can happen due to "twinning" in the data set, whereby some exactly identical or nearly identical samples are present in the data set.

It should be noted that some statisticians have questioned the usefulness of validation samples.

Read more about this topic:  Cross-validation (statistics)

Famous quotes containing the words limitations and/or misuse:

    The only rules comedy can tolerate are those of taste, and the only limitations those of libel.
    James Thurber (1894–1961)

    Almost all scholarly research carries practical and political implications. Better that we should spell these out ourselves than leave that task to people with a vested interest in stressing only some of the implications and falsifying others. The idea that academics should remain “above the fray” only gives ideologues license to misuse our work.
    Stephanie Coontz (b. 1944)