Cross-validation (statistics) - Limitations and Misuse

Limitations and Misuse

Cross-validation yields meaningful results only if the validation set and training set are drawn from the same population. In many applications of predictive modeling, the structure of the system being studied evolves over time, which can introduce systematic differences between the training and validation sets. For example, if a model for predicting stock values is trained on data for a certain five-year period, it is unrealistic to treat the subsequent five-year period as a draw from the same population. As another example, suppose a model is developed to predict an individual's risk of being diagnosed with a particular disease within the next year. If the model is trained using data from a study involving only a specific population group (e.g., young people or males), but is then applied to the general population, the cross-validation results from the training set could differ greatly from the actual predictive performance.
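For data that are ordered in time, one common alternative to random folds is to validate only on observations that come after the training observations. This does not make the two periods come from the same population, but it does evaluate the model the way it would actually be used, so drift shows up in the estimate instead of being hidden. The following is a minimal sketch of such a "forward-chaining" split, not a method taken from this article; the function name and parameters are hypothetical, and the samples are assumed to be sorted oldest first.

```python
def forward_chaining_splits(n_samples, n_folds):
    """Yield (train_indices, validation_indices) pairs for time-ordered data.

    Assumes samples are already sorted by time, oldest first. Each validation
    block lies strictly after every sample used for training.
    """
    if n_folds < 1 or n_samples < n_folds + 1:
        raise ValueError("need at least n_folds + 1 samples")
    block = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = k * block
        # The last fold absorbs any remainder so no sample is dropped.
        valid_end = n_samples if k == n_folds else train_end + block
        yield list(range(train_end)), list(range(train_end, valid_end))


splits = list(forward_chaining_splits(n_samples=10, n_folds=4))
for train, valid in splits:
    print(train, valid)
```

With ten samples and four folds, the first split trains on samples 0 and 1 and validates on samples 2 and 3; each later split extends the training window by one block and validates on the block that follows it.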

If carried out properly, and if the validation set and training set are from the same population, cross-validation gives a nearly unbiased estimate of predictive performance. However, there are many ways that cross-validation can be misused. If it is misused and a true validation study is subsequently performed, the prediction errors in the true validation are likely to be much worse than the cross-validation results would suggest.

These are some ways that cross-validation can be misused:

  • By using cross-validation to assess several models, and reporting only the results for the best-performing model.
  • By performing an initial analysis to identify the most informative features using the entire data set. If feature selection or model tuning is required by the modeling procedure, it must be repeated on every training set. If cross-validation itself is used to decide which features to use, an inner cross-validation must be performed on every training set to carry out the feature selection.
  • By allowing some of the training data to also be included in the test set – this can happen due to "twinning" in the data set, whereby some exactly identical or nearly identical samples are present in the data set.
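The second misuse in the list above can be illustrated on pure noise, where no model should beat chance. The sketch below is an illustration under stated assumptions, not a reference implementation: the data are synthetic, the classifier is a simple nearest-centroid rule, the number of selected features `k` is fixed (no inner cross-validation is used to choose it), and all function names are hypothetical. It compares selecting features once on the whole data set with repeating the selection inside every training fold.

```python
import random


def class_means(rows, labels, feature):
    """Mean of one feature within class 0 and within class 1."""
    sums, counts = [0.0, 0.0], [0, 0]
    for row, label in zip(rows, labels):
        sums[label] += row[feature]
        counts[label] += 1
    return sums[0] / counts[0], sums[1] / counts[1]


def select_features(rows, labels, k):
    """Pick the k features whose class means differ most on the given rows."""
    def gap(feature):
        m0, m1 = class_means(rows, labels, feature)
        return abs(m0 - m1)
    return sorted(range(len(rows[0])), key=gap, reverse=True)[:k]


def predict(train_rows, train_labels, features, row):
    """Nearest-centroid prediction restricted to the chosen features."""
    dist = [0.0, 0.0]
    for f in features:
        m0, m1 = class_means(train_rows, train_labels, f)
        dist[0] += (row[f] - m0) ** 2
        dist[1] += (row[f] - m1) ** 2
    return 0 if dist[0] <= dist[1] else 1


def cv_accuracy(rows, labels, k, n_folds, select_inside_folds):
    """k-fold cross-validated accuracy with feature selection done either way."""
    n = len(rows)
    # Misuse: features chosen once, using every sample, including test folds.
    global_features = None if select_inside_folds else select_features(rows, labels, k)
    correct = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        train_rows = [rows[i] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        # Proper use: repeat the selection on each training set only.
        features = (select_features(train_rows, train_labels, k)
                    if select_inside_folds else global_features)
        for i in test_idx:
            correct += predict(train_rows, train_labels, features, rows[i]) == labels[i]
    return correct / n


random.seed(0)
n_samples, n_features = 40, 1000
rows = [[random.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
labels = [i % 2 for i in range(n_samples)]  # unrelated to the noise features

leaky = cv_accuracy(rows, labels, k=10, n_folds=5, select_inside_folds=False)
honest = cv_accuracy(rows, labels, k=10, n_folds=5, select_inside_folds=True)
print(f"selection on all data:      {leaky:.2f}")
print(f"selection inside each fold: {honest:.2f}")
```

Because the features carry no information about the labels, the second figure should sit near chance (0.5), while the first is inflated: the features were chosen after seeing the samples that later serve as test data.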

Some statisticians have questioned the usefulness of validation samples.
