Problem and Data Sets
Netflix provided a training data set of 100,480,507 ratings that 480,189 users gave to 17,770 movies. Each training rating is a quadruplet of the form . The user and movie fields are integer IDs, while grades are from 1 to 5 (integral) stars.
The qualifying data set contains over 2,817,131 triplets of the form , with grades known only to the jury. A participating team's algorithm must predict grades on the entire qualifying set, but they are only informed of the score for half of the data, the quiz set of 1,408,342 ratings. The other half is the test set of 1,408,789, and performance on this is used by the jury to determine potential prize winners. Only the judges know which ratings are in the quiz set, and which are in the test set—this arrangement is intended to make it difficult to hill climb on the test set. Submitted predictions are scored against the true grades in terms of root mean squared error (RMSE), and the goal is to reduce this error as much as possible. Note that while the actual grades are integers in the range 1 to 5, submitted predictions need not be. Netflix also identified a probe subset of 1,408,395 ratings within the training data set. The probe, quiz, and test data sets were chosen to have similar statistical properties.
In summary, the data used in the Netflix Prize looks as follows:
- Training set (99,072,112 ratings)
- Probe set (1,408,395 ratings)
- Qualifying set (2,817,131 ratings) consisting of:
- Test set (1,408,789 ratings), used to determine winners
- Quiz set (1,408,342 ratings), used to calculate leaderboard scores
For each movie, title and year of release are provided in a separate dataset. No information at all is provided about users. In order to protect the privacy of customers, "some of the rating data for some customers in the training and qualifying sets have been deliberately perturbed in one or more of the following ways: deleting ratings; inserting alternative ratings and dates; and modifying rating dates".
The training set is such that the average user rated over 200 movies, and the average movie was rated by over 5000 users. But there is wide variance in the data—some movies in the training set have as few as 3 ratings, while one user rated over 17,000 movies.
There was some controversy as to the choice of RMSE as the defining metric. Would a reduction of the RMSE by 10% really benefit the users? It has been claimed that even as small an improvement as 1% RMSE results in a significant difference in the ranking of the "top-10" most recommended movies for a user.
Read more about this topic: Netflix Prize
Famous quotes containing the words problem and, problem, data and/or sets:
“Give a scientist a problem and he will probably provide a solution; historians and sociologists, by contrast, can offer only opinions. Ask a dozen chemists the composition of an organic compound such as methane, and within a short time all twelve will have come up with the same solution of CH4. Ask, however, a dozen economists or sociologists to provide policies to reduce unemployment or the level of crime and twelve widely differing opinions are likely to be offered.”
—Derek Gjertsen, British scientist, author. Science and Philosophy: Past and Present, ch. 3, Penguin (1989)
“It is commonplace that a problem stated is well on its way to solution, for statement of the nature of a problem signifies that the underlying quality is being transformed into determinate distinctions of terms and relations or has become an object of articulate thought.”
—John Dewey (18591952)
“To write it, it took three months; to conceive it three minutes; to collect the data in itall my life.”
—F. Scott Fitzgerald (18961940)
“The vain man does not wish so much to be prominent as to feel himself prominent; he therefore disdains none of the expedients for self-deception and self-outwitting. It is not the opinion of others that he sets his heart on, but his opinion of their opinion.”
—Friedrich Nietzsche (18441900)