Information Retrieval Facility - Research Collections

Research Collections

The IRF provides a number of test data collections that have been developed by the IRF, by one of its members, or by third parties. These data collections can be used freely for scientific experimentation.

The MAtrixware REsearch Collection (MAREC) is the first standardised patent data corpus for research purposes. It consists of 19 million patent documents in different languages, normalised to a highly specific XML format. The collection has been developed by Matrixware for the IRF.

The ClueWeb09 collection is a 25-terabyte dataset of about 1 billion web pages crawled in January and February 2009. It was created by the Language Technologies Institute at Carnegie Mellon University to support research on information retrieval and related human language technologies.
