Full Text Search - False-positive Problem

Free text searching is likely to retrieve many documents that are not relevant to the intended search question. Such documents are called false positives (see Type I error). The retrieval of irrelevant documents is often caused by the inherent ambiguity of natural language. In the sample diagram at right, false positives are represented by the irrelevant results (red dots) that were returned by the search (on a light-blue background).
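To make the definition concrete, the following is a minimal sketch of how false positives could be counted for a single query. It assumes the set of truly relevant documents is already known (for example from manual judgments); the function name `count_false_positives` and the document identifiers are hypothetical and chosen only for illustration.

```python
def count_false_positives(retrieved, relevant):
    """Return the false positives and the precision of one search result.

    retrieved: document ids returned by the search
    relevant:  document ids that actually answer the search question
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    # False positives are documents that were returned but are not relevant.
    false_positives = retrieved - relevant
    # Precision is the share of returned documents that are relevant.
    precision = len(retrieved & relevant) / len(retrieved) if retrieved else 0.0
    return false_positives, precision


# Hypothetical example: four documents returned, two of them irrelevant.
fps, precision = count_false_positives(["d1", "d2", "d3", "d4"], {"d1", "d3", "d5"})
```

A lower precision means that a larger fraction of the returned documents are false positives.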

Clustering techniques based on Bayesian algorithms can help reduce false positives. For a search term of "football", clustering can be used to categorize the document/data universe into "American football", "corporate football", etc. Depending on the occurrences of words relevant to each category, a search term or a search result can be placed in one or more of the categories. This technique is being extensively deployed in the e-discovery domain.
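The idea of placing a result in a category according to the words it contains can be sketched with a small naive Bayes classifier. This is only an illustration under stated assumptions, not the method of any particular e-discovery product: it is a supervised sketch that needs labeled examples, the training sentences and the "association football" category are invented for the example, and it picks the single most likely category, whereas a real system might assign a result to several.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Deliberately simple: lowercase and split on whitespace.
    return text.lower().split()


def train(labeled_docs):
    """Count words per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, category in labeled_docs:
        tokens = tokenize(text)
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def log_scores(text, model):
    """Log of prior times word likelihoods, with add-one smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[category][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[category] = score
    return scores


def categorize(text, model):
    scores = log_scores(text, model)
    return max(scores, key=scores.get)


# Hypothetical training data, invented for this sketch.
model = train([
    ("quarterback threw a touchdown pass football", "american football"),
    ("the nfl football season kicks off with a touchdown", "american football"),
    ("striker scored a goal football penalty kick", "association football"),
    ("the football league match ended with a late goal", "association football"),
])
category = categorize("football touchdown by the quarterback", model)
```

Once results are grouped this way, a user searching for "football" can restrict the result list to the intended category, which removes the results from the other categories that would otherwise appear as false positives.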
