Full Text Search - False-positive Problem

False-positive Problem

Free text searching is likely to retrieve many documents that are not relevant to the intended search question. Such documents are called false positives (see Type I error). The retrieval of irrelevant documents is often caused by the inherent ambiguity of natural language. In the sample diagram at right, false positives are represented by the irrelevant results (red dots) that were returned by the search (on a light-blue background).

Clustering techniques based on Bayesian algorithms can help reduce false positives. For a search term of "football", clustering can be used to categorize the document/data universe into "American football", "corporate football", etc. Depending on how often words relevant to each category occur, a search term or a search result can be placed in one or more of the categories. This technique is widely used in the e-discovery domain.
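The idea of placing a result into categories by word occurrence can be illustrated with a small sketch. It uses a multinomial naive Bayes classifier with add-one smoothing, which is only one possible Bayesian approach and is not necessarily what any particular search or e-discovery product does. The training snippets, the "association football" label, the `margin` threshold, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical labeled snippets; categories and texts are illustrative only.
TRAINING = {
    "american football": [
        "quarterback threw a touchdown pass in the nfl game",
        "the nfl team scored a touchdown after the kickoff",
    ],
    "association football": [
        "the striker scored a goal in the premier league match",
        "the goalkeeper saved a penalty in the league match",
    ],
}


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def train(examples):
    """Count words per category and collect the shared vocabulary."""
    word_counts = {cat: Counter() for cat in examples}
    doc_counts = {cat: len(docs) for cat, docs in examples.items()}
    for cat, docs in examples.items():
        for doc in docs:
            word_counts[cat].update(tokenize(doc))
    vocab = set().union(*word_counts.values())
    return word_counts, doc_counts, vocab


def log_scores(text, model):
    """Return log prior plus smoothed log likelihood for each category."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for cat, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[cat] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        scores[cat] = score
    return scores


def categorize(text, model, margin=1.0):
    """Return every category whose score is within `margin` of the best one."""
    scores = log_scores(text, model)
    best = max(scores.values())
    return sorted(cat for cat, s in scores.items() if best - s <= margin)


model = train(TRAINING)
print(categorize("quarterback touchdown", model))
print(categorize("football", model))
```

In this sketch, a result that mentions category-specific words lands in a single category, while the bare term "football" carries no distinguishing evidence and stays in both. That mirrors the ambiguity that produces false positives in the first place.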
