Tf*idf - Motivation

Motivation

Suppose we have a set of English text documents and wish to determine which document is most relevant to the query "the brown cow". A simple way to start out is by eliminating documents that do not contain all three words "the", "brown", and "cow", but this still leaves many documents. To further distinguish them, we might count the number of times each term occurs in each document and sum them all together; the number of times a term occurs in a document is called its term frequency.

However, because the term "the" is so common, this will tend to incorrectly emphasize documents which happen to use the word "the" more frequently, without giving enough weight to the more meaningful terms "brown" and "cow". The term "the" is not a good keyword to distinguish relevant and non-relevant documents and terms, unlike the less common words "brown" and "cow". Hence an inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.

Read more about this topic:  Tf*idf

Famous quotes containing the word motivation:

    Self-determination has to mean that the leader is your individual gut, and heart, and mind or we’re talking about power, again, and its rather well-known impurities. Who is really going to care whether you live or die and who is going to know the most intimate motivation for your laughter and your tears is the only person to be trusted to speak for you and to decide what you will or will not do.
    June Jordan (b. 1939)