Language Model - Unigram Models

Unigram Models

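A unigram model used in information retrieval can be treated as the combination of several one-state finite automata. It treats the terms in a context as independent of one another, so the joint probability factors into per-term probabilities, e.g. from $P(t_1 t_2 t_3) = P(t_1) P(t_2 \mid t_1) P(t_3 \mid t_1 t_2)$ to $P_{\text{uni}}(t_1 t_2 t_3) = P(t_1) P(t_2) P(t_3)$.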

In this model, the probability of each word depends only on that word itself, so the units are one-state finite automata. Each automaton has exactly one way to reach its only state, and that transition carries a single probability. Across the whole model, these probabilities sum to 1. The following table illustrates a unigram model of a document.

Terms    Probability in doc
a        0.1
world    0.2
likes    0.05
we       0.05
share    0.3
...      ...

The probability of generating a specific query is calculated as

$$P(\text{query}) = \prod_{\text{term} \in \text{query}} P(\text{term})$$
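
As a minimal sketch, the query probability under a unigram model is the product of the per-term probabilities. The Python below uses only the five terms listed in the table above (the elided rows are left out), and the helper name `query_probability` is hypothetical. Treating a term missing from the model as having probability 0 is an assumption made for this illustration.

```python
from math import prod

# Term probabilities copied from the table above; the "..." rows are omitted.
doc_model = {"a": 0.1, "world": 0.2, "likes": 0.05, "we": 0.05, "share": 0.3}


def query_probability(query_terms, model):
    """Multiply the unigram probabilities of the query terms.

    Assumption: a term absent from the model contributes probability 0.
    """
    return prod(model.get(term, 0.0) for term in query_terms)


# P("we share") = P("we") * P("share") = 0.05 * 0.3, i.e. about 0.015
p = query_probability(["we", "share"], doc_model)
print(p)
```

Because a single unseen term drives the whole product to zero, unsmoothed models of this kind are rarely used as they stand.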

Each document has its own unigram model, with its own term probabilities. The probability of generating a query is computed under each document's model, and the documents are then ranked for that query by these generation probabilities. The following table shows the unigram models of two documents.

Terms    Probability in Doc1    Probability in Doc2
a        0.1                    0.3
world    0.2                    0.1
likes    0.05                   0.03
we       0.05                   0.02
share    0.3                    0.2
...      ...                    ...
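
Ranking by generation probability could be sketched as follows, using only the terms listed in the two-document table above. The function name `rank_documents` and the example queries are hypothetical and chosen purely for illustration.

```python
from math import prod

# Term probabilities copied from the table above; the "..." rows are omitted.
doc_models = {
    "Doc1": {"a": 0.1, "world": 0.2, "likes": 0.05, "we": 0.05, "share": 0.3},
    "Doc2": {"a": 0.3, "world": 0.1, "likes": 0.03, "we": 0.02, "share": 0.2},
}


def rank_documents(query_terms, models):
    """Return (document, score) pairs sorted by descending query probability.

    Assumption: a term absent from a model contributes probability 0.
    """
    scores = {
        name: prod(model.get(term, 0.0) for term in query_terms)
        for name, model in models.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# "we share": Doc1 scores 0.05 * 0.3, Doc2 scores 0.02 * 0.2, so Doc1 ranks first.
print(rank_documents(["we", "share"], doc_models))
# "a world": Doc1 scores 0.1 * 0.2, Doc2 scores 0.3 * 0.1, so Doc2 ranks first.
print(rank_documents(["a", "world"], doc_models))
```

For long queries the product of many small probabilities can underflow, so implementations commonly sum log-probabilities instead. The ordering is the same.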

In information retrieval contexts, unigram language models are often smoothed to avoid instances where $P(\text{term}) = 0$. A common approach is to generate a maximum-likelihood model for the entire collection and linearly interpolate the collection model with a maximum-likelihood model for each document to create a smoothed document model.
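
The linear interpolation described above (often called Jelinek-Mercer smoothing) can be sketched as below. The toy documents, the helper names `mle` and `smoothed_prob`, and the interpolation weight `lam=0.5` are all assumptions made for illustration. In practice the weight is tuned per collection.

```python
from collections import Counter


def mle(tokens):
    """Maximum-likelihood unigram model: relative frequency of each term."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def smoothed_prob(term, doc_model, collection_model, lam=0.5):
    """Interpolate document and collection estimates.

    P(term | doc) = lam * P_ml(term | doc) + (1 - lam) * P_ml(term | collection)
    The value of lam is a hypothetical default, not a recommended setting.
    """
    return (lam * doc_model.get(term, 0.0)
            + (1 - lam) * collection_model.get(term, 0.0))


# Toy collection invented for this example.
docs = {
    "doc1": "we share a world".split(),
    "doc2": "a world likes a share".split(),
}
collection_model = mle([t for tokens in docs.values() for t in tokens])
doc1_model = mle(docs["doc1"])

# "likes" never occurs in doc1, so its unsmoothed probability there is 0,
# but the collection estimate keeps the smoothed probability above 0.
print(doc1_model.get("likes", 0.0))
print(smoothed_prob("likes", doc1_model, collection_model))
```

Since both component models sum to 1 over the collection vocabulary, the interpolated model does as well.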
