n-gram Models

An n-gram model is a probabilistic model of sequences, notably natural language, built on the statistical properties of n-grams.

This idea can be traced to an experiment in Claude Shannon's work on information theory. Shannon posed the question: given a sequence of letters (for example, the sequence "for ex"), what is the likelihood of the next letter? From training data, one can derive a probability distribution for the next letter given a history of size n: a = 0.4, b = 0.00001, c = 0, …; where the probabilities of all possible "next letters" sum to 1.0.
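A minimal Python sketch of this counting procedure (the function name and toy corpus are illustrative, not from the article):

```python
from collections import Counter

def next_letter_distribution(text, history):
    """Estimate P(next letter | history) by counting continuations in text."""
    n = len(history)
    counts = Counter(
        text[i + n]
        for i in range(len(text) - n)
        if text[i:i + n] == history
    )
    total = sum(counts.values())
    # Normalize so the probabilities of all possible next letters sum to 1.0.
    return {ch: c / total for ch, c in counts.items()}

corpus = "for example and for extra, for everyone"
print(next_letter_distribution(corpus, "for ex"))  # {'a': 0.5, 't': 0.5}
```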

More concisely, an n-gram model predicts x_i based on x_{i-(n-1)}, …, x_{i-1}. In probability terms, this is P(x_i | x_{i-(n-1)}, …, x_{i-1}). When used for language modeling, independence assumptions are made so that each word depends only on the last n-1 words. This Markov model is used as an approximation of the true underlying language. This assumption is important because it massively simplifies the problem of learning the language model from data. In addition, because of the open nature of language, it is common to group words unknown to the language model together (for example, under a single "unknown word" token).
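As a hedged sketch of the bigram (n = 2) case in Python, with unknown words grouped under a single <unk> token (the names and toy data here are ours, not the article's):

```python
from collections import Counter

UNK = "<unk>"  # single bucket for words unknown to the model

def train_bigram(sentences, vocab):
    """Count unigrams and bigrams, mapping out-of-vocabulary words to UNK."""
    unigram, bigram = Counter(), Counter()
    for sent in sentences:
        words = [w if w in vocab else UNK for w in sent.split()]
        unigram.update(words)
        bigram.update(zip(words, words[1:]))  # adjacent word pairs
    return unigram, bigram

def prob(unigram, bigram, prev, word, vocab):
    """Maximum-likelihood estimate of P(word | prev): each word depends
    only on the single preceding word (the Markov assumption for n = 2)."""
    prev = prev if prev in vocab else UNK
    word = word if word in vocab else UNK
    return bigram[(prev, word)] / unigram[prev] if unigram[prev] else 0.0

vocab = {"the", "cat", "dog", "sat"}
ug, bg = train_bigram(["the cat sat", "the dog sat"], vocab)
print(prob(ug, bg, "the", "cat", vocab))  # 0.5: "the" precedes "cat" half the time
```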

Note that in a simple n-gram language model, the probability of a word, conditioned on some number of previous words (one word in a bigram model, two words in a trigram model, etc.), can be described as following a categorical distribution (often imprecisely called a "multinomial distribution").
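In symbols (the notation here is ours, not the article's): for a fixed history h of the n-1 preceding words and a vocabulary V, the next word follows a categorical distribution whose maximum-likelihood parameters are the relative frequencies observed in training data:

```latex
% Next-word distribution for a fixed history h = (w_{i-n+1}, ..., w_{i-1}):
% a categorical distribution over the vocabulary V.
P(w_i = v \mid h) = \theta_{h,v},
\qquad \theta_{h,v} \ge 0,
\qquad \sum_{v \in V} \theta_{h,v} = 1
% Maximum-likelihood estimate: relative frequency in the training data.
\hat{\theta}_{h,v} = \frac{\operatorname{count}(h, v)}{\sum_{v' \in V} \operatorname{count}(h, v')}
```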

In practice, however, one should smooth the probability distributions by also assigning non-zero probabilities to unseen words or n-grams. See Smoothing techniques for details.
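As one concrete example (the linked Smoothing techniques cover many methods; add-one, or Laplace, smoothing is among the simplest), a sketch that builds on the bigram counters above:

```python
def laplace_prob(unigram, bigram, prev, word, vocab_size, alpha=1.0):
    """Add-alpha (Laplace) smoothed estimate of P(word | prev): every
    bigram, seen or unseen, receives a non-zero probability."""
    return (bigram[(prev, word)] + alpha) / (unigram[prev] + alpha * vocab_size)

# An unseen pair such as ("dog", "cat") now gets a small non-zero probability:
print(laplace_prob(ug, bg, "dog", "cat", vocab_size=len(vocab) + 1))  # +1 for <unk>
```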
