A statistical language model assigns a probability P(w_1, ..., w_m) to a sequence of m words by means of a probability distribution.
Language modeling is used in many natural language processing applications such as speech recognition, machine translation, part-of-speech tagging, parsing and information retrieval.
In speech recognition and in data compression, such a model tries to capture the properties of a language and to predict the next word in a sequence.
When used in information retrieval, a language model is associated with each document in a collection. Given a query Q as input, retrieved documents are ranked by the probability P(Q|M_d) that the language model M_d of document d would generate the terms of the query. This way of using language models in information retrieval is known as the query likelihood model.
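The ranking step can be illustrated with a minimal sketch. It assumes unigram document models and mixes each document estimate with a collection-wide estimate so that a missing query term does not zero out the score. That mixing step, the weight `lam`, the function names, and the toy documents are all illustrative assumptions rather than part of the description above.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc_tokens, collection_counts, collection_size, lam=0.5):
    """Return log P(Q | M_d) under a unigram document model.

    The document estimate is interpolated with a collection-wide estimate
    (weight `lam` is an arbitrary illustrative value) so that a query term
    absent from the document does not force the product to zero.
    """
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / doc_len
        p_coll = collection_counts[term] / collection_size
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen anywhere in the collection
        score += math.log(p)
    return score

def rank(query, docs):
    """Sort document ids by descending query likelihood."""
    collection_counts = Counter(t for tokens in docs.values() for t in tokens)
    collection_size = sum(collection_counts.values())
    scores = {
        doc_id: query_log_likelihood(query, tokens, collection_counts, collection_size)
        for doc_id, tokens in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical two-document collection.
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "stock markets fell sharply on monday".split(),
}
ranking = rank(["cat", "mat"], docs)
```

Here `ranking` places `d1` first, because its model assigns a higher probability to both query terms. Summing log probabilities instead of multiplying raw probabilities is a common way to avoid numerical underflow on longer queries.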
In practice, unigram language models are the most commonly used in information retrieval, as they are sufficient to determine the topic of a piece of text. A unigram model estimates the probability of each word in isolation, without considering any influence from the words before or after it. This corresponds to the bag-of-words model, and the resulting model is a multinomial distribution over words.
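As a small sketch of this idea, the following code estimates unigram probabilities by relative frequency and scores a word sequence as the product of its word probabilities. The function names and the toy corpus are assumptions made for illustration, and unseen words are simply given probability zero here.

```python
import math
from collections import Counter

def train_unigram(tokens):
    """Estimate P(w) as the relative frequency of w in the training tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def sequence_probability(model, words):
    """Multiply independent word probabilities; word order plays no role."""
    return math.prod(model.get(w, 0.0) for w in words)

# Hypothetical toy corpus.
corpus = "a rose is a rose is a rose".split()
model = train_unigram(corpus)

p_forward = sequence_probability(model, ["a", "rose"])
p_reversed = sequence_probability(model, ["rose", "a"])
```

Both orderings receive the same probability, which is the bag-of-words property: the model only sees which words occur and how often, not where they occur.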
Estimating the probability of sequences can become difficult in corpora in which phrases or sentences can be arbitrarily long, so that many sequences are never observed during training of the language model (the data sparseness problem, which leads to overfitting). For that reason, these models are often approximated using smoothed N-gram models.
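To make the effect of smoothing concrete, here is a minimal sketch of a bigram model (N = 2) with add-k (Laplace) smoothing. Add-k is only one simple smoothing scheme, chosen here for brevity. The boundary markers `<s>` and `</s>`, the function names, and the toy sentences are assumptions for illustration.

```python
from collections import Counter

def train_bigram(sentences):
    """Count bigrams and their left contexts, padding each sentence with markers."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    # Words that can be predicted: all observed words plus the end marker.
    vocab = {t for s in sentences for t in s} | {"</s>"}
    return contexts, bigrams, vocab

def bigram_prob(prev, word, contexts, bigrams, vocab, k=1.0):
    """Add-k estimate of P(word | prev); unseen pairs get a small nonzero mass."""
    return (bigrams[(prev, word)] + k) / (contexts[prev] + k * len(vocab))

# Hypothetical training data.
sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
contexts, bigrams, vocab = train_bigram(sentences)

p_seen = bigram_prob("the", "cat", contexts, bigrams, vocab)    # pair observed in training
p_unseen = bigram_prob("the", "sat", contexts, bigrams, vocab)  # pair never observed
```

Without smoothing, the unseen pair ("the", "sat") would receive probability zero, and so would any sentence containing it. With smoothing it receives a small positive probability, while the distribution over the vocabulary still sums to one for a given context.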