Bayesian Spam Filtering - Process

Process

Particular words have particular probabilities of occurring in spam email and in legitimate email. For instance, most email users will frequently encounter the word "Viagra" in spam email, but will seldom see it in other email. The filter doesn't know these probabilities in advance, and must first be trained so it can build them up. To train the filter, the user must manually indicate whether a new email is spam or not. For all words in each training email, the filter will adjust the probabilities that each word will appear in spam or legitimate email in its database. For instance, Bayesian spam filters will typically have learned a very high spam probability for the words "Viagra" and "refinance", but a very low spam probability for words seen only in legitimate email, such as the names of friends and family members.

After training, the word probabilities (also known as likelihood functions) are used to compute the probability that an email with a particular set of words in it belongs to either category. Each word in the email contributes to the email's spam probability, or only the most interesting words. This contribution is called the posterior probability and is computed using Bayes' theorem. Then, the email's spam probability is computed over all words in the email, and if the total exceeds a certain threshold (say 95%), the filter will mark the email as a spam.

As in any other spam filtering technique, email marked as spam can then be automatically moved to a "Junk" email folder, or even deleted outright. Some software implement quarantine mechanisms that define a time frame during which the user is allowed to review the software's decision.

The initial training can usually be refined when wrong judgements from the software are identified (false positives or false negatives). That allows the software to dynamically adapt to the ever evolving nature of spam.

Some spam filters combine the results of both Bayesian spam filtering and other heuristics (pre-defined rules about the contents, looking at the message's envelope, etc.), resulting in even higher filtering accuracy, sometimes at the cost of adaptiveness.

Read more about this topic:  Bayesian Spam Filtering

Famous quotes containing the word process:

    To exist as an advertisement of her husband’s income, or her father’s generosity, has become a second nature to many a woman who must have undergone, one would say, some long and subtle process of degradation before she sunk [sic] so low, or grovelled so serenely.
    Elizabeth Stuart Phelps (1844–1911)

    The process of education in the oldest profession in the world is like any other educational process, in that it requires time and effort and patience; it can only be acquired by taking one step at a time, though the steps become accelerated after the first few.
    Madeleine [Blair], U.S. prostitute and “madam.” Madeleine, ch. 4 (1919)

    It is part of the nature of consciousness, of how the mental apparatus works, that free reason is only a very occasional function of people’s “thinking” and that much of the process is made of reactions as standardized as those of the keys on a typewriter.
    John Dos Passos (1896–1970)