Process
Particular words have particular probabilities of occurring in spam email and in legitimate email. For instance, most email users will frequently encounter the word "Viagra" in spam email, but will seldom see it in other email. The filter doesn't know these probabilities in advance, and must first be trained so it can build them up. To train the filter, the user must manually indicate whether a new email is spam or not. For all words in each training email, the filter will adjust the probabilities that each word will appear in spam or legitimate email in its database. For instance, Bayesian spam filters will typically have learned a very high spam probability for the words "Viagra" and "refinance", but a very low spam probability for words seen only in legitimate email, such as the names of friends and family members.
After training, the word probabilities (also known as likelihood functions) are used to compute the probability that an email with a particular set of words in it belongs to either category. Each word in the email contributes to the email's spam probability, or only the most interesting words. This contribution is called the posterior probability and is computed using Bayes' theorem. Then, the email's spam probability is computed over all words in the email, and if the total exceeds a certain threshold (say 95%), the filter will mark the email as a spam.
As in any other spam filtering technique, email marked as spam can then be automatically moved to a "Junk" email folder, or even deleted outright. Some software implement quarantine mechanisms that define a time frame during which the user is allowed to review the software's decision.
The initial training can usually be refined when wrong judgements from the software are identified (false positives or false negatives). That allows the software to dynamically adapt to the ever evolving nature of spam.
Some spam filters combine the results of both Bayesian spam filtering and other heuristics (pre-defined rules about the contents, looking at the message's envelope, etc.), resulting in even higher filtering accuracy, sometimes at the cost of adaptiveness.
Read more about this topic: Bayesian Spam Filtering
Famous quotes containing the word process:
“At the heart of the educational process lies the child. No advances in policy, no acquisition of new equipment have their desired effect unless they are in harmony with the child, unless they are fundamentally acceptable to him.”
—Central Advisory Council for Education. Children and Their Primary Schools (Plowden Report)
“... in the working class, the process of building a family, of making a living for it, of nurturing and maintaining the individuals in it costs worlds of pain.”
—Lillian Breslow Rubin (b. 1924)
“That which endures is not one or another association of living forms, but the process of which the cosmos is the product, and of which these are among the transitory expressions.”
—Thomas Henry Huxley (182595)