Bayesian Spam Filtering - Process

Process

Particular words have particular probabilities of occurring in spam email and in legitimate email. For instance, most email users will frequently encounter the word "Viagra" in spam email, but will seldom see it in other email. The filter doesn't know these probabilities in advance, and must first be trained so it can build them up. To train the filter, the user must manually indicate whether a new email is spam or not. For all words in each training email, the filter will adjust the probabilities that each word will appear in spam or legitimate email in its database. For instance, Bayesian spam filters will typically have learned a very high spam probability for the words "Viagra" and "refinance", but a very low spam probability for words seen only in legitimate email, such as the names of friends and family members.

After training, the word probabilities (also known as likelihood functions) are used to compute the probability that an email with a particular set of words in it belongs to either category. Each word in the email contributes to the email's spam probability, or only the most interesting words. This contribution is called the posterior probability and is computed using Bayes' theorem. Then, the email's spam probability is computed over all words in the email, and if the total exceeds a certain threshold (say 95%), the filter will mark the email as a spam.

As in any other spam filtering technique, email marked as spam can then be automatically moved to a "Junk" email folder, or even deleted outright. Some software implement quarantine mechanisms that define a time frame during which the user is allowed to review the software's decision.

The initial training can usually be refined when wrong judgements from the software are identified (false positives or false negatives). That allows the software to dynamically adapt to the ever evolving nature of spam.

Some spam filters combine the results of both Bayesian spam filtering and other heuristics (pre-defined rules about the contents, looking at the message's envelope, etc.), resulting in even higher filtering accuracy, sometimes at the cost of adaptiveness.

Read more about this topic:  Bayesian Spam Filtering

Famous quotes containing the word process:

    A process in the weather of the world
    Turns ghost to ghost; each mothered child
    Sits in their double shade.
    A process blows the moon into the sun,
    Pulls down the shabby curtains of the skin;
    And the heart gives up its dead.
    Dylan Thomas (1914–1953)

    In contrast to revenge, which is the natural, automatic reaction to transgression and which, because of the irreversibility of the action process can be expected and even calculated, the act of forgiving can never be predicted; it is the only reaction that acts in an unexpected way and thus retains, though being a reaction, something of the original character of action.
    Hannah Arendt (1906–1975)

    A man had better starve at once than lose his innocence in the process of getting his bread.
    Henry David Thoreau (1817–1862)