Operation
While other filters have applied statistical Bayesian filtering to the frequency of single-word occurrences in email, CRM114 achieves a higher rate of spam recognition by creating hits based on phrases up to five words in length. These phrases are used to form a Markov random field representing the incoming text. With this additional contextual recognition, it is one of the more accurate spam filters available. Initial testing in 2002 by author Bill Yerazunis gave 99.87% accuracy; Holden and TREC 2005 and 2006 gave results of better than 99%, with significant variation depending on the particular corpus.
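As a rough illustration of phrase-based features, the sketch below slides over a tokenized message and emits every contiguous phrase of one to five words. It is a simplified assumption for exposition, not CRM114's actual feature generator: the function name `phrase_features`, the whitespace tokenization, and the restriction to contiguous phrases are all hypothetical choices.

```python
def phrase_features(text, window=5):
    """Return every contiguous phrase of 1..window words found in text.

    Hypothetical helper for illustration only; it is not CRM114's
    real feature extraction.
    """
    words = text.split()  # naive whitespace tokenization (an assumption)
    features = []
    for start in range(len(words)):
        for length in range(1, window + 1):
            if start + length > len(words):
                break  # phrase would run past the end of the message
            features.append(" ".join(words[start:start + length]))
    return features


if __name__ == "__main__":
    for feature in phrase_features("buy cheap pills now"):
        print(feature)
```

A single-word filter would see only the four individual words of this message, while the phrase-based view also records combinations such as "cheap pills", which carry the extra context the paragraph above describes.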
CRM114's classifier can also be switched to use Littlestone's Winnow algorithm, character-by-character correlation, a variant of k-nearest neighbor (KNN) classification called Hyperspace, a bit-entropic classifier that uses entropy encoding to determine similarity, a support vector machine (SVM), mutual compressibility as calculated by a modified LZ77 algorithm, and other more experimental classifiers.
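The idea behind classification by mutual compressibility can be sketched with a general-purpose compressor: a text that shares material with a corpus adds little to the compressed size when appended to it. The example below uses Python's `zlib` (DEFLATE) and the normalized compression distance as stand-ins; it does not reproduce CRM114's modified LZ77 classifier, and the names `ncd` and `classify` as well as the toy corpora are assumptions made for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance: smaller means more shared structure."""
    size_a = compressed_size(a)
    size_b = compressed_size(b)
    size_ab = compressed_size(a + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)


def classify(text: bytes, corpora: dict) -> str:
    """Return the label of the corpus that text compresses best against."""
    return min(corpora, key=lambda label: ncd(corpora[label], text))


if __name__ == "__main__":
    # Toy corpora, invented for this sketch.
    corpora = {
        "spam": b"buy cheap pills now " * 20,
        "good": b"meeting agenda attached for review " * 20,
    }
    print(classify(b"buy cheap pills now", corpora))
```

Real corpora are far larger and noisier than these toy strings, so the choice of compressor and its window size would matter much more in practice than it does here.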
The CRM114 algorithms are multi-lingual and null-safe. A voting set of CRM114 classifiers has been demonstrated to distinguish confidential from non-confidential documents written in Japanese with a detection rate better than 99.9% and a false alarm rate of 5.3%.
CRM114 is a good example of pattern recognition software, demonstrating how machine learning can be accomplished with a reasonably simple algorithm. The program's C source code is available under the GPL.
At a deeper level, CRM114 is also a string pattern matching language, similar to grep or even Perl. Although it is Turing complete, it is highly tuned for matching text, and even a simple (recursive) definition of the factorial takes almost ten lines. Part of this is because the CRM114 language syntax is not positional, but declensional. As a programming language, it may be used for many applications other than detecting spam. CRM114 uses the TRE approximate-match regex engine, so it is possible to write programs that do not depend on exactly identical strings matching in order to function correctly.
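To make the notion of approximate matching concrete, the sketch below checks whether a literal pattern occurs anywhere in a text with at most a given number of single-character edits (insertions, deletions, or substitutions). It uses a standard edit-distance dynamic program for substring search, not the TRE library and not CRM114 syntax; the function name `approx_find` is hypothetical, and full regular expressions are out of scope.

```python
def approx_find(pattern: str, text: str, max_errors: int) -> bool:
    """Return True if some substring of text is within max_errors edits of pattern."""
    # prev[i] is the cheapest way to match pattern[:i] against a substring
    # ending at the previous text position. Before any text is read,
    # matching pattern[:i] costs i deletions.
    prev = list(range(len(pattern) + 1))
    if prev[-1] <= max_errors:
        return True
    for ch in text:
        cur = [0]  # an empty pattern prefix matches anywhere at zero cost
        for i, p in enumerate(pattern, 1):
            cur.append(min(
                prev[i] + 1,              # extra character in the text
                cur[i - 1] + 1,           # pattern character missing from the text
                prev[i - 1] + (p != ch),  # match or substitution
            ))
        if cur[-1] <= max_errors:
            return True
        prev = cur
    return False


if __name__ == "__main__":
    print(approx_find("viagra", "cheap v1agra here", 1))
    print(approx_find("viagra", "cheap v1agra here", 0))
```

Allowing one error lets the pattern tolerate a single obfuscated character, which is the kind of robustness the paragraph above attributes to approximate matching.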