Thomas Bayes, a Presbyterian minister and mathematician born just over 300 years ago, would be shocked to see most of the email messages that bid for our attention nowadays. He would be thrilled to know, however, that his statistical inference theorem has inspired a potent counterattack. An open source project called SpamBayes has emerged as a powerful weapon in the war on spam.
There are a few different implementations of SpamBayes. I’ll focus here on an Outlook add-in, written by renowned Python hacker Mark Hammond. I’ve been skeptical about the long-term prospects for content-based email filtering. But the Python-based SpamBayes engine, and Hammond’s brilliant add-in (also written in Python), are rapidly making me a believer.
Several email programs, including the Mail program bundled with Mac OS X, use Bayesian techniques to enable users to train their systems to distinguish between spam and non-spam (aka ham). Experts debate how the term Bayesian is relevant to this game of classification, but the core ideas in Paul Graham’s influential 2002 paper, A Plan for Spam, make sense intuitively. Every message bears evidence both for and against the hypothesis that it is spam. Your disposition of every message tests both hypotheses and systematically improves the filter’s ability to separate spam from ham.
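To make the idea concrete, here is a minimal sketch of a Graham-style filter. It is a simplified illustration, not the SpamBayes engine: the class name ToyFilter, the tokenizer, the 0.4 prior for unseen words, the 0.01 to 0.99 clamp and the 15-token limit are all assumptions chosen for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


class ToyFilter:
    """Hypothetical toy classifier; every name and constant here is illustrative."""

    def __init__(self):
        self.spam_counts = Counter()  # messages containing each token, spam side
        self.ham_counts = Counter()   # messages containing each token, ham side
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        tokens = set(tokenize(text))  # count each token once per message
        if is_spam:
            self.spam_counts.update(tokens)
            self.n_spam += 1
        else:
            self.ham_counts.update(tokens)
            self.n_ham += 1

    def token_prob(self, token):
        """Estimate the probability that a message containing this token is spam."""
        s = self.spam_counts[token]
        h = self.ham_counts[token]
        if s + h == 0:
            return 0.4  # assumed prior for a token never seen in training
        spam_rate = s / max(self.n_spam, 1)
        ham_rate = h / max(self.n_ham, 1)
        p = spam_rate / (spam_rate + ham_rate)
        return min(0.99, max(0.01, p))  # clamp so no single token is decisive

    def score(self, text):
        """Combine the strongest per-token clues into one spam probability."""
        probs = [self.token_prob(t) for t in set(tokenize(text))]
        probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
        spam_evidence = 1.0
        ham_evidence = 1.0
        for p in probs[:15]:  # only the 15 most telling tokens
            spam_evidence *= p
            ham_evidence *= 1.0 - p
        return spam_evidence / (spam_evidence + ham_evidence)


f = ToyFilter()
f.train("cheap pills buy now", is_spam=True)
f.train("buy cheap watches now", is_spam=True)
f.train("meeting agenda for tomorrow", is_spam=False)
f.train("python release notes attached", is_spam=False)
spam_score = f.score("buy cheap pills")
ham_score = f.score("agenda for python meeting")
```

Each call to train plays the role of your disposition of a message: it shifts the per-token estimates, so the same words can count for or against a message depending on whose mail trained the filter.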
As Graham pointed out, the judgments involved are highly individual. For example, the commercial email that I want to receive (or reject) will differ from the email you want (and don’t want) according to our interests and tastes. A filter that works on behalf of a large group, such as SpamAssassin, which checks and often rewrites my idg.com mail, or Cloudmark’s SpamNet (formerly Vipul’s Razor), which collaboratively builds a database of spam signatures, will typically agree with SpamBayes on what I call the Supreme Court definition of spam: You know it when you see it. What sets SpamBayes apart is its ability to learn, by observing your behavior, which messages you want to see and which you don’t.
If you use Outlook 2000 or Outlook XP, it’s easy, and free, to give the SpamBayes Outlook add-in a whirl. If you already have Python installed, you can acquire the source and set up SpamBayes and the add-in according to the usual conventions for open source packages. I did that, but because I’m well aware that typical Outlook users don’t have Python installed and won’t want to deal with an open source-style installation, I also tested the binary installer available at Starship Python. It worked beautifully, installing SpamBayes plus the subset of Python needed to run it.
SpamBayes appears as a toolbar item called Anti-Spam. To use the add-in effectively, you’ll need to point it to a pile of ham. These messages may simply be the contents of your inbox if you keep it squeaky clean. But they can also live in other folders. That’s great news, because I use Outlook’s filters aggressively to route messages from known correspondents to folders.
You’ll also need to point SpamBayes to a big pile of spam. In my case, that folder was called NotToMe, where an Outlook filter has long been accumulating messages that are neither To: nor CC: my primary email addresses. This simple rule is so effective at filtering spam that it was my sole defence until IDG installed SpamAssassin a few months ago. But lately, as I’m sure you’ve noticed, the volume of spam has exploded. Even with SpamAssassin, the hassle of plucking the few wanted messages from my NotToMe folder, plus the growing amount of spam sent to my primary email addresses (and not caught by SpamAssassin), spurred me to take the next step.
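The NotToMe rule itself is simple enough to sketch outside Outlook. The following uses Python's standard email package; the addresses in MY_ADDRESSES and the function name is_not_to_me are placeholders invented for the example, not anything from my actual Outlook setup.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical primary addresses; substitute your own.
MY_ADDRESSES = {"me@example.com", "me.alias@example.com"}


def is_not_to_me(raw_message, my_addresses=MY_ADDRESSES):
    """Return True when none of my addresses appears in the To: or Cc: headers."""
    msg = message_from_string(raw_message)
    # Header lookup is case-insensitive, so "Cc" also matches "CC".
    recipients = msg.get_all("To", []) + msg.get_all("Cc", [])
    addrs = {addr.lower() for _name, addr in getaddresses(recipients)}
    return addrs.isdisjoint(my_addresses)


direct = "From: a@example.org\nTo: Me <me@example.com>\nSubject: hi\n\nHello"
bulk = "From: x@example.net\nTo: list@example.net\nSubject: offer\n\nBuy now"
copied = "From: a@example.org\nTo: b@example.org\nCC: ME@example.com\n\nFYI"
```

A rule like this looks only at the envelope of a message, never its content, which is why it eventually needs help from a content-based filter.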
After you finish training, you designate another folder — I called mine MaybeSpam — for dubious messages. This third category is an extra wrinkle added by SpamBayes to the binary spam/ham technique spelled out in Paul Graham’s original paper. Messages can present conflicting evidence — that is, they score high (or low) for both ham and spam. In these cases, SpamBayes asks you for a ruling.
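One way to see how a message can score high for both ham and spam is chi-squared combining, in which the spam evidence and the ham evidence are measured separately and then compared. The sketch below is a minimal version under stated assumptions: the function names are mine, the per-word probabilities are made up, and I am assuming rather than asserting that the add-in works along these lines.

```python
import math


def chi2q(x2, v):
    """Survival function of the chi-squared distribution for even v."""
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(probs):
    """Return (score, spam_indicator, ham_indicator) for per-word spam probabilities.

    Each probability must lie strictly between 0 and 1.
    """
    n = len(probs)
    if n == 0:
        return 0.5, 0.0, 0.0  # no evidence either way
    # Strong spam clues make (1 - p) tiny, which drives the spam indicator up.
    spam_ind = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Strong ham clues make p tiny, which drives the ham indicator up.
    ham_ind = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    score = (spam_ind - ham_ind + 1.0) / 2.0
    return score, spam_ind, ham_ind


clear_spam = combine([0.99, 0.99, 0.99, 0.99, 0.99])
conflicted = combine([0.99, 0.99, 0.01, 0.01])
```

In the conflicted case both indicators come out high and cancel, leaving a score near the middle. That is the kind of message a third folder is for.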
So long, spam
Given this setup, you turn on filtering and observe over time. The add-in runs inbound messages to your inbox (or other designated folders) through the SpamBayes classifier. Then it routes what is certainly spam to the Spam folder, and what might be spam to the MaybeSpam folder. All other messages land in your inbox, or wherever your regular filters normally route them. But every message gets tagged with a user-defined field that stores its “spamminess” percentage. You can add this field to customised Outlook views of your folders, and sort on it — a useful way to gauge how well you’ve trained the system.
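The routing step amounts to comparing each message's score against two cutoffs. Here is a minimal sketch; the function name route and the cutoff values of 0.20 and 0.90 are illustrative assumptions, not settings I have verified in the add-in.

```python
def route(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam score between 0 and 1 to a folder name and a percentage."""
    percent = round(score * 100)  # the value stored in the spamminess field
    if score >= spam_cutoff:
        folder = "Spam"
    elif score > ham_cutoff:
        folder = "MaybeSpam"
    else:
        folder = "Inbox"  # or wherever your regular filters send it
    return folder, percent


certain = route(0.97)
dubious = route(0.50)
wanted = route(0.03)
```

Widening the gap between the two cutoffs sends more mail to MaybeSpam for review; narrowing it trusts the classifier more.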
When a wanted message lands in Spam or MaybeSpam, you use the Recover from Spam button to restore it to its original folder, and train it as a good message. Likewise, you use Delete As Spam to nuke an unwanted message that lands in one of your “good” folders, and train it as a bad message. Results, for me, were immediate and spectacular. SpamBayes nailed a number of spams that SpamAssassin let through. SpamAssassin was fooled by a penis enlargement ad in Spanish, for example, while SpamBayes nailed it. But other catches involve subtler discrimination. It appears that SpamBayes really can learn to distinguish between messages about legitimate products and services that I care about, and messages about equally legitimate stuff that doesn’t matter to me. Can messages that are merely off-target really be defined as spam? I won’t quibble. Life is short. If software can make my computer act like the intelligent assistant it’s supposed to be, bring it on.
Time will be the judge of SpamBayes. I’m optimistic, and there’s no shortage of spam sufferers rooting for it to succeed. I’ve already touted SpamBayes on my Weblog, which has drawn a lot of interesting commentary.
Meanwhile, a minor miracle has occurred. I actually look forward to fetching my email. Scanning the many messages landing in Spam, and marking them as read, is quick because there have so far been no — I repeat, no — false positives. I haven’t yet delegated ultimate power to my new assistant; I still review its decisions. But my confidence grows daily, and I’m close to routing the crap straight to the bit bucket where it belongs.
IDG Communications is the parent company of ARN.