Saturday, May 17, 2008 2:06 PM bart

Why SPAM filters need spell checkers

On my corporate mail address I get almost no SPAM at all. On my private address, which runs on some Belgian hoster, I get a bunch of it, and to some extent I really enjoy doing the manual SPAM filtering (I've gotten really good at it already). However, this one caught my attention today:

Bacheelor, MasteerMBA, and Doctoraate diplomas available in the field of your choice that's right, you can even become a Doctor and receive all the benefits that comes with it!

The three spelling mistakes are clearly there to mislead SPAM filters. So I decided to do a little test, resending the same text with the spelling corrected through a plain telnet session like this:

telnet smtp.myundisclosedISP.com 25
HELO someirrelevantdomain.com
MAIL FROM:<same address as the original spam mail>
RCPT TO:<my undisclosed mail address>
DATA
Bachelor, MasterMBA, and Doctorate diplomas available in the field of your choice that's right, you can even become a Doctor and receive all the benefits that comes with it! 
.
QUIT

Guess what? The correctly spelled mail ended up in the webmail SPAM folder. It looks like putting a spell checker in a SPAM filter, and bringing all of the checker's word suggestions into the equation for the SPAM score, would be a good idea. On to reading even more generous offers in my inbox now :-)...
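As a rough sketch of that idea (not how any real filter works), the Python below uses `difflib.get_close_matches` from the standard library as a stand-in for a spell checker's suggestions. The keyword weights, the `spam_score` name and the 0.85 cutoff are all assumptions made up for illustration.

```python
import difflib
import re

# Hypothetical keyword weights; a real filter would learn these from data.
SPAM_WEIGHTS = {"bachelor": 1.0, "master": 1.0, "doctorate": 1.5, "diplomas": 1.5}


def spam_score(text, cutoff=0.85):
    """Sum the weights of spam keywords, counting near-misspellings too."""
    score = 0.0
    for token in re.findall(r"[a-z]+", text.lower()):
        # Ask for the single known keyword most similar to this token,
        # ignoring anything below the similarity cutoff.
        match = difflib.get_close_matches(token, SPAM_WEIGHTS, n=1, cutoff=cutoff)
        if match:
            score += SPAM_WEIGHTS[match[0]]
    return score


misspelled = spam_score("Bacheelor and Doctoraate diplomas")
corrected = spam_score("Bachelor and Doctorate diplomas")
print(misspelled, corrected)
```

With these assumed weights, the misspelled and the corrected wording get the same score, which is the whole point: the extra letters no longer hide the keywords. A lower cutoff catches more creative spellings but also starts matching innocent words.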

Comments

# re: Why SPAM filters need spell checkers

Saturday, May 17, 2008 2:34 PM by Knaģis

I would rather say that the spam checker should check the spelling and count misspelled versus correctly spelled words to detect spam. My usual spam consists mainly of words spelled so badly that I sometimes have trouble understanding it. So if every other word in the message has a spelling error, it is 100% spam.
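A minimal sketch of that counting approach in Python follows. The tiny `KNOWN_WORDS` set stands in for a real dictionary, and the function names and the 0.5 threshold ("every other word") are assumptions for illustration only.

```python
import re

# Hypothetical mini dictionary; a real check would load a full word list.
KNOWN_WORDS = {"cheap", "diplomas", "available", "now", "in", "your", "field"}


def misspelled_ratio(text, known_words=KNOWN_WORDS):
    """Return the fraction of words in text that are not in the dictionary."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in known_words)
    return unknown / len(tokens)


def looks_like_spam(text, threshold=0.5):
    """Flag the message when at least `threshold` of its words are unknown."""
    return misspelled_ratio(text) >= threshold


print(misspelled_ratio("cheep diplomaas available now"))
print(looks_like_spam("cheep diplomaas available now"))
```

Note that names, jargon and mail in another language would also count as "misspelled" here, so a ratio like this is probably better used as one input to a score than as a verdict on its own.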

# re: Why SPAM filters need spell checkers

Sunday, May 18, 2008 1:36 PM by David Arno

A problem with this idea is that the spammers don't sit on their backsides; they are constantly reacting to new features in spam filters. So spam filters running spell-checkers would simply prompt the spammers to use spellings that mislead the spell-checkers. For example, the British English spell checker I have installed on Firefox offers "bachelor" as an alternative to "bacheelor", but not to "batcheelar".

# re: Why SPAM filters need spell checkers

Monday, May 19, 2008 2:15 AM by Wim

I think people haven't used spelling suggestions for this problem because most spelling checkers are a bit "optimistic": they suggest words that barely relate to the word the user typed in. Granted, this could be fixed by lowering the threshold for suggestions, but I still think it would introduce a lot of false positives.

I know you're going to dislike me saying this, but I rather like Google's approach: they keep a distributed database of all mails marked as spam by users, integrate some kind of scoring system and use this to mark spam. I've never had a false positive yet, and I only get one spam mail every 200 or 300 real mails.

Maybe there should be some sort of centralized database which smaller mail providers, such as small to medium sized companies, could use to get a similar system. Just thinking out loud.

I'm sure Microsoft could pull this off on their own, especially if they use their Live mail to gather statistics.

# re: Why SPAM filters need spell checkers

Monday, May 19, 2008 3:17 AM by bart

Hi folks,

Thanks for the comments. Obviously, I didn't follow the links in that particular mail, so I haven't ordered my PhD in anti-spam technologies yet.

It's indeed the case - as usual - that the bad guys have plenty of time to be creative while the poor implementers of protective mechanisms have to work under various constraints. At the very least, it keeps us sharp. On my Exchange 2007 based corporate mailbox, I haven't seen a single spam message entering my Outlook just yet, so I've the feeling we're on the right track :-).

The approach of centralized databases is appealing as well; anti-phishing in IE7+ is based on similar concepts.

But there's obviously no silver bullet; here too the principle of defense in depth applies, which can be mapped somewhat onto the OSI model. On the transport/network layer, technologies like SenderID can help to restrict mail routing between domains (essentially fighting the shortcomings of the ancient SMTP protocol), while higher up in the application layer typical SPAM analysis mechanisms kick in, almost in a data mining fashion.

Cheers,

-Bart

# re: Why SPAM filters need spell checkers

Monday, May 19, 2008 11:31 AM by kfarmer

Bart: I often get spam in my corp box, though not nearly as much as at home (where I receive roughly 1000/day, if you count bounces due to spam spoofing my address).

Spell-checking would have to be fairly sophisticated. Perhaps add a heuristic that works something like:

- correct punctuation to a set of possible character matches:  "se||" to "seil", "seli", "seii", "sell"

- apply soundex

- spam check the soundex to mitigate spelling problems
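The three steps above could be sketched in Python roughly as follows. The `LOOKALIKES` table, the spam word list and the function names are assumptions made up for illustration; the `soundex` function is a plain implementation of the classic American Soundex rules.

```python
import itertools

# Hypothetical map from look-alike characters to the letters they may stand for.
LOOKALIKES = {"|": "il", "1": "il", "0": "o", "@": "a", "$": "s"}

# Standard Soundex digit groups.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def expand(token):
    """Step 1: replace look-alike characters by every letter they may mean."""
    choices = [LOOKALIKES.get(ch, ch) for ch in token.lower()]
    return {"".join(combo) for combo in itertools.product(*choices)}


def soundex(word):
    """Step 2: American Soundex code (first letter plus three digits)."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not separate equal codes
            prev = code
    return (result + "000")[:4]


# Step 3: compare against the Soundex codes of (hypothetical) spam words.
SPAM_CODES = {soundex(w) for w in ("sell", "bachelor", "doctorate")}


def matches_spam_word(token):
    return any(soundex(candidate) in SPAM_CODES for candidate in expand(token))


print(sorted(expand("se||")))     # ['seii', 'seil', 'seli', 'sell']
print(matches_spam_word("se||"))  # True
```

Soundex is coarse: an innocent word like "seal" gets the same code as "sell" (S400), so a match like this would need to feed into a score rather than decide on its own. It is also not immune to creative spelling: "bacheelor" maps to the same code as "bachelor", but "batcheelar" does not.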