2nd February 2004 — Spam Filtering

Iain and Jaz are going to lead a look at Spam Filtering. This is intended as an informal look at an interesting adversarial classification problem. Please bring your ideas so that we can look not only at where spam filtering is but also at where, in our humble opinions, it should go.

Here is one overview of some of the filtering techniques out in the wild:

Learning Spam: Simple Techniques for Freely-Available Software, by Bart Massey, Mick Thomure, Raya Budrevich, and Scott Long, Portland State University (abstract, pdf).

This might be of interest: Conference on Email and Anti-Spam (CEAS)

Look at a list of tricks employed by spammers. The adversarial aspect of the problem isn't necessarily insurmountable, but it cannot be ignored!

So-called "Bayesian Spam Filtering" has received a lot of attention. In fact, the number-one hit on Google for "Bayesian" is currently about spam filtering. It's sad but true that many people have only heard of "Bayesian" in this context and are unaware of what Bayesian statistics is actually about.
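For anyone who has not seen what usually sits behind the "Bayesian" label, here is a minimal sketch of a naive Bayes word-count classifier, which is roughly what such filters compute. It is an illustration under simplifying assumptions (whitespace tokenisation, add-one smoothing, unseen words skipped), not the method of any particular filter; the function names `train` and `spam_probability` and the toy messages are made up for this example.

```python
import math
from collections import Counter


def train(messages):
    """Count words per class from (text, is_spam) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for text, is_spam in messages:
        class_counts[is_spam] += 1
        word_counts[is_spam].update(text.lower().split())
    return word_counts, class_counts


def spam_probability(text, word_counts, class_counts):
    """Posterior P(spam | words) under the naive independence assumption."""
    vocab = set(word_counts[True]) | set(word_counts[False])
    n_messages = sum(class_counts.values())
    log_score = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        # Start from the log prior P(class).
        score = math.log(class_counts[label] / n_messages)
        for word in text.lower().split():
            if word not in vocab:
                continue  # assumption: ignore words never seen in training
            # Add-one (Laplace) smoothed estimate of P(word | class).
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)
        log_score[label] = score
    # Normalise in a numerically stable way.
    top = max(log_score.values())
    weights = {label: math.exp(s - top) for label, s in log_score.items()}
    return weights[True] / (weights[True] + weights[False])


# Toy training set, invented for illustration only.
toy_messages = [
    ("cheap pills buy now", True),
    ("buy cheap watches", True),
    ("meeting agenda for monday", False),
    ("lunch on monday", False),
]
word_counts, class_counts = train(toy_messages)
print(spam_probability("cheap pills", word_counts, class_counts))
print(spam_probability("monday meeting", word_counts, class_counts))
```

Note that the only Bayesian ingredient here is Bayes' rule applied to point estimates of word frequencies; there is no prior over the model parameters beyond the smoothing, which is part of why the name is a little misleading.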

Have a glance at the heuristics used in SpamAssassin. See also technical measures, e.g. hashcash, and legal measures such as the Habeas haiku. Other software includes ifile and POPFile.
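To give a flavour of the hashcash idea mentioned above, here is a simplified proof-of-work sketch: the sender must search for a counter that makes a SHA-1 hash start with a given number of zero bits, which is costly to find but cheap to check. The `resource:counter` stamp layout and the names `mint`, `verify` and `leading_zero_bits` are assumptions made for this illustration; the real hashcash stamp has more fields (version, date, random salt) and is not reproduced here.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Number of zero bits at the start of a byte string."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits


def mint(resource: str, bits: int = 12) -> str:
    """Brute-force a counter until the stamp's SHA-1 has enough zero bits."""
    for counter in count():
        stamp = f"{resource}:{counter}"
        digest = hashlib.sha1(stamp.encode()).digest()
        if leading_zero_bits(digest) >= bits:
            return stamp


def verify(stamp: str, resource: str, bits: int = 12) -> bool:
    """A single hash suffices to check a stamp for the given recipient."""
    if not stamp.startswith(resource + ":"):
        return False
    digest = hashlib.sha1(stamp.encode()).digest()
    return leading_zero_bits(digest) >= bits


# Hypothetical recipient address, used only for the demonstration.
stamp = mint("alice@example.org", bits=12)
print(stamp, verify(stamp, "alice@example.org", bits=12))
```

Each extra required bit doubles the expected minting work, so the cost per message can be tuned; whether that cost actually deters spammers with access to other people's machines is one of the things worth discussing.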