Re: SPAM: Bayesian filtering

From: Christian Weisgerber (naddy@mips.inka.de)
Date: Sun Aug 18 2002 - 11:26:18 MDT


Robert J. Bradbury <bradbury@aeiveos.com> wrote:

> http://www.paulgraham.com/spam.html

Very interesting. I suppose a lot of people have had this idea
kicking around in their heads, but this is the first time I've heard
of somebody actually sitting down and doing the work, i.e. writing
the implementation, tuning the parameters, and reporting success.

Some random comments:

I don't actually delete the spam I receive. I save it to a special
junk folder. Since September 1999, I have collected some 7800
messages, totalling 79MB, of spam and similar garbage, such as MS
Outlook worms. I already have the material to seed an individual
filter.
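
To make "seeding a filter" concrete, here is a minimal sketch of
training per-token spam probabilities from two saved piles of mail
and scoring a new message. It is a simplified take on the general
idea, not the linked article's exact method: the tokenizer, the
add-one smoothing, the 0.5 default for unseen tokens and the names
train() and spam_score() are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of letters, digits, and a few symbols.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize(text):
    return [t.lower() for t in TOKEN_RE.findall(text)]


def train(spam_msgs, ham_msgs):
    """Return {token: estimated probability that a message containing it is spam}."""
    spam_counts, ham_counts = Counter(), Counter()
    for msg in spam_msgs:
        spam_counts.update(set(tokenize(msg)))  # count each token once per message
    for msg in ham_msgs:
        ham_counts.update(set(tokenize(msg)))
    n_spam, n_ham = len(spam_msgs), len(ham_msgs)
    probs = {}
    for tok in set(spam_counts) | set(ham_counts):
        # Add-one smoothing keeps every probability strictly between 0 and 1.
        s = (spam_counts[tok] + 1) / (n_spam + 2)
        h = (ham_counts[tok] + 1) / (n_ham + 2)
        probs[tok] = s / (s + h)
    return probs


def spam_score(text, probs, unknown=0.5, top_n=15):
    """Combine the top_n most decisive token probabilities, naive Bayes style."""
    ps = [probs.get(t, unknown) for t in set(tokenize(text))]
    ps.sort(key=lambda p: abs(p - 0.5), reverse=True)
    prod_spam = prod_ham = 1.0
    for p in ps[:top_n]:
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)
```

With a real junk folder, spam_msgs would be the saved spam and
ham_msgs a comparable sample of legitimate mail; a message would be
flagged when spam_score() exceeds some threshold chosen by tuning.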

Forget simple sender-based whitelists. The spammers are already
exploiting sociograms, probably culled from public mailing list
archives. E.g., several OpenBSD developers report that they regularly
get spam that purports to be from other OpenBSD developers.
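
A small sketch of why this fails: a sender-based whitelist usually
just compares the From: header against known addresses, and that
header is entirely under the sender's control. The addresses, the
WHITELIST set and the function name below are hypothetical.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical whitelist of trusted correspondents.
WHITELIST = {"alice@example.org"}


def sender_whitelisted(raw_message, whitelist=WHITELIST):
    """Naive check: trust whatever address the From: header claims."""
    msg = message_from_string(raw_message)
    _, addr = parseaddr(msg.get("From", ""))
    return addr.lower() in whitelist


# A spammer who knows who mails whom simply forges a trusted address.
forged = (
    "From: Alice <alice@example.org>\n"
    "Subject: Make money fast\n"
    "\n"
    "Click here.\n"
)
```

Here sender_whitelisted(forged) returns True although the message
never came from the real owner of that address.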

I see the author has already hit on the idea of extending his
approach to Markov chains.
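
One simple reading of that extension, offered as an assumption and
not as what the article specifies, is to treat adjacent word pairs
as tokens instead of single words, so that the filter sees a little
of the word order. The helper name word_pairs() is made up for this
sketch.

```python
import re

# Assumed tokenizer: runs of letters, digits, and a few symbols.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def word_pairs(text):
    """Return adjacent (previous word, next word) pairs, lowercased."""
    words = [w.lower() for w in TOKEN_RE.findall(text)]
    return list(zip(words, words[1:]))
```

For example, word_pairs("Make money fast") yields the pairs
("make", "money") and ("money", "fast"), which could then be counted
per corpus just as single words would be.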

As for accumulating a giant corpus of spam, there are lots of spam
fighters out there who maintain extensive collections of all the
spam they intercept and who are more than willing to share it.

Generally, you cannot rely on cooperative spam filtering. Vipul's
Razor is such a project, and pretty much from the beginning some
idiots tagged every message from legitimate mailing lists such as
BugTraq as spam and fed this into the network. For everything you
may be willing to build, there'll be a vandal who will want to smash
it to pieces, just for destruction's sake. Sad but true.

The appendix on defining spam triggers a trip down memory lane.
Actually, the widely agreed-on, self-explanatory term in use for
many years is "unsolicited bulk e-mail" (UBE). And yes, in the old
days when "spam" still referred to Canter & Siegel-style USENET
abuse, we did get non-commercial UBE that espoused religious views
or urged us to vote such-and-such in some Californian election.

> Now, if there were just such a filtering system for Linux
> machines...

So sit down and write it. Or pay somebody to do so. No? Obviously
your cost of suffering from spam is still less than that of funding
a solution. We need more spam!

Actually, now that the idea has proven itself a first time, I expect
it to spread to Unix-based filters sooner rather than later.

-- 
Christian "naddy" Weisgerber                          naddy@mips.inka.de

