[plug] OT: SPAM

Craig Ringer craig at postnewspapers.com.au
Mon May 3 00:40:25 WST 2004


On Mon, 2004-05-03 at 00:19, Craig Ringer wrote:
> On Sun, 2004-05-02 at 20:50, Michael Collard wrote:
> > After training the bayesian filtering the detection rate is 100% on that
> > sample :)
> > 
> > Can anyone else offer a pile of SPAM (100 - 300 messages) to test and
> > train spamassassin?
> 
> http://www.postnewspapers.com.au/~craig/files/55a90398e1f3a1c1bd5a7e62bc362dc2/DeSAJunk.bz2
> 
> About 800 messages, so it's a bit over what you wanted, but it's not
> hard to chop the mailbox up. That way you can use different chunks for
> different runs. It's only 1.8MB, so it's not a big download. 
> 
> I think I got all the SpamAssassin, ClamAV, MimeDefang etc headers out
> of it, but haven't checked in detail. 

You'll want to grep out the 'X-Sieve' headers, too:

egrep -v '^X-Sieve: CMU Sieve 2.2 ' DeSAJunk
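
For the chopping-up mentioned in the quoted message, something along these lines might do. It is only a rough sketch: strip_sieve and split_mbox are made-up helper names, the chunk size and output prefix are arbitrary, and it assumes a plain mbox where every message starts with a "From " line. Note that it drops any X-Sieve header, which is slightly broader than the pattern above.

```shell
#!/bin/sh
# Sketch only: strip_sieve and split_mbox are hypothetical helper names.

# Print the mailbox to stdout without any X-Sieve header lines.
# (Assumes the header sits on a single line, i.e. is not folded.)
strip_sieve() {
    grep -v '^X-Sieve:' "$1"
}

# Split mbox $1 into files of at most $2 messages each, named
# <prefix>.000, <prefix>.001, ... where <prefix> is $3.
# A new message is assumed to begin at every line starting with "From ".
split_mbox() {
    awk -v size="$2" -v prefix="$3" '
        /^From / { n++ }
        n > 0 {
            out = sprintf("%s.%03d", prefix, int((n - 1) / size))
            # Close the previous chunk when moving on to the next one.
            if (out != prev) { if (prev != "") close(prev); prev = out }
            print > out
        }
    ' "$1"
}
```

Usage would be roughly: bunzip2 DeSAJunk.bz2; strip_sieve DeSAJunk > DeSAJunk.clean; split_mbox DeSAJunk.clean 200 chunk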

BTW, that's my spam mailbox for April. It's grown by ~100
messages/(month^2) over the last few months, which I'm less than
impressed by. OTOH, fewer and fewer arrive in my INBOX every month thanks
to SpamAssassin :-)

I've just realised that that mailbox might not be the best base to train
your filters on anyway. My mail system strips out HTML parts from
messages that include both text/plain and text/html alternatives, and if
yours does not, this will give SpamAssassin a skewed view of what "spam"
looks like. It's worth using it to check the effectiveness of your
filters, but I wouldn't recommend using it with sa-learn unless you plan
to use MimeDefang or similar.
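
One crude way to see whether a mailbox has had its HTML parts stripped is to compare the number of messages with the number of text/html parts. The helper names below are made up for illustration, and the check assumes unfolded Content-Type headers in a plain mbox, so treat the numbers as a rough indication only.

```shell
#!/bin/sh
# Sketch only: count_messages and count_html_parts are hypothetical names.

# Number of messages, taken as the number of "From " separator lines.
count_messages() {
    grep -c '^From ' "$1"
}

# Number of MIME parts declared as text/html (case-insensitive match).
# grep -c still prints 0 when nothing matches, but exits non-zero.
count_html_parts() {
    grep -ci '^Content-Type:[[:space:]]*text/html' "$1"
}
```

Running both on your own spam folder and on DeSAJunk should show whether the two look alike in this respect; a count of HTML parts near zero suggests the parts have been stripped.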

Craig Ringer



