[plug] OT: SPAM
Craig Ringer
craig at postnewspapers.com.au
Mon May 3 00:40:25 WST 2004
On Mon, 2004-05-03 at 00:19, Craig Ringer wrote:
> On Sun, 2004-05-02 at 20:50, Michael Collard wrote:
> > After training the bayesian filtering the detection rate is 100% on that
> > sample :)
> >
> > Can anyone else offer a pile of SPAM (100 - 300 messages) to test and
> > train spamassassin?
>
> http://www.postnewspapers.com.au/~craig/files/55a90398e1f3a1c1bd5a7e62bc362dc2/DeSAJunk.bz2
>
> About 800 messages, so it's a bit over what you wanted, but it's not
> hard to chop the mailbox up. That way you can use different chunks for
> different runs. It's only 1.8MB, so it's not a big download.
>
> I think I got all the SpamAssassin, ClamAV, MimeDefang etc headers out
> of it, but haven't checked in detail.
You'll want to grep out the 'X-Sieve' headers, too:

    egrep -v '^X-Sieve: CMU Sieve 2.2 ' DeSAJunk
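If you'd rather do the header stripping and the chopping-up in one pass, something like the following might work. It is only a rough sketch using Python's standard mailbox module, not what I actually ran. The function names, the chunk size of 200 and the output file naming are all made up for illustration. Note that, unlike the egrep above, it drops every X-Sieve header regardless of its value, and it only touches real headers, never body lines.

```python
import mailbox


def strip_header(msg, name="X-Sieve"):
    """Remove every occurrence of header `name` from `msg` (no error if absent)."""
    del msg[name]
    return msg


def split_mbox(src_path, dest_prefix, chunk_size=200, header="X-Sieve"):
    """Copy messages from src_path into dest_prefix.0, dest_prefix.1, ...

    Each output mbox holds at most `chunk_size` messages, with `header`
    stripped. Returns the list of paths written. All names and defaults
    here are hypothetical.
    """
    src = mailbox.mbox(src_path, create=False)
    paths = []
    chunk = None
    try:
        for i, msg in enumerate(src):
            if i % chunk_size == 0:
                # Start a new output mailbox every chunk_size messages.
                if chunk is not None:
                    chunk.close()
                path = f"{dest_prefix}.{i // chunk_size}"
                chunk = mailbox.mbox(path)
                paths.append(path)
            chunk.add(strip_header(msg, header))
    finally:
        if chunk is not None:
            chunk.close()
        src.close()
    return paths
```

Calling split_mbox("DeSAJunk", "DeSAJunk.part") should then leave you with DeSAJunk.part.0, DeSAJunk.part.1 and so on, which you can feed to separate training and test runs.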
BTW, that's my spam mailbox for April. Its growth has been accelerating
by ~100 messages/(month^2) over the last few months, which I'm less than
impressed by. OTOH, fewer and fewer arrive in my INBOX every month thanks
to SpamAssassin :-)
I've just realised that that mailbox might not be the best base to train
your filters on anyway. My mail system strips out HTML parts from
messages that include both text/plain and text/html alternatives. If
yours does not, training on this mailbox will give SpamAssassin a skewed
view of what "spam" looks like. It's worth using it to check the
effectiveness of your filters, but I wouldn't recommend using it with
sa-learn unless you plan to use MimeDefang or similar.
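One way to judge whether a corpus resembles your own mail is to count how many messages still carry a text/html part. The sketch below is only an illustration using Python's standard mailbox module. The function name is hypothetical, and it says nothing about how MimeDefang itself does the stripping.

```python
import mailbox


def html_part_stats(mbox_path):
    """Return (total, with_html) for the messages in an mbox file.

    `with_html` counts messages that contain at least one text/html part
    anywhere in their MIME structure.
    """
    total = with_html = 0
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            total += 1
            # walk() yields the message itself and every nested MIME part.
            if any(part.get_content_type() == "text/html" for part in msg.walk()):
                with_html += 1
    finally:
        box.close()
    return total, with_html
```

If your own spam folder shows a high proportion of HTML parts and this mailbox shows close to none, that is the skew described above.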
Craig Ringer