NOTE: The following rules are tentative and subject to change. The rules for active learning have not yet been finalized and therefore are not included below. All rule updates will be posted here and announced on the mailing list.
The CEAS Spam-Filter Challenge Live Spam Task is an online anti-spam filtering competition in which anti-spam filters are tested on a live stream of spam and ham messages. Each filter will be asked to process a live e-mail stream and label each incoming message as spam or ham. The competition environment accurately simulates typical e-mail installations so that most existing anti-spam systems can be used essentially out of the box.
Each participant will be assigned a subdomain of ceas-challenge.cc to filter. The competition e-mail stream will be multiplexed to each participating filter such that each filter receives an essentially identical set of messages. The messages received by each filter will only differ in that e-mail addresses will be rewritten to match the subdomain assigned to the destination filter.
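The rules do not specify how addresses are rewritten, so the following is only a minimal sketch of the idea. It assumes the local part of an address is kept and only the domain is replaced. The function names and the subdomains team1.ceas-challenge.cc and team2.ceas-challenge.cc are hypothetical and chosen for illustration.

```python
def rewrite_address(address, subdomain):
    """Replace the domain of an e-mail address with a participant's subdomain.

    Assumption: the local part is preserved unchanged. The real Competition
    Controller may rewrite addresses differently.
    """
    if "@" not in address:
        raise ValueError(f"not an e-mail address: {address!r}")
    local_part, _, _old_domain = address.rpartition("@")
    return f"{local_part}@{subdomain}"


def multiplex(recipient, subdomains):
    """Map each participant subdomain to its rewritten copy of one recipient."""
    return {sub: rewrite_address(recipient, sub) for sub in subdomains}


# Hypothetical participant subdomains, for illustration only.
participants = ["team1.ceas-challenge.cc", "team2.ceas-challenge.cc"]
for subdomain, rewritten in multiplex("alice@example.com", participants).items():
    print(subdomain, "->", rewritten)
```

Under these assumptions every filter sees the same message, differing only in the domain part of the rewritten addresses.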
The test e-mail stream will be collected from several production SMTP servers and relayed to the Competition Controller. The Competition Controller will modify each message to appear to be addressed to each contestant's subdomain and then relay it to that contestant. The relayed message will be as close as possible to what the anti-spam filter would have received if the message had actually been sent to the perimeter server for the simulated domain and relayed to the anti-spam filter. Every attempt will be made to preserve the original SMTP "HELO", "MAIL From", and "RCPT To" commands as originally presented to the capturing server. If we are unable to preserve the original headers, simulated headers will be provided that respect the properties most likely to be used by anti-spam systems. The properties preserved by the simulated headers include, but are not limited to, DNS, rDNS, DNSRBL, and SPF data for all envelope addresses. Unfortunately, due to the anonymization process, we cannot support DKIM signatures. All DKIM headers will be stripped from the test stream.
Exactly one Received header will be added to the captured message. The "from part" of the Received header will record the SMTP "HELO" and connection information using a Sendmail-style received line. For example,
Received: from spammer.bulkmail.com (openrelay.com [1.2.3.4])
by ceas-challenge.cc (8.13.1/8.13.1)
with ESMTP ID l68FDvK031975; Mon, 16 Jul 2007 11:13:57 -0400
indicates a HELO of "spammer.bulkmail.com", a connecting IP address of 1.2.3.4, and a reverse DNS check on 1.2.3.4 that returned "openrelay.com". The first Received line can safely be used for IP blacklisting or SPF checks.
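As a rough illustration of how a filter might read these fields, the sketch below extracts the HELO name, the reverse DNS name, and the connecting IP address from the "from part" of a Received header, and builds the query name used by the common IPv4 DNS blacklist convention (reversed octets followed by the list's zone). The regular expression assumes the exact layout of the example above. The function names and the zone dnsbl.example.org are hypothetical, and no DNS lookup is performed.

```python
import re

# Matches the "from part": from HELO (RDNS [IP])
# Assumption: the reverse DNS name always precedes the bracketed IP address,
# as in the example header from the rules.
RECEIVED_FROM = re.compile(
    r"from\s+(?P<helo>\S+)\s+\((?P<rdns>\S+)\s+\[(?P<ip>[0-9a-fA-F.:]+)\]\)"
)


def parse_received_from(header_value):
    """Return (helo, rdns, ip) from a Received header value, or None."""
    match = RECEIVED_FROM.search(header_value)
    if match is None:
        return None
    return match.group("helo"), match.group("rdns"), match.group("ip")


def dnsbl_query_name(ipv4, zone):
    """Build the DNS name to look up for an IPv4 address in a blacklist zone."""
    octets = ipv4.split(".")
    if len(octets) != 4:
        raise ValueError(f"not an IPv4 address: {ipv4!r}")
    return ".".join(reversed(octets)) + "." + zone


example = (
    "from spammer.bulkmail.com (openrelay.com [1.2.3.4])\n"
    "\tby ceas-challenge.cc (8.13.1/8.13.1)\n"
    "\twith ESMTP ID l68FDvK031975; Mon, 16 Jul 2007 11:13:57 -0400"
)

helo, rdns, ip = parse_received_from(example)
print(helo, rdns, ip)
# dnsbl.example.org is a placeholder zone, not a real blacklist.
print(dnsbl_query_name(ip, "dnsbl.example.org"))
```

A production filter would more likely rely on its mail server's own header parsing. The sketch only shows which parts of the line carry the connection information.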
Most messages will be delivered within minutes of their original receipt. The test stream may contain some previously recorded messages. Recorded messages will also be updated so that they contain appropriate dates, IP addresses, and DKIM signatures, ensuring that most widely known anti-spam algorithms will behave correctly.
Filters will be evaluated using two measures, each of which combines the percentage of spam blocked with the filter's false positive rate:
Filters that cannot provide a numeric score are welcome; however, such filters will be evaluated using only the LAM measure. The LAM calculation will be smoothed. The exact formula used for the competition will be:
FPrate = (#ham-errors + 0.5) / (#ham + 0.5)
FNrate = (#spam-errors + 0.5) / (#spam + 0.5)
lam = invlogit((logit(FPrate) + logit(FNrate)) / 2)
where
logit(p) = log(p/(1-p))
invlogit(x) = (e^x)/(1 + e^x)
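The formulas above translate directly into code. The following is a minimal sketch, not the official scoring script. The function and argument names are chosen here for illustration, and log is taken to be the natural logarithm so that it matches the e^x in invlogit.

```python
import math


def logit(p):
    """logit(p) = log(p / (1 - p)), using the natural logarithm."""
    return math.log(p / (1.0 - p))


def invlogit(x):
    """invlogit(x) = e^x / (1 + e^x)."""
    return math.exp(x) / (1.0 + math.exp(x))


def smoothed_lam(ham_errors, ham, spam_errors, spam):
    """Smoothed LAM computed from error counts and message counts."""
    fp_rate = (ham_errors + 0.5) / (ham + 0.5)
    fn_rate = (spam_errors + 0.5) / (spam + 0.5)
    return invlogit((logit(fp_rate) + logit(fn_rate)) / 2.0)


# Made-up counts: 3 of 1000 ham and 40 of 4000 spam misclassified.
print(smoothed_lam(ham_errors=3, ham=1000, spam_errors=40, spam=4000))
```

When the two smoothed rates are equal, LAM equals that common rate. Note that a filter that misclassifies every message of one class yields a rate of exactly 1 under this formula, for which logit is undefined; the sketch raises ZeroDivisionError in that case rather than guessing how the competition would handle it.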