NOTE: The following rules are tentative and subject to change. All rule updates will be posted here and announced on the mailing list.
The CEAS Spam-Filter Challenge Lab Evaluation Task is an anti-spam filtering competition in which anti-spam filters are tested on several archived collections of spam and ham messages. Some of the datasets will contain actual private email delivered to a large production server. Each filter will be asked to process the e-mail dataset sequentially, and label each message as spam or ham. Filters will be given a varying amount of training feedback for messages already classified by the filter, thus simulating an actual online spam filtering environment.
To preserve privacy, participants will not have access to the data; instead, the filters will be run by the challenge organizers, and aggregate results will be reported. Participating filters must implement the TREC 2006/07 spam filter interface ("jig") to take part in the lab evaluation.
Essentially, a filter must implement the following four commands:
initialize
All steps necessary to install the software on a clean system and to prepare to classify a user's email.
classify [filename] [remaining label allowance] [remaining messages]
Read [filename], which contains exactly one email message, and write one line of output:
class=[classification] score=[score] labelReq=[label request]
Where [classification] is either 'ham' or 'spam' and [score] is a number reflecting the filter's confidence in the prediction, so that thresholding on the [score] value could be used to achieve a desired tradeoff between true positives and false positives. The higher the [score] value, the more probable it is that the message is spam.
The [remaining label allowance] and [remaining messages] parameters are used to facilitate the more realistic "limited feedback" setting, in which the user does not provide gold standard judgments for all messages in the e-mail stream. The filter may decide to ask for the gold standard judgment of a message using one of the following three [label request] options:
noRequest
Makes no label request.
labelN
Requests a label. If no label is available (due to exhausted allowance), then no training is performed.
labelB
Requests a label. If no label is available (due to exhausted allowance), then bootstrap training is performed using the filter's prediction as the 'true' label for training.
A naive approach would be to have the filter make a 'labelN' request for every message. This would request labels and train normally until the feedback quota is exhausted, and then would not update for the remainder of the run.
train [judgement] [filename]
Take note of the gold-standard judgement for [filename], where [judgement] is either 'ham' or 'spam'.
finalize
Used for cleanup and stopping servers if needed.
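The four commands above can be sketched as a small Python dispatcher. This is only an illustration under stated assumptions, not the jig's required layout: the function names (run, classify_message, classify_file, choose_label_request, toy_score), the keyword-based score, and the 0.5 threshold are all hypothetical, and how the jig actually invokes each command is defined by the TREC interface rather than by this sketch.

```python
def toy_score(message: bytes) -> float:
    # Placeholder scorer (assumption): a real filter would use a trained model.
    return 1.0 if b"free money" in message.lower() else 0.0


def choose_label_request(label_allowance: int, remaining_messages: int) -> str:
    # Variant of the naive policy described above: ask for a label with
    # 'labelN' while allowance remains, otherwise make no request.
    # remaining_messages is unused here; a smarter policy could use it
    # to pace requests over the rest of the stream.
    return "labelN" if label_allowance > 0 else "noRequest"


def classify_message(message: bytes, label_allowance: int,
                     remaining_messages: int, threshold: float = 0.5) -> str:
    # Build the single output line: class=... score=... labelReq=...
    score = toy_score(message)
    classification = "spam" if score >= threshold else "ham"
    request = choose_label_request(label_allowance, remaining_messages)
    return f"class={classification} score={score:.4f} labelReq={request}"


def classify_file(filename: str, label_allowance: int,
                  remaining_messages: int) -> str:
    # [filename] contains exactly one email message.
    with open(filename, "rb") as handle:
        return classify_message(handle.read(), label_allowance,
                                remaining_messages)


def run(argv: list) -> str:
    # argv is the command followed by its arguments, e.g.
    # ["classify", "msg.txt", "10", "100"].
    command, args = argv[0], argv[1:]
    if command == "classify":
        filename, allowance, remaining = args
        return classify_file(filename, int(allowance), int(remaining))
    if command in ("initialize", "train", "finalize"):
        # This toy filter keeps no state, so these commands do nothing.
        return ""
    raise ValueError(f"unknown command: {command}")
```

A real entry point would print the string returned by run for the classify command, and train would update the model from the supplied judgement instead of doing nothing.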
Each filter will be evaluated using two training regimes:
Filters will be evaluated using two measures which combine the percentage of spam blocked and its false positive rate:
The LAM calculation will be smoothed. The exact formula used for the competition will be:
FPrate = (#ham-errors + 0.5) / (#ham + 0.5)
FNrate = (#spam-errors + 0.5) / (#spam + 0.5)
lam = invlogit((logit(FPrate) + logit(FNrate)) / 2)
where
logit(p) = log(p/(1-p))
invlogit(x) = (e^x)/(1 + e^x)
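As a minimal sketch, the smoothed LAM formula above translates directly into Python. The function and parameter names (smoothed_lam, ham_errors, ham_total, spam_errors, spam_total) are illustrative and not part of the competition tooling.

```python
import math


def logit(p: float) -> float:
    # logit(p) = log(p / (1 - p))
    return math.log(p / (1.0 - p))


def invlogit(x: float) -> float:
    # invlogit(x) = e^x / (1 + e^x)
    return math.exp(x) / (1.0 + math.exp(x))


def smoothed_lam(ham_errors: int, ham_total: int,
                 spam_errors: int, spam_total: int) -> float:
    # The 0.5 smoothing keeps both rates above zero when a filter makes
    # no errors. This sketch assumes at least one message of each class
    # is classified correctly; otherwise a rate equals 1 and logit fails.
    fp_rate = (ham_errors + 0.5) / (ham_total + 0.5)
    fn_rate = (spam_errors + 0.5) / (spam_total + 0.5)
    return invlogit((logit(fp_rate) + logit(fn_rate)) / 2.0)
```

Because the two rates are averaged on the logit scale, the result always lies between FPrate and FNrate, and equals both when they coincide.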