CEAS 2008 Spam-Filter Challenge Lab Evaluation Task

Competition Guidelines
Version 1.0

NOTE: The following rules are tentative and subject to change. All rule updates will be posted here and announced on the mailing list.

1. Overview

The CEAS Spam-Filter Challenge Lab Evaluation Task is an anti-spam filtering competition in which anti-spam filters are tested on several archived collections of spam and ham messages. Some of the datasets will contain actual private email delivered to a large production server. Each filter will be asked to process the e-mail dataset sequentially, and label each message as spam or ham. Filters will be given a varying amount of training feedback for messages already classified by the filter, thus simulating an actual online spam filtering environment.

2. Competition Model

To preserve privacy, participants will not have access to the data; instead, the filters will be run by the challenge organizers, and aggregate results will be reported. Participating filters must implement the TREC 2006/07 spam filter interface ("jig") to take part in the lab evaluation.

Essentially, a filter must implement the following four commands:

initialize

All steps necessary to install the software on a clean system and to prepare to classify a user's email.

classify [filename] [remaining label allowance] [remaining messages]

Read [filename], which contains exactly one email message, and write one line of output:

class=[classification] score=[score] labelReq=[label request]

Here [classification] is either 'ham' or 'spam', and [score] is a number reflecting the filter's confidence in the prediction, so that thresholding on [score] can be used to achieve a desired tradeoff between true positives and false positives. The higher the [score] value, the more probable it is that the message is spam.

The [remaining label allowance] and [remaining messages] parameters are used to facilitate the more realistic "limited feedback" setting, in which the user does not provide gold standard judgments for all messages in the e-mail stream. The filter may decide to ask for the gold standard judgment of a message using one of the following three [label request] options:

noRequest

Makes no label request.

labelN

Requests a label. If no label is available (due to exhausted allowance), then no training is performed.

labelB

Requests a label. If no label is available (due to exhausted allowance), then bootstrap training is performed using the filter's prediction as the 'true' label for training.

A naive approach would be to have the filter make a 'labelN' request for every message. This would request labels and train normally until the feedback quota is exhausted, and then would not update for the remainder of the run.

train [judgement] [filename]

Take note of the gold-standard judgment for [filename], where [judgement] is either 'ham' or 'spam'.

finalize

Used for cleanup and stopping servers if needed.
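
The four commands above can be sketched as a single Python script. This is only an illustration under stated assumptions, not the official jig or a reference filter: the single-script dispatch, the function names, and the keyword-based toy_score are all hypothetical, and the label-request policy is the naive "always labelN" approach described above.

```python
import sys

# The three label-request options named in the guidelines.
LABEL_REQUESTS = ("noRequest", "labelN", "labelB")

# Hypothetical keyword list; a real filter would use a learned model.
SPAM_HINTS = ("viagra", "lottery", "winner")


def toy_score(text):
    """Placeholder score in [0, 1]; higher means more likely spam."""
    lowered = text.lower()
    hits = sum(1 for word in SPAM_HINTS if word in lowered)
    return hits / len(SPAM_HINTS)


def choose_label_request(remaining_allowance, remaining_messages):
    """Naive policy: request a label for every message with labelN."""
    return "labelN"


def format_result(is_spam, score, label_request):
    """Build the single output line expected from the classify command."""
    if label_request not in LABEL_REQUESTS:
        raise ValueError("unknown label request: " + label_request)
    classification = "spam" if is_spam else "ham"
    return f"class={classification} score={score:.6f} labelReq={label_request}"


def classify(filename, remaining_allowance, remaining_messages):
    """Read exactly one message from filename and return the result line."""
    with open(filename, "rb") as handle:
        text = handle.read().decode("latin-1")  # latin-1 never fails to decode
    score = toy_score(text)
    request = choose_label_request(remaining_allowance, remaining_messages)
    return format_result(score >= 0.5, score, request)


def main(argv):
    if len(argv) < 2:
        print("usage: filter.py COMMAND [ARGS...]", file=sys.stderr)
        return
    command, args = argv[1], argv[2:]
    if command == "classify":
        print(classify(args[0], int(args[1]), int(args[2])))
    elif command in ("initialize", "train", "finalize"):
        pass  # this toy filter keeps no state, so there is nothing to do
    else:
        print("unknown command: " + command, file=sys.stderr)


if __name__ == "__main__":
    main(sys.argv)
```

Assuming the script is saved as filter.py (a hypothetical name), running "python filter.py classify msg.eml 100 5000" would print one line in the class=... score=... labelReq=... format. A less naive policy could use the two remaining-count arguments to spread label requests over the whole stream instead of spending the allowance on the earliest messages.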

3. Competition Rules

  1. All contestants must register to participate in the competition.
  2. Each group can enter up to two anti-spam filters.
  3. Filters must be submitted before the start of the Live filtering competition, which begins August 5th, 14:00 UTC.
  4. Filters must implement the TREC 2007 spam-filter evaluation interface.
  5. Filters can use up to 1 GB of RAM and up to 10 GB of disk space. Please contact the organizers if you need to upload a spam filter submission larger than 50 MB.
  6. Filters will be required to process 1 (one) message per second on average. Filters that exceed their cumulative time quota will be stopped and all remaining unclassified messages in the e-mail stream will be automatically scored as 'ham'.
  7. To preserve privacy, filters will not have access to network resources when classifying the e-mail stream.
  8. Learning-based filters can be pre-trained with any or all public and private data available to the contestant.
  9. Each filter will be evaluated using two training regimes:

  10. Feedback will contain the official judgment for a message. There will be no attempt to simulate user labeling errors.
  11. Filters will be evaluated using two measures which combine the percentage of spam blocked and its false positive rate:

    The LAM calculation will be smoothed. The exact formula used for the competition will be:

    FPrate = (#ham-errors + 0.5) / (#ham + 0.5)
    FNrate = (#spam-errors + 0.5) / (#spam + 0.5)
    lam = invlogit((logit(FPrate) + logit(FNrate)) / 2)

    where

    logit(p) = log(p/(1-p))
    invlogit(x) = (e^x)/(1 + e^x)

  12. All messages in each e-mail dataset will be scored. Results will be reported separately for each dataset.
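
The smoothed LAM formula from rule 11 can be written out as a short Python sketch. The function and parameter names are assumptions chosen for illustration; the arithmetic follows the formula exactly as stated in the rules.

```python
import math


def logit(p):
    """logit(p) = log(p / (1 - p)), defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def invlogit(x):
    """invlogit(x) = e^x / (1 + e^x)."""
    return math.exp(x) / (1.0 + math.exp(x))


def smoothed_lam(ham_errors, ham_total, spam_errors, spam_total):
    """Smoothed logistic average misclassification rate, as in rule 11.

    Note: with this smoothing, a filter that misclassifies every ham
    (or every spam) message gets a rate of exactly 1, for which logit
    is undefined; this sketch does not handle that edge case.
    """
    fp_rate = (ham_errors + 0.5) / (ham_total + 0.5)
    fn_rate = (spam_errors + 0.5) / (spam_total + 0.5)
    return invlogit((logit(fp_rate) + logit(fn_rate)) / 2.0)


if __name__ == "__main__":
    # Made-up counts, purely to show the call.
    print(smoothed_lam(ham_errors=2, ham_total=1000,
                       spam_errors=30, spam_total=3000))
```

Because the two rates are averaged on the logit scale, the result always lies between FPrate and FNrate, and it equals both when they coincide.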