CEAS 2007 Live Spam Challenge

Anti-Spam Filter Configuration
Version 1.0.0

Copyright IBM Corp. 2007

Overview

The goal of the CEAS Live Spam Challenge is to evaluate anti-spam filters in a realistic environment that closely approximates real-world anti-spam installations. Each filter will be asked to process a live e-mail stream and label each incoming message as spam or ham. The competition environment accurately simulates typical e-mail installations so that most existing anti-spam systems can be used essentially out of the box. This is accomplished by supporting several delivery modes, including SMTP, POP3, IMAP4, and the TREC toolkit. The Competition Toolkit provides several related tools and the ability to run a local copy of the competition environment. The toolkit is not needed for integrating most anti-spam solutions into the competition environment.

E-Mail Model

The competition environment is managed by the Competition Controller. Each participant will be assigned a subdomain of ceas-challenge.cc to filter (i.e., <contestant>.ceas-challenge.cc). The Competition Controller acts as a perimeter SMTP server for each filter's subdomain. The controller delivers each message to each anti-spam filter using the delivery mechanism configured for that filter (e.g., SMTP). How each message is delivered and how each filter is expected to respond depend on the delivery mechanism chosen. Each delivery method is discussed below.

The e-mail for the competition will be collected from several production SMTP servers. The collection process records the original SMTP envelope and the original message contents and relays that data to the Competition Controller. The Competition Controller modifies the message to appear to be addressed to each contestant's simulated domain and then relays it to each contestant as appropriate. The relayed message is identical to what the anti-spam filter would have received if the message had actually been sent to the perimeter server for the simulated domain and relayed to the anti-spam filter.

The primary methods used by the Competition Controller to relay test messages are SMTP and POP3/IMAP4. The SMTP mechanism supports SMTP-based solutions, and the POP3/IMAP4 mechanism supports POP3/IMAP4 solutions. In addition, the POP3/IMAP4 mechanism can be combined with other tools to support the TREC Toolkit as well as operation behind a firewall.

Feedback Model

The Competition Controller simulates a simple user feedback model in which the recipient is given the opportunity to tag any message they receive as spam or ham. These labels are sent to each anti-spam filter to be used as training data. The simulation attempts to model real users in that only a fraction of the e-mail received will be labeled by the user, and that there may be a significant delay between when a message is delivered and when it is labeled. All feedback will be accurate and no attempt will be made to simulate user errors.

Please note that the toolkit and the test stream use a trivial feedback model in which feedback is provided for every message with a fixed delay. This model will NOT be used in the actual competition.

Configuration

The configuration of each filter is specified using a Java-style properties file. Each contestant will be asked to submit a configuration file to be installed on the Competition Controller. The configuration file tells the Competition Controller how to communicate with each filter.

There are three primary types of communication that must occur between the Competition Controller and a participating filter: submitting a message to be classified, receiving the classification response, and sending training examples. The configuration file allows the communication parameters for each step to be specified separately. For example, a filter can be configured to accept classification requests via SMTP, provide classifications via POP3/IMAP4, and accept judgments via e-mail to a special address. Filters can also be configured with separate communication parameters for spam and ham judgments.
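To make the file format concrete, the following Python sketch reads simple "key: value" lines of the kind shown throughout this guide into a dictionary. It is an illustration only and not part of the toolkit. It assumes a small subset of the Java properties syntax (single-line entries, ':' or '=' separators, comment lines starting with '#' or '!'), and the host name in the sample is a placeholder.

```python
def parse_properties(text):
    """Parse a small subset of the Java properties format into a dict.

    Assumption: only single-line "key: value" or "key=value" entries and
    comment lines starting with '#' or '!' are handled. Line continuations
    and escape sequences are not supported.
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        # Split at the first ':' or '=' separator, whichever comes first.
        positions = [i for i in (line.find(":"), line.find("=")) if i >= 0]
        if not positions:
            props[line] = ""
            continue
        cut = min(positions)
        props[line[:cut].strip()] = line[cut + 1:].strip()
    return props


# Hypothetical sample; the server name is a placeholder.
sample = """
# one property per communication step
filter.type: smtp
filter.smtp.server: mail.example.org
response.type: smtp
feedback.spam.type: none
feedback.ham.type: none
"""
config = parse_properties(sample)
print(config["filter.smtp.server"])
```

Splitting at the first separator keeps values that themselves contain a colon, such as the regular expression "(?i:spam)", intact.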

Each message sent between the Competition Controller and the anti-spam filter should contain a special 'x-ceas-tracking' header. This header is used to authenticate messages to the Competition Controller as well as to correlate each of the three message types described above with the original test message. This header will be included in all messages sent to the anti-spam filter. The header must be copied verbatim to all response messages sent by the anti-spam filter. Response messages that do not contain the appropriate tracking header will be discarded.
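The verbatim-copy requirement can be sketched in Python with the standard "email" package. This is a minimal illustration, not toolkit code; the function name is hypothetical and the header value in the sample is a made-up placeholder rather than a real tracking value.

```python
from email import message_from_string
from email.message import Message

TRACKING_HEADER = "x-ceas-tracking"


def copy_tracking_header(request, response):
    """Copy the tracking header from request to response without changes."""
    value = request[TRACKING_HEADER]  # header lookup is case-insensitive
    if value is None:
        raise ValueError("request has no %s header" % TRACKING_HEADER)
    del response[TRACKING_HEADER]     # avoid sending a duplicate header
    response[TRACKING_HEADER] = value
    return response


# The header value below is a placeholder, not a real tracking value.
request = message_from_string(
    "x-ceas-tracking: placeholder-value\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
response = copy_tracking_header(request, Message())
print(response[TRACKING_HEADER])
```

Filters that relay the original message back unchanged carry the header along automatically; an explicit copy like this is only needed when the response is built as a new message.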

The tracking header is also useful to correlate classification requests and filter judgments. The tracking header is formatted as a MIME-style attribute-value list. The same message identifier will be used for classification requests and the corresponding user-feedback message.
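One way a filter might use the shared identifier is to remember each classified message until its feedback arrives. The Python sketch below assumes, purely for illustration, that the header looks like 'id="..."; type=...' and that the identifier attribute is called "id". This guide does not name the attributes, so both the attribute names and the class are hypothetical.

```python
def parse_attribute_list(value):
    """Split 'name=value; name=value' into a dict, stripping quotes.

    Assumption: values do not contain ';' characters, even inside quotes.
    """
    attrs = {}
    for part in value.split(";"):
        name, sep, val = part.partition("=")
        if not sep:
            continue  # skip empty or malformed segments
        attrs[name.strip().lower()] = val.strip().strip('"')
    return attrs


class PendingMessages:
    """Remember classified messages until their feedback arrives."""

    def __init__(self):
        self._by_id = {}

    def remember(self, tracking_value, message_text):
        # "id" is a hypothetical attribute name.
        self._by_id[parse_attribute_list(tracking_value)["id"]] = message_text

    def match_feedback(self, tracking_value):
        """Return the stored message for a feedback header, or None."""
        message_id = parse_attribute_list(tracking_value)["id"]
        return self._by_id.pop(message_id, None)


pending = PendingMessages()
pending.remember('id="m-1"; type=request', "original message text")
matched = pending.match_feedback('id="m-1"; type=judgment')
print(matched)
```

Because feedback may arrive late or never, a real filter would also want to expire old entries from such a store.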

QUICK START GUIDE

This section describes several methods for integrating anti-spam servers into the test environment. The methods described here should cover a majority of use cases. A complete description of all the configuration options available appears in Section 3.0.

The Competition Toolkit contains template configuration files for each of the scenarios described below. The template files also include descriptions of the most important configuration options.

SMTP Filters

Overview

In the basic SMTP model, the Competition Controller delivers incoming mail to the anti-spam filter via SMTP. The anti-spam filter is expected to classify each message as spam or ham, add a header field to the message indicating its classification, and relay the message back to the Competition Controller for scoring. The Competition Controller records the filter's response by extracting the appropriate header field and analyzing its contents. Both the header used to indicate the filter's decision and its format are configurable.

User feedback is provided by sending the original message back to the anti-spam filter. Typically, this is done by addressing feedback to special "spam-report" and "ham-report" addresses based on the message's correct label.
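The classify, tag, and relay steps of the SMTP model can be sketched in Python as follows. This is an illustration under stated assumptions, not the behavior of any particular filter: the function names are hypothetical, the value "ham" for non-spam messages is a choice made here, and the host passed to relay() stands for the address of the Competition Controller.

```python
import smtplib
from email import message_from_bytes


def tag_message(raw_message, is_spam, header="x-ceas-label"):
    """Return the message bytes with a classification header added."""
    msg = message_from_bytes(raw_message)
    del msg[header]  # drop any value already present in the incoming message
    msg[header] = "spam" if is_spam else "ham"
    return msg.as_bytes()


def relay(tagged_message, sender, recipients, host, port=25):
    """Send the tagged message on to host, keeping the original envelope."""
    with smtplib.SMTP(host, port) as server:
        server.sendmail(sender, recipients, tagged_message)


tagged = tag_message(b"Subject: hi\n\nbody\n", is_spam=True)
print(message_from_bytes(tagged)["x-ceas-label"])
```

Working on bytes rather than text avoids encoding errors on messages that contain non-ASCII content, which is common in spam.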

Configuration

TEMPLATE: contestants.templates/smtp.properties

Set the address of your SMTP server:

filter.smtp.server: <host-address>

Configure your filter to act as a relay between a perimeter SMTP server and a local delivery server. Set both the incoming and outgoing servers to the address of the Competition Controller. The Competition Controller is designed to ensure that this loopback configuration will not create a mail loop.

The server must be configured to accept e-mail for the sub-domain assigned to your anti-spam filter. The sub-domain will be a concatenation of the filter name you request and "ceas-challenge.cc". For instance, myfilter.ceas-challenge.cc.

The list of valid users at myfilter.ceas-challenge.cc will not be provided. Configure your anti-spam server to relay all local mail back to the Competition Controller.

The anti-spam filter should be configured to tag incoming messages as spam or ham. By default, the Competition Controller expects spam e-mail to be tagged with an "x-ceas-label" header with the value "spam". If a different header is needed, set the following properties in the configuration file:

response.smtp.header: <my-response-header>
response.smtp.regex: <my-spam-regexp>

The regular expression must be written in the java.util.regex format. See http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html for details.
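Before submitting a custom expression, it can help to check it against the header values your filter actually emits. The Python sketch below uses Python's "re" module as a stand-in for java.util.regex; the two syntaxes agree for simple patterns like these but are not identical. It also assumes the pattern may match anywhere in the header value, which this guide does not state explicitly. The "(?i:yes)" pattern is only an example of a custom expression.

```python
import re


def is_spam_response(header_value, pattern="(?i:spam)"):
    """Return True if the header value would count as a spam label.

    Assumption: the pattern may match anywhere in the value (re.search).
    """
    if header_value is None:
        return False
    return re.search(pattern, header_value) is not None


# Candidate header values checked against the default pattern.
for value in ["spam", "SPAM", "ham", "not spam"]:
    print(value, is_spam_response(value))

# A custom pattern for a filter that marks spam with the value "YES".
print(is_spam_response("YES", "(?i:yes)"))
```

Under the match-anywhere assumption, a ham label such as "not spam" would be scored as spam by the default pattern, so ham labels should avoid containing the word "spam".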

Alternatively, the anti-spam filter can be configured to block all spam messages and only deliver ham. If this setup is desired, comment out the settings for "response.smtp.header" and "response.smtp.regex" and add the following:

response.default: spam
response.autolabel: ham

Many anti-spam filters allow user feedback to be provided via addresses for reporting spam and ham. If your filter supports such an address, set one or both of the following properties:

feedback.spam.smtp.recipient: <spam-report-username>
feedback.ham.smtp.recipient: <ham-report-username>

If reporting is not done through a special address, see the reference section for other alternatives. If user feedback is not supported at all, then add the following property values:

feedback.spam.type: none
feedback.ham.type: none

IMAP4/POP3 Filters

Overview

The Competition Controller server includes an IMAP4/POP3 server and each contestant will be provided an account on that server. Contestants can request that all incoming e-mail be delivered to their account for filtering by their IMAP4- or POP3-based anti-spam filter.

IMAP4-based filters will be expected to move any message identified as spam into an appropriate spam folder. The Competition Controller will monitor the contestant's spam folder and credit the filter for detecting any spam message that is placed in the spam folder. Any feedback provided by the user will be delivered directly to special "judgedspam" and "judgedham" folders. The anti-spam filter can monitor these folders and train on the messages found accordingly.

POP3-based filters will be expected to delete from the server any message identified as spam, and to leave on the server any message classified as ham. There is no feedback mechanism for POP3-based solutions. POP3-based solutions will either have to function without feedback, or use a non-POP3 based feedback mechanism such as SMTP.

IMAP4 Configuration

TEMPLATE: contestants.templates/imap4.properties

Configure your filter to access the Competition Controller IMAP4 server. Set your filter's IMAP4 account and password as appropriate. For locally run Competition Controllers, this will be the account and password you created during the installation process. For the public test server, that data will have been provided when requesting access to the test server.

Set up your filter to process all e-mail received in the filter's inbox. Any messages identified as spam should be moved to the filter's "spam" folder. If a different folder is desired, modify the following property setting to read:

response.spam.imap.folder: <my-spam-folder>

Messages tagged as spam by the user will be placed in the "judgedspam" folder, and messages tagged as ham will be placed in the "judgedham" folder. If supported, configure your filter to scan these folders for new training examples. If different folder locations are required, modify the following property settings:

feedback.spam.imap.folder: <my-train-spam-folder>
feedback.ham.imap.folder: <my-train-ham-folder>

Adjust the polling interval on your server to ensure that messages are processed within the one minute time limit.
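For filters that do not already speak IMAP4, the polling step described above can be sketched with Python's standard "imaplib" module. This is a minimal illustration, not toolkit code: the function names are hypothetical, the connection arguments are placeholders, and classify stands for the filter's own decision function.

```python
import imaplib


def connect(host, user, password):
    """Open and authenticate an IMAP4 connection (arguments are placeholders)."""
    conn = imaplib.IMAP4(host)
    conn.login(user, password)
    return conn


def process_inbox(conn, classify, spam_folder="spam"):
    """Move unseen INBOX messages that classify() flags as spam.

    classify is the filter's own function: raw message bytes -> bool.
    Returns the number of messages moved.
    """
    conn.select("INBOX")
    _status, data = conn.search(None, "UNSEEN")
    moved = 0
    for num in data[0].split():
        # Fetching RFC822 also marks the message as seen, so the next
        # poll will skip it.
        _status, parts = conn.fetch(num, "(RFC822)")
        raw = parts[0][1]
        if classify(raw):
            # Basic IMAP4 has no move command: copy, then delete the original.
            conn.copy(num, spam_folder)
            conn.store(num, "+FLAGS", "\\Deleted")
            moved += 1
    conn.expunge()
    return moved
```

Calling process_inbox() in a loop with a sleep well under one minute between calls would keep responses inside the time limit. The same pattern, pointed at the "judgedspam" and "judgedham" folders, could be used to collect training examples.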

POP3 Configuration

TEMPLATE: contestants.templates/pop3.properties

Configure your filter to access the Competition Controller POP3 server. Set your filter's POP3 account and password as appropriate. For locally run Competition Controllers, this will be the account and password you created during the installation process. For the public test server, that data will have been provided when requesting access to the test server.

Set up your filter to process all e-mail received in the filter's inbox. Any message identified as spam should be deleted. Any message identified as ham must be left on the server. Some POP3 filters cannot be configured to leave ham messages on the server; for these filters the standard POP3 template will not work and a custom setup is required. See the reference section for possible alternatives.

The POP3 template does not provide any mechanism for receiving user feedback. See the reference section for alternative pathways for accessing feedback data.

Adjust the polling interval on your server to ensure that messages are processed within the one minute time limit.
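The delete-spam, keep-ham behavior can be sketched with Python's standard "poplib" module. As with any sketch here, this is an illustration rather than toolkit code: the function names are hypothetical, the connection arguments are placeholders, and classify stands for the filter's own decision function.

```python
import poplib


def connect(host, user, password):
    """Open and authenticate a POP3 connection (arguments are placeholders)."""
    conn = poplib.POP3(host)
    conn.user(user)
    conn.pass_(password)
    return conn


def process_maildrop(conn, classify):
    """Delete every message that classify() flags as spam; leave ham in place.

    classify is the filter's own function: raw message bytes -> bool.
    Returns the number of messages deleted.
    """
    count, _size = conn.stat()
    deleted = 0
    for index in range(1, count + 1):
        _resp, lines, _octets = conn.retr(index)
        raw = b"\r\n".join(lines)
        if classify(raw):
            conn.dele(index)
            deleted += 1
    # Deletions only take effect when the session ends with QUIT.
    conn.quit()
    return deleted
```

Because ham stays on the server, each new session sees the earlier ham messages again. A filter could record the unique IDs returned by conn.uidl() to avoid classifying the same message twice.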

TREC Configuration

Overview

The TREC toolkit is a framework for testing anti-spam filters that was developed for the TREC anti-spam track. We provide a bridge to the TREC toolkit as many existing anti-spam systems already support the TREC framework. Support for the TREC toolkit is implemented by combining the IMAP4 and SMTP models above with scripts for classifying messages from the IMAP4 server and sending classification results back via SMTP.

Configuration

TEMPLATE: contestants.templates/trec.properties

The TREC Toolkit bridge works by combining the IMAP4 delivery method, fetchmail, and shell scripts for converting fetchmail actions into calls to TREC-based filters.

The TREC Toolkit bridge is run with the following command:

trec-bridge <controller-address> <filter-name> <trec-filter-directory>

SMTP behind a Firewall

Overview

The standard SMTP setup will not work for participants whose anti-spam filter sits behind a firewall. One good solution is to pull the test stream from the Competition Controller's IMAP4 server and relay each message to the anti-spam filter using fetchmail.

A sample fetchmail configuration file for performing this task is provided in fetchmail/smtp-firewall.fetchmail. The configuration file will read all mail in the IMAP inbox, judgedspam, and judgedham folders and relay it to the specified SMTP server. The anti-spam filter can detect which IMAP folder each message originated from based on the presence and value of the 'X-CEAS-Judgment' header. If no header exists, the message is from the INBOX and should be classified. If the header exists, then the message is a judgment and the 'X-CEAS-Judgment' header indicates the correct classification of the message.

The configuration file provided is just a sample and will not work for all filters. However, fetchmail is a very powerful tool and it should be able to connect most anti-spam filters behind a firewall into the competition environment.
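The header test described above, which separates classification requests from judgments, can be sketched in Python as follows. The function name and the returned tuples are illustrative choices, not part of the toolkit.

```python
from email import message_from_string


def route_message(raw_message):
    """Decide what to do with a message relayed by fetchmail.

    Returns ("classify", None) for test-stream messages and
    ("train", label) for judgments, where label is the header value.
    """
    msg = message_from_string(raw_message)
    judgment = msg["X-CEAS-Judgment"]  # header lookup is case-insensitive
    if judgment is None:
        return ("classify", None)
    return ("train", judgment.strip().lower())


print(route_message("Subject: offer\n\nbody\n"))
print(route_message("x-ceas-judgment: spam\nSubject: offer\n\nbody\n"))
```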

Configuration

TEMPLATE: contestants.templates/smtp-firewall.properties

Configure your anti-spam filter as described in Section 2.1, but do not modify any properties in smtp-firewall.properties whose names match "filter.*" or "feedback.*".

Download the Competition Toolkit. Make a copy of

/opt/fetchmail/smtp-firewall.fetchmailrc

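If you are using a local Competition Controller, edit your copy of the above file and replace the server name in the line:

poll smtp01.ceas-challenge.cc

with the address of your local Competition Controller. Then start fetchmail with your copy of the file: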

fetchmail -f smtp-firewall-copy.fetchmailrc --user <filter-name> -d 10s

    
REFERENCE
---------
3.1 Submitting Messages for Classification
------------------------------------------
  filter.type: <delivery-type>				[required]
      Specify the primary delivery mechanism for submitting messages to
      be classified.  Allowable values are:
          smtp
              Relay the competition e-mail stream via SMTP.  The
              message will be relayed with the original SMTP "MAIL
              From" and "RCPT To" address lists.  The SMTP "HELO" and
              TCP/IP connection information will be stored in the
              first received line.
	  imap4
	      Relay the competition e-mail stream to the local
              POP3/IMAP4 server.  The message will be delivered to the
              contestant's account on the competition POP3/IMAP4
              server.  The SMTP "MAIL From" will be recorded in the
              "Return-Path" header.  The SMTP recipient list will be
              recorded in the "Delivered-To" header.
	  email
	      The competition stream will be sent to the specified
	      e-mail address.  The message will be sent with the
	      original SMTP "MAIL From".  The original recipient list
	      is not available when e-mail delivery is used.
          none
              The competition stream will not be delivered anywhere.
              Not very useful for the test stream itself, but listed
              here because the same delivery type is useful for
              dropping feedback messages.
  filter.smtp.server: <host-address>			[required for smtp]
  filter.smtp.port: <port-number>			[optional]
      Sets the host address and port used to relay competition
      messages when the filter type is set to "smtp".  Port number
      defaults to 25.
  filter.smtp.recipient: <username>			[optional]
      Configure SMTP delivery to send all messages to <username>
      rather than the message's original recipient list. 
  filter.imap.folder: <folder-name>		        [optional]
      Configure POP3/IMAP4 delivery to redirect messages to the
      indicated folder instead of the default, the user's inbox.
  filter.email.address: <email-address>			[required for e-mail]
      Destination address for e-mail based delivery.
3.2 Sending Back Classifications 
--------------------------------
Classifications are sent back to the Competition Controller using
either SMTP or via POP3/IMAP4 mailboxes.  All responses take the form
of an e-mail message.  It is recommended that the original e-mail
message be sent back for this purpose as it helps ensure the response
is correctly recorded.  At a minimum, the "x-ceas-tracking" header
present in the original classification request must be sent back with
the filter's response.  For security reasons, any response received
without the original tracking header will be discarded.
  response.type: <type>                                 [required]
      Specify the mechanism used by the anti-spam filter to return
      responses back to the Competition Controller for scoring.
      Allowable values are:
          smtp
	      The anti-spam filter will relay the original message
              back to the Competition Controller via SMTP with its
              response stored in a special header.
	  imap4
	      The anti-spam filter will move all spam messages into a
	      special junk folder.  The Competition Controller will
	      scan the specified junk folder and score each message
	      found in that folder as spam.  All other messages are
	      assumed to be classified as ham.
	  pop3
	      The anti-spam filter will delete all spam from its
	      POP3/IMAP4 account.  The Competition Controller will
	      scan the filter's INBOX and score as spam any message
	      that is deleted.  Messages that remain in the INBOX
	      after the response timeout period has expired will be
	      marked ham.
  response.smtp.header: <mime-header>			[optional]
      Configures the MIME header used to store the filter's SMTP-based
      response.  If this property is not specified, a value of
      'x-ceas-label' is assumed.
  response.smtp.regex: <java-regex>			[optional]
      Specifies the regular expression for determining whether the
      filter's response header represents a spam classification.  Any
      message in which the response header matches <java-regex> will
      be scored as spam.  All other responses will be treated as ham
      classifications.  
      The regular expression uses the Java regular expression syntax
      (see java.util.regex package for details), which in turn is
      based on the Perl regular expression syntax.  The default is
      "(?i:spam)" which performs a case insensitive match for the
      string literal "spam".
  response.spam.imap.folder: <mailbox>                  [optional]
      Specify the IMAP4 folder to check for IMAP4-based spam
      responses.  The default is "spam".
3.3 Sending Judgments
---------------------
The mechanisms available for sending judgments closely parallel those
available for sending the test stream.  Separate delivery options can
be specified for spam and ham judgments.  Judgments will be sent using
the original test message but with an 'x-ceas-judgment' header added
with the value of "spam" or "ham" as appropriate.  The original
message and the judgment message will be sent with identical message
IDs stored in their respective "x-ceas-tracking" headers.
  feedback.spam.type: <delivery-type>			[required]
  feedback.ham.type: <delivery-type>			[required]
      Specify the primary delivery mechanism for sending judgments to
      the filter.  Allowable values are:
          smtp
              Relay judgments via SMTP.
	  imap4
	      Relay judgment messages to the filter's judgment folders
	      on the Competition Controller's POP3/IMAP4 server.  By
	      default, spam judgments go to "judgedspam" and ham
	      judgments go to "judgedham".
	  email
	      Send judgments via e-mail.
          none
              Do not send judgments.  
  feedback.spam.smtp.server: <host-address>             [optional]
  feedback.spam.smtp.port: <port-number>
  feedback.ham.smtp.server: <host-address>
  feedback.ham.smtp.port: <port-number>
      Sets the host address and port used to relay judgment
      messages when the feedback type is set to "smtp".  If not
      specified, the feedback server address and port number default
      to the server and port defined in "filter.smtp.server" and
      "filter.smtp.port".
  feedback.spam.smtp.recipient: <username>              [optional]
  feedback.ham.smtp.recipient: <username>
      Configure judgments to be sent to <username> rather than the
      message's original recipient list.  This setting is useful for
      solutions that provide special addresses for sending false
      positive and false negative reports.
  feedback.spam.imap.folder: <folder-name>		[optional]
  feedback.ham.imap.folder: <folder-name>
      Configure judgments to be sent to the indicated folder instead
      of the default.
  feedback.spam.email.address: <email-address>		[required for e-mail]
  feedback.ham.email.address: <email-address>
      Destination address for sending judgments via e-mail.
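
The required properties and allowable values listed in this reference can be checked before a configuration file is submitted. The Python sketch below is one possible check, written from the reference text alone; it is not the validation performed by the Competition Controller, and the function name and message wording are illustrative.

```python
DELIVERY_TYPES = {"smtp", "imap4", "email", "none"}
RESPONSE_TYPES = {"smtp", "imap4", "pop3"}


def check_config(props):
    """Return a list of problems found in a dict of property values."""
    problems = []
    checks = [
        ("filter.type", DELIVERY_TYPES),
        ("response.type", RESPONSE_TYPES),
        ("feedback.spam.type", DELIVERY_TYPES),
        ("feedback.ham.type", DELIVERY_TYPES),
    ]
    for key, allowed in checks:
        value = props.get(key)
        if value is None:
            problems.append("missing required property: " + key)
        elif value not in allowed:
            problems.append("invalid value for %s: %s" % (key, value))
    # Conditional requirements from the reference entries.
    if props.get("filter.type") == "smtp" and "filter.smtp.server" not in props:
        problems.append("filter.smtp.server is required for smtp delivery")
    if props.get("filter.type") == "email" and "filter.email.address" not in props:
        problems.append("filter.email.address is required for email delivery")
    for kind in ("spam", "ham"):
        type_key = "feedback.%s.type" % kind
        address_key = "feedback.%s.email.address" % kind
        if props.get(type_key) == "email" and address_key not in props:
            problems.append(address_key + " is required for email feedback")
    return problems


# Hypothetical configuration; the server name is a placeholder.
example = {
    "filter.type": "smtp",
    "filter.smtp.server": "mail.example.org",
    "response.type": "smtp",
    "feedback.spam.type": "none",
    "feedback.ham.type": "none",
}
print(check_config(example))
```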