Algorithmically determining Store-and-forward MTA Relays using DomainKeys
Miles Libbey, Peter Ludemann
Store-and-forward MTA relaying servers have frequently presented problems to various antispam techniques, such as IP-based reputation or email authentication. Algorithms that find email relaying servers can use knowledge about a domain’s outbound IP addresses combined with cryptographic domain authentication frameworks such as DomainKeys. This paper presents one such algorithm.
An Adaptive, Semi-Structured Language Model Approach to Spam Filtering on a New Corpus
Motivated by the absence of rigorous experimentation in the area of spam filtering using realistic email data, we present a newly-assembled corpus of genuine and unsolicited (spam) email, dubbed GenSpam, to be made publicly available. We also propose an adaptive model for semi-structured document classification based on language model component interpolation. We compare this with a number of alternative classification models, and report promising results on the spam filtering task using a specifically assembled test set to be released as part of the GenSpam corpus.
An Email and Meeting Assistant using Graph Walks
Einat Minkov, William Cohen
We describe a framework for representing email as well as meeting information as a joint graph. In the graph, documents and meeting descriptions are connected via other non-textual objects representing the underlying structure-rich data. This framework integrates content, social networks and a timeline in a structural graph. Extended similarity metrics for objects embedded in the graph can be derived using a lazy graph walk paradigm. In this paper we evaluate this general framework on two meeting- and email-related tasks. A novel task considered is finding the email addresses of relevant attendees for a given meeting. Another task we define and evaluate is finding the full set of email-address aliases for a person, given the corresponding name string. The experimental results show the promise of this approach over other possible methods.
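To illustrate the kind of lazy graph walk the abstract describes, here is a minimal sketch in Python. The graph, node names, and restart probability are illustrative assumptions, not the paper's actual data or implementation; the score a walk assigns to each node serves as a similarity metric from the start node.

```python
def lazy_graph_walk(edges, start, restart=0.5, steps=10):
    """Random walk with restart (a "lazy" walk): at each step, a
    `restart` fraction of the probability mass returns to the start
    node; the rest follows outgoing edges uniformly.  The resulting
    scores rank all nodes by similarity to `start`."""
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)
    scores = {n: 0.0 for n in nodes}
    scores[start] = 1.0
    for _ in range(steps):
        nxt = {n: 0.0 for n in nodes}
        nxt[start] += restart  # lazy restart mass
        for node, mass in scores.items():
            targets = edges.get(node, [])
            if targets:
                share = (1 - restart) * mass / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                nxt[start] += (1 - restart) * mass  # dangling node
        scores = nxt
    return scores

# Toy joint graph: a meeting node linked to a message, which is
# linked to two attendee email addresses (all names hypothetical).
graph = {"meeting1": ["msg1"],
         "msg1": ["alice@x.com", "bob@x.com"],
         "alice@x.com": ["msg1"],
         "bob@x.com": ["msg1"]}
ranks = lazy_graph_walk(graph, "meeting1")
```

Ranking the email-address nodes by their walk score is one way to suggest attendees for a meeting, in the spirit of the task described.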
An Empirical Study of Clustering Behavior of Spammers and Group-based Anti-Spam Strategies
Fulu Li, Mo-Han Hsieh
We conducted an empirical study of the clustering behavior of spammers and explored group-based anti-spam strategies. We propose blocking spammers as groups instead of dealing with each spam message individually. We empirically observe that, under a grouping criterion such as sharing the same URL in the spam mail, the relationships among spammers exhibit highly clustered structures. By examining spam mails gathered over a seven-day period, we found that a spammer associated with multiple groups has a higher probability of sending more spam mails in the near future. We also observed that spam mails from the same group of spammers often arrive in bursts, and that a very small fraction of the active spammers accounted for a large portion of the total spam mails. Based on our findings, we propose a group-based anti-spam framework. Preliminary results show that our approach can serve as a complementary tool for existing anti-spam systems, enabling them to block organized spammers more efficiently.
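The grouping criterion the abstract mentions (senders sharing a URL) can be sketched as follows. This is an illustrative reconstruction, not the paper's code; the record format and example IPs/URLs are assumptions.

```python
from collections import defaultdict

def group_spammers_by_url(spam_records):
    """Group sender IPs by the URLs advertised in their spam, so that
    blocking can target the group rather than individual messages.
    `spam_records` is a list of (sender_ip, [urls]) pairs.  Also count
    how many groups each sender belongs to, since (per the study)
    multi-group senders are more likely to keep spamming."""
    groups = defaultdict(set)
    for sender_ip, urls in spam_records:
        for url in urls:
            groups[url].add(sender_ip)
    membership = defaultdict(int)
    for members in groups.values():
        for ip in members:
            membership[ip] += 1
    return dict(groups), dict(membership)

records = [("1.2.3.4", ["http://a.example"]),
           ("1.2.3.5", ["http://a.example", "http://b.example"]),
           ("5.6.7.8", ["http://b.example"])]
groups, counts = group_spammers_by_url(records)
```

Here `1.2.3.5` sits in two groups, flagging it as the kind of multi-group sender the study found most likely to send more spam.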
An Exploratory Study of the W3C Mailing List Test Collection for Retrieval of Emails with Pro/Con Arguments
Yejun Wu, Douglas Oard, Ian Soboroff
The W3C mailing list test collection, an information retrieval test collection for email, was developed for the TREC Enterprise Search track in 2005. One task in that track was to retrieve emails that contribute at least one pro/con related to a specific topic. This paper describes the test collection and presents a preliminary evaluation of its suitability for evaluating such systems, including an analysis of topic types found in the collection, characterization of inter-assessor agreement on pro/con judgments, and an example of the evaluation results that can be obtained using the collection. There is clear evidence that the collection is useful in its present form, but several areas for improvement can be identified. In particular, some topic types found in the collection do not seem well suited to pro/con judgment. The paper concludes with suggestions for future work on the design of test collections and information retrieval systems for this task.
Annotating Subsets of the Enron Email Corpus
Jade Goldstein, Andres Kwasinski, Paul Kingsbury, Roberta Evans Sabin, Albert McDowell
We present an annotation project for two subsets of the Enron email corpus. The first is a subset of the UC Berkeley Enron Email Analysis Project and the second consists of a portion of emails from the Voice Transcripts Email Correlated Corpora. We annotated the emails using parts of the automatic content extraction (ACE) annotation guidelines, which we extended for the email domain. We also categorized the emails with email speech acts, marked whether they contained discussions of meetings/conversations in the text and assigned a correlation of the subject line with the text body.
Batch and Online Spam Filter Comparison
Gordon V. Cormack, Andrej Bratko
In the TREC 2005 Spam Evaluation Track, a number of popular spam filters, all owing their heritage to Graham's A Plan for Spam, did quite well. Machine learning techniques reported elsewhere to perform well were hardly represented in the participating filters, and not represented at all in the better results. A non-traditional technique, Prediction by Partial Matching (PPM), performed exceptionally well, at or near the top of every test. Are the TREC results an anomaly? Is PPM really the best method for spam filtering? How are these results to be reconciled with others showing that methods like Support Vector Machines (SVM) are superior? We address these issues by testing implementations of five different classification methods on the TREC public corpus using the online evaluation methodology introduced in TREC. These results are complemented with cross-validation experiments, which facilitate a comparison of the methods considered in the study under different evaluation schemes, and also give insight into the nature and utility of the evaluation regimens themselves. For comparison with previously published results, we also conducted cross-validation experiments on the Ling-Spam and PU1 datasets. These tests reveal substantial differences attributable to different test assumptions, in particular batch vs. online training and testing, the order of classification, and the method of tokenization. Notwithstanding these differences, the methods that perform well at TREC also perform well using established test methods and corpora. Two previously untested methods, one based on Dynamic Markov Compression and one using logistic regression, compare favorably with competing approaches.
Breaking Anti-Spam Systems with Parasitic Spam
Morton Swimmer, Ian Whalley, Barry Leiba, Nathaniel Borenstein
The existence of networks of `bots' raises the possibility of a new type of spam that breaks the current paradigm of spam defense, in which the defense acts purely as a filter. This spam, which we call parasitic spam, looks to a filter very much like spam with scraped text, but instead contains legitimate content going from a legitimate sender to a legitimate recipient. In this problem statement, we discuss what parasitic spam is and is not, the likelihood of parasitic spam appearing in the wild, and possible countermeasures.
CC Prediction with Graphical Models
We address the problem of suggesting who to add as an additional recipient (i.e. cc, or carbon copy) for an email under composition. We address the problem using graphical models for words in the body and subject line of the email as well as the recipients given so far on the email. The problem of cc prediction is closely related to the problem of expert finding in an organization. We show that graphical models present a variety of solutions to these problems. We present results using naively structured models and introduce a powerful new modeling tool: plated factor graphs.
Can DNS-Based Blacklists Keep Up with Bots?
Anirudh Ramachandran, David Dagon, Nick Feamster
Many Internet Service Providers (ISPs), anti-virus companies, and enterprise email vendors use Domain Name System-based Blackhole Lists (DNSBLs) to keep track of IP addresses that originate spam, so that future emails sent from these IP addresses can be rejected out of hand. Despite the widespread use of DNSBLs, to our knowledge there has not been a thorough evaluation of the effectiveness of blackhole lists in blocking spam. Although our previous work briefly surveyed the completeness of DNSBLs for various types of spamming techniques (specifically, botnets and short-lived BGP routes) at the time each piece of spam was received, neither that study nor any other that we are aware of has examined the response time of DNSBLs. This paper presents a preliminary evaluation of the responsiveness of blacklists for a specific set of spamming IP addresses that are known to come from a spamming botnet that spreads via the ``Bobax'' vulnerability.
Deployment Experience: Rolling Out a New Antispam Solution in a Large Corporation
Barry Leiba, Jason Crawford
Our research group has developed new, state-of-the-art antispam software, described in other papers. We are in the process of deploying that software in a large production corporate infrastructure, as a replacement for the prior antispam solution. This paper describes how we went about the deployment, and our experiences therewith, with an eye toward pointing out the icebergs and the lifeboats, to help make the process go smoothly for others.
Dynamic Port 25 blocking to control SPAM zombies
In this paper we present the results of a technique to block the activity of SPAM Zombies without otherwise curtailing services of subscribers through generalized Port 25 blocks and without incurring an increased load on the provider’s support center.
Email Thread Reassembly Using Similarity Matching
Jen-Yuan Yeh, Aaron Harnly
Email thread reassembly is the task of linking messages by parent/child relationships. In this paper, we present two approaches to address this task. One exploits previously undocumented header information from the Microsoft Exchange Protocol. The other uses string similarity metrics and a heuristic algorithm to reassemble threads in the absence of header information. The pros and cons of both methods are discussed. The similarity matching method is evaluated using the Enron email corpus and found to perform well.
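The similarity-matching approach described above can be sketched as a simple heuristic: strip reply/forward prefixes from subjects and link each message to the most recent earlier message with a sufficiently similar subject. This is an illustrative reconstruction, assuming a threshold and message format not taken from the paper.

```python
import difflib

def reassemble_thread(messages, threshold=0.8):
    """Heuristic thread reassembly without In-Reply-To headers: link
    each message to the most recent earlier message whose normalized
    subject is sufficiently similar.  `messages` is a list of
    (msg_id, subject) pairs in arrival order; returns
    {child_id: parent_id}."""
    def normalize(subject):
        s = subject.lower().strip()
        while s.startswith(("re:", "fw:", "fwd:")):
            s = s.split(":", 1)[1].strip()
        return s

    parents = {}
    seen = []  # (msg_id, normalized subject), in arrival order
    for msg_id, subject in messages:
        norm = normalize(subject)
        for prev_id, prev_norm in reversed(seen):
            sim = difflib.SequenceMatcher(None, norm, prev_norm).ratio()
            if sim >= threshold:
                parents[msg_id] = prev_id
                break
        seen.append((msg_id, norm))
    return parents

msgs = [("m1", "Budget review"), ("m2", "Re: Budget review"),
        ("m3", "Lunch?"), ("m4", "RE: Re: Budget review")]
links = reassemble_thread(msgs)
```

A production version would combine subject similarity with quoted-body overlap and sender/recipient cues, as the paper's evaluation on the Enron corpus suggests.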
Fast Uncertainty Sampling for Labeling Large E-mail Corpora
Richard Segal, Ted Markowitz, William Arnold
One of the biggest challenges in building effective anti-spam filters is designing systems to defend against the ever-evolving bag of tricks spammers use to defeat them. Because of this, spam filters that work well today may not work well tomorrow. This adversarial nature of the spam problem makes large, up-to-date, and diverse e-mail corpora critical for the development and evaluation of new anti-spam filtering technologies. Gathering large collections of messages can be quite easy. The challenge is not in collecting enough mail, but in accurately labeling the hundreds of thousands or millions of messages as spam or non-spam. Uncertainty Sampling is a well-known active-learning algorithm which uses a machine learning model to minimize the human effort required to label large datasets. While Uncertainty Sampling is very effective, it is also computationally expensive. We propose a new algorithm, Approximate Uncertainty Sampling, which is nearly as effective as conventional Uncertainty Sampling, but has substantially lower computational complexity. The reduced computational costs allow Approximate Uncertainty Sampling to be applied to larger datasets and also make it possible to update the learned model more frequently. Approximate Uncertainty Sampling enables the building of larger, more topical, and more realistic example e-mail corpora for evaluating new anti-spam filters.
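Conventional Uncertainty Sampling, the baseline the paper improves on, can be sketched in a few lines: score the unlabeled pool with the current model and hand the least-certain messages to a human labeler. The scoring function and pool below are toy assumptions; the paper's Approximate variant, which avoids rescoring the whole pool each round, is not shown here.

```python
def uncertainty_sample(unlabeled, score_fn, batch_size):
    """Conventional uncertainty sampling: rank every unlabeled message
    by how close its predicted spam probability is to the 0.5 decision
    boundary, and return the `batch_size` most uncertain messages for
    human labeling."""
    ranked = sorted(unlabeled, key=lambda m: abs(score_fn(m) - 0.5))
    return ranked[:batch_size]

# Toy pool: pretend the model's spam probabilities are precomputed.
pool = {"msg_a": 0.98, "msg_b": 0.52, "msg_c": 0.03, "msg_d": 0.45}
to_label = uncertainty_sample(list(pool), lambda m: pool[m], batch_size=2)
```

The full-pool `sorted` call is exactly the cost that grows with corpus size, which motivates the approximate variant the abstract proposes.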
Introducing the Webb Spam Corpus: Using Email Spam to Identify Web Spam Automatically
Steve Webb, James Caverlee, Calton Pu
Web spam has emerged as one of the most significant problems currently facing search engines and Web users. However, despite the severity of this growing epidemic, research progress has been surprisingly limited. In this paper, we argue that this slow research progress is due to the lack of a large-scale, publicly available Web spam corpus. To alleviate this situation, we offer a novel method of automatically collecting Web spam examples that leverages the presence of URLs in email spam messages. Specifically, we extracted almost 1.2 million unique URLs from a collection of over 1.4 million email spam messages. Then, we obtained and processed the Web pages corresponding to those URLs. Using our new technique, we created the Webb Spam Corpus - a first-of-its-kind large-scale and publicly available Web spam data set. This corpus consists of nearly 400,000 Web spam pages, making it more than two orders of magnitude larger than any other previously cited (private) Web spam data set. We describe the pages in this corpus, and we also provide several sample application areas where we believe the Webb Spam Corpus will be especially beneficial.
Learning at Low False Positive Rates
Wen-tau Yih, Joshua Goodman, Geoff Hulten
Most spam filters are configured for use at a very low false positive rate. Typically, the filters are trained with techniques that optimize accuracy or entropy, rather than performance in this configuration. We describe two different techniques for optimizing for the low false positive region. One method weights good data more than spam. The other method uses a two-stage technique of first finding data in the low false positive region, and then learning using this subset. We show that with two different learning algorithms, logistic regression and Naive Bayes, we achieve substantial improvements, reducing missed spam by as much as 20% relative for logistic regression and 40% for Naive Bayes at the same low false positive rate.
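The first method, weighting good data more than spam, can be sketched as logistic regression whose gradient updates multiply ham errors by a larger weight. The weight value, feature encoding, and training data below are illustrative assumptions, not the paper's settings.

```python
import math

def train_weighted_lr(examples, ham_weight=10.0, epochs=50, lr=0.1):
    """Logistic-regression training where ham (non-spam) errors are
    weighted more heavily than spam errors, pushing the learned model
    toward the low-false-positive operating region.  `examples` is a
    list of (feature_dict, label) pairs with label 1 = spam, 0 = ham."""
    weights = {}
    for _ in range(epochs):
        for features, label in examples:
            z = sum(weights.get(f, 0.0) * v for f, v in features.items())
            p = 1.0 / (1.0 + math.exp(-z))
            w = 1.0 if label == 1 else ham_weight  # ham mistakes cost more
            grad = w * (label - p)
            for f, v in features.items():
                weights[f] = weights.get(f, 0.0) + lr * grad * v
    return weights

data = [({"viagra": 1.0}, 1),
        ({"meeting": 1.0}, 0),
        ({"viagra": 1.0, "meeting": 1.0}, 1)]
model = train_weighted_lr(data)
```

Because misclassifying ham incurs ten times the gradient penalty, ambiguous features are pulled toward the ham side, trading missed spam for fewer false positives.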
Modeling Identity in Archival Collections of Email: A Preliminary Study
Tamer Elsayed, Douglas Oard
Access to historically significant email archives poses challenges that arise less often in personal collections. Most notably, searchers may need help making sense of the identities, roles, and relationships of individuals that participated in archived email exchanges. This paper describes an exploratory study of identity resolution in the public subset of the Enron collection. Address-name and address-address associations in explicit, embedded and implied email headers are augmented with name and nickname associations discovered from consistent use in salutations and signatures. Limited transitive closure heuristics are employed to extend pairwise associations to richer representations of identity. Assessment of sampled results indicates that many potentially useful nontrivial associations can be detected.
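The transitive-closure step, extending pairwise associations into identity clusters, resembles a union-find over the evidence pairs. This is a hedged sketch only; the example addresses and names are hypothetical, and the paper's heuristics are deliberately more limited than a full closure.

```python
def merge_identities(pairs):
    """Union-find over pairwise associations (address-address,
    address-name, name-nickname) to build identity clusters.
    Returns a list of sets, one per resolved identity."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for a, b in pairs:
        union(a, b)
    clusters = {}
    for x in parent:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())

# Hypothetical evidence pairs mined from headers and salutations.
evidence = [("jeff.skilling@enron.com", "Jeff Skilling"),
            ("Jeff Skilling", "jeff_skilling@enron.com"),
            ("Jeff Skilling", "Jeff")]
identities = merge_identities(evidence)
```

An unrestricted closure like this can over-merge when a nickname is shared by two people, which is presumably why the paper limits its transitive-closure heuristics.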
Observed Trends in Spam Construction Techniques: A Case Study of Spam Evolution
Calton Pu, Steve Webb
We collected monthly data from SpamArchive over a three-year period (from January 2003 through December 2005), accumulating more than 1.4M messages. Then, we conducted an evolutionary study by running 495 spamicity tests from SpamAssassin on each month's messages. The population of messages testing positive for each spamicity test indicates the adoption of the spam construction technique associated with that spamicity test. This paper focuses on two evolutionary trends in our population study: extinction, where the population dwindles to zero or near zero, and co-existence, where the population maintains a consistent level or even grows, despite attempts by spamicity tests to eliminate it. We divide the factors that lead to extinction or co-existence into three groups: environmental changes, individual filtering, and collaborative filtering. We observed evidence of extinction (e.g., HTML-based obfuscation techniques), and somewhat unexpectedly, we observed evidence of co-existence between spam messages containing construction techniques and spamicity tests in filters (e.g., illegal characters in the ``Subject'' header and block list collaborative filtering).
Online Discriminative Spam Filter Training
Joshua Goodman, Wen-tau Yih
We describe a very simple technique for discriminatively training a spam filter. Our results on the TREC Enron spam corpus would have been the best for the Ham at .1% measure, and second best by the 1-ROCA measure. For the Mr. X corpus, our 1-ROCA measure was a close second best, and third best by the Ham at .1% measure. We use a very simple feature extractor (all words in the subject and headers). Our learning algorithm is also very simple: gradient descent of a logistic regression model.
Sender Reputation in a Large Webmail Service
In this paper, we describe how a large webmail service uses reputation to classify authenticated sending domains as either spammy or not spammy. Both SPF and DomainKey authentication are used to identify who the sender of the mail is. We describe a simple formula for how we calculate the reputation and how it is used to classify incoming mail. We show in general how domains, both good and bad, get distributed among the various reputation values, and also show the reputation values for a few specific domains. We describe some of the problems associated with this reputation system, and propose some recommendations for senders to avoid those problems.
"Sorry, I Forgot the Attachment": Email Attachment Prediction
Mark Dredze, John Blitzer, Fernando Pereira
Everyone knows the missing attachment problem: a single missing attachment generates a wave of emails from all the recipients notifying the sender of the error. We present an attachment prediction system to aid email users in attachment management. We present a method by which an intelligent system can inform the user when an outgoing email is missing an attachment. Additionally, the system could activate an attachment recommendation system, whereby suggested attachments are offered once the system determines the user is likely to include an attachment. We present promising initial results and discuss implications of our work.
Spam Filtering with Naive Bayes – Which Naive Bayes?
Vangelis Metsis, Ion Androutsopoulos, Georgios Paliouras
Naive Bayes is very popular in commercial and open-source anti-spam e-mail filters. There are, however, several forms of Naive Bayes, something the anti-spam literature does not always acknowledge. We discuss five different versions of Naive Bayes, and compare them on six new, non-encoded datasets that contain ham messages of particular Enron users and fresh spam messages. The new datasets, which we make publicly available, are more realistic than previous comparable benchmarks, because they maintain the temporal order of the messages in the two categories, and they emulate the varying proportion of spam and ham messages that users receive over time. Taking into account the incremental training of personal spam filters, we plot ROC curves, which allow us to compare the different versions of Naive Bayes over the entire tradeoff between true positives and true negatives. Our experiments confirm the superiority of the multinomial event model, compared to the multi-variate Bernoulli one, and previous observations that the multinomial model surprisingly performs even better when term frequencies are removed. We also show that the multi-variate model can, in some cases, outperform the multinomial model, when its Boolean attributes are replaced by continuous ones, modelled by mixtures of normal distributions.
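The multinomial event model the abstract refers to can be sketched as follows; the training data is a toy assumption. Passing each document's tokens deduplicated (e.g. `set(tokens)`) gives the "term frequencies removed" variant the paper finds surprisingly effective.

```python
import math
from collections import Counter

def train_multinomial_nb(docs):
    """Multinomial Naive Bayes for spam filtering.  `docs` is a list
    of (tokens, label) pairs; returns per-class log priors and
    Laplace-smoothed per-token log probabilities."""
    class_docs = Counter()
    token_counts = {}          # label -> Counter of tokens
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        token_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    total_docs = sum(class_docs.values())
    model = {}
    for label, counts in token_counts.items():
        total = sum(counts.values())
        model[label] = {
            "prior": math.log(class_docs[label] / total_docs),
            "logp": {t: math.log((counts[t] + 1) / (total + len(vocab)))
                     for t in vocab},
        }
    return model

def classify(model, tokens):
    def score(label):
        m = model[label]
        return m["prior"] + sum(m["logp"].get(t, 0.0) for t in tokens)
    return max(model, key=score)

train = [("buy cheap pills now".split(), "spam"),
         ("cheap pills cheap pills".split(), "spam"),
         ("meeting agenda attached".split(), "ham"),
         ("lunch meeting tomorrow".split(), "ham")]
model = train_multinomial_nb(train)
label = classify(model, "cheap pills".split())
```

The multi-variate Bernoulli variant the paper compares against would instead score every vocabulary word per message (present or absent), not just the words that occur.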
Spamalot: A Toolkit for Consuming Spammers’ Resources
Peter Nelson, Kenneth Dallmeyer, Lukasz Szybalski, Tom Palarz, Michael Wieher
The Spamalot system uses intelligent agents to interact with spam messages and systems referenced in spam. The goal of Spamalot is to consume spam senders’ resources by engaging the spammer in an unproductive conversation or information exchange. To date two Spamalot agents have been implemented: Arthur which handles Nigerian spam and Patsy which processes spam requesting information via web forms.
Teaching Spam and Spyware at the University of C@1g4ry
The University of Calgary is attacking the problems of spam and spyware from the angle of education. "Spam and Spyware" is a computer science course, offered at both the undergraduate and graduate levels, that objectively examines the legal, ethical, and technical aspects of spam and spyware and their countermeasures. We believe that this is the only course of its kind in the world. Furthermore, students are given hands-on experience in a secure laboratory, developing software for spamming, spyware, and defenses against them. This paper documents our course and its rationale.
The Effects of Anti-Spam Methods on Spam Mail
Eilon Solan, Eran Reshef
We provide a model to study the effect of three methods of fighting spam mail. The methods are (1) increasing the cost of mailing messages, (2) filters, and (3) a do-not-spam registry. We study how these three technologies affect the number of spam messages that users receive, and the efficiency of the internet (measured by the total number of spam messages spammers send).
Using E-Mail Social Network Analysis for Detecting Unauthorized Accounts
Adam O'Donnell, Walter Mankowski, Jeff Abrahamson
In this paper we detail the use of e-mail social network analysis for the detection of security policy violations on computer systems. We begin by constructing the social networks of three organizations by analyzing e-mail server logs collected over several months. We then encode the standard usage policies associated with today's e-mail systems, such as communicating with those in the same department, using a simple language. The policies are used to detect outliers in the collected social network graph, and violators of the policy are nodes which are disconnected from the main graph component. After closer examination of the outlier accounts, we find that a significant fraction of the suspect accounts were supposed to have been terminated long ago for a variety of reasons. Through the analysis and experiments presented in the paper, we conclude that the analysis of social networks extracted from network logs can be useful in a variety of traditionally hard-to-solve security problems, such as detecting insider threats.
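The core detection step, flagging accounts disconnected from the main graph component, can be sketched with a connected-components search. The account names and mail log below are hypothetical, and this omits the paper's policy language.

```python
def policy_outliers(edges, accounts):
    """Find accounts disconnected from the main component of the
    e-mail social network.  `edges` are (sender, recipient) pairs from
    mail logs; `accounts` lists all known accounts (so accounts that
    never send or receive mail also appear as outliers)."""
    adjacency = {a: set() for a in accounts}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    components, seen = [], set()
    for node in adjacency:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adjacency[n] - comp)
        seen |= comp
        components.append(comp)
    main = max(components, key=len)
    return [c for c in components if c is not main]

mail = [("alice", "bob"), ("bob", "carol"), ("carol", "alice"),
        ("ghost1", "ghost2")]  # ghost* never talk to the main graph
outliers = policy_outliers(
    mail, ["alice", "bob", "carol", "ghost1", "ghost2", "lonely"])
```

Accounts surfacing in `outliers`, like the `ghost*` pair above, are exactly the candidates the paper found often should have been terminated long ago.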
Using Early Results from the 'spamHINTS' Project to Estimate an ISP Abuse Team's Task
ISPs operate "abuse" teams to deal with reports of inappropriate email being sent by their customers. Currently, the majority of their work is dealing with insecure systems that have become infected with viruses or that have been hijacked by the senders of email `spam'. This paper examines the performance of an abuse team at a large UK ISP over the past few years, and shows that email log processing tools have provided significant improvements in their efficiency. A new email measurement system called spamHINTS, using sampled sFlow packet header data from a major Internet exchange, is currently under development. Early results from this monitoring suggest that the ISP abuse team will need to step up their activity by an order of magnitude to get on top of their problem.