Predicting Email Recipients in the Enron Corpus  

Carnegie Mellon University
11-742: Information Retrieval Lab
Student: Vitor R. Carvalho
Fall 2006

Abstract

Email is the most popular communication tool on the web. To improve the way we handle email messages, machine learning techniques have been proposed in different areas, from adaptive spam filtering to automated message foldering (i.e., predicting the correct folder in which to store a message). One idea recently proposed in the area of email automation is to predict the recipients of an already composed message, a problem also referred to as the CC Prediction problem. The goal is to predict the most likely recipients for a message, given its current text and the email addresses already specified. If successfully automated, this idea can be a valuable addition to email clients, particularly in large corporations. It can prevent a user from forgetting to add an important collaborator or manager as a recipient. It can also be used to identify people in an organization who are working on a similar topic or project, or to find people with appropriate expertise or skills. In this work we investigate this problem using textual content and social network features from a large real-world collection of email messages, the Enron Email corpus [1]. Using a classification-based reranking scheme to combine the two types of features, results indicate that we can correctly predict CC-recipients in more than 56% of the test cases.

Timeline

Task                                                    Done by    Status
Proposal and work plan                                  Oct. 02    Complete
Literature review of Related Work                       Oct. 07    Complete
Parsing of Enron Data                                   Oct. 15    Complete
Build text-based model and evaluation tools             Oct. 31    Complete
Preliminary results: evaluation of text-based model     Oct. 17    Complete
Integrating non-textual features                        Nov. 21    Complete
Final results and analysis                              Nov. 30    Complete
Presentation                                            Dec. 18    Complete
Documentation                                           Dec. 20    Complete

Introduction, Related Work and Proposal

Machine learning techniques have recently been applied successfully to email communication. Some of the best-known applications are adaptive spam filtering [3], email foldering [6], email filtering [7], automatic integration with address books [10] and automatic integration with to-do lists [11]. In this work we focus on another application to improve email clients: automatically predicting email recipients, a.k.a. the CC Prediction problem. The CC Prediction problem is the task of predicting the recipients (TO, CC or BCC) of an already composed email message. The prediction is typically based on the textual contents of the message, as well as on the presence of other email addresses as recipients [2].

The CC Prediction problem is closely related to the Expert Finding task in email. Expert finding in email is defined as the task of finding expertise using only the email messages exchanged inside an organization, as described by Dom et al. [4], Campbell et al. [5] and Balog & de Rijke [13]. This area of research has received considerable attention from the TREC community, and an expert-finding task has been run under the TREC Enterprise track [12] since 2005.

Another important area of related work is collaborative filtering, or recommender systems [8, 9]. The CC Prediction problem can be seen as a special type of recommendation system in which email recipients are recommended based on the message currently being composed and on the messages previously exchanged among users.

The CC Prediction problem was initially described by Pal & McCallum [2], who used Naive Bayes and Factor Graph models to predict recipients in a personal collection of emails. In this work we intend to extend Pal & McCallum's ideas in three ways: we implement different baseline algorithms for the problem, we test the algorithms on a considerably larger real data collection (the Enron Email Corpus [1]), and we evaluate performance using a more comprehensive set of metrics. The three extensions are described in more detail below.

There are a number of different algorithms to address the CC Prediction problem. Some of the baselines we plan to add are: (1) a random baseline, (2) a simple TF-IDF cosine similarity between users' communications, (3) a learning algorithm score (for instance, KNN or Perceptron) between the current message and previous messages and (4) a reranking-based method using (2) or (3) as the base score in addition to non-textual features such as the relative frequency of sent messages, received messages, etc.
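As an illustration of baseline (2), the following is a minimal sketch and not the implementation used in this project. It assumes that each candidate recipient is represented by one "profile" built from all messages previously sent to that address, and that a new message is scored against each profile by TF-IDF cosine similarity. The function name rank_recipients, the smoothed IDF formula and the example addresses are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately naive assumption)."""
    return text.lower().split()


def rank_recipients(sent_messages, query_text):
    """Rank candidate recipients for query_text.

    sent_messages: list of (text, [recipient, ...]) pairs from one user's
    sent mail. Returns recipient addresses, best match first.
    """
    # One profile per recipient: all tokens of messages sent to them.
    profiles = defaultdict(list)
    for text, recipients in sent_messages:
        for recipient in recipients:
            profiles[recipient].extend(tokenize(text))

    # Document frequency over recipient profiles, then a smoothed IDF.
    n_profiles = len(profiles)
    df = Counter()
    for tokens in profiles.values():
        df.update(set(tokens))
    idf = {t: math.log((1 + n_profiles) / (1 + d)) + 1.0 for t, d in df.items()}

    def vectorize(tokens):
        # Terms never seen in any profile are ignored.
        tf = Counter(tokens)
        return {t: count * idf[t] for t, count in tf.items() if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query = vectorize(tokenize(query_text))
    scores = {r: cosine(query, vectorize(tokens)) for r, tokens in profiles.items()}
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda r: (-scores[r], r))
```

Called with a user's sent mail and the text of a new message, rank_recipients returns every previously seen address ordered by similarity; such a score could also serve as the base score for the reranking method in (4).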

The main motivation for attempting this problem on the Enron dataset is that, since email is noisy and email usage varies considerably among users, it is important to study the problem on a larger collection of data. In other words, we expect performance on the CC Prediction task to vary considerably among different Enron users. Not only do different users store different numbers of messages, but they also use email in different ways (for instance, secretaries typically exchange email messages with more people than those in technical positions do).

In order to obtain labeled data, a straightforward procedure is to remove one or more of the "true" CC-recipients from the original message, and then treat the removed address as the label to be predicted. The algorithms will output a ranked list of the most likely recipients for each message in a test collection. To evaluate performance, we intend to use three different methods: average precision at rank 1, average rank and precision-recall curves. In combination, these methods should reveal more about the nature of the task.

Dataset, Parsing, Methods, Results and Analysis can be found in the final report.

Final Report

The final report can be found here.

References

[1] Introducing the Enron Corpus, Bryan Klimt & Yiming Yang, CEAS 2004

[2] CC Prediction with Graphical Models, Chris Pal & Andrew McCallum, CEAS 2006

[3] On-line Supervised Spam Filter Evaluation, G. Cormack & T. Lynam, ACM Transactions on Information Systems, to appear, 2006.

[4] Graph-Based Ranking Algorithms for E-mail Expertise Analysis, Byron Dom, Iris Eiron, Alex Cozzi & Yi Zhang, Data Mining and Knowledge Discovery Workshop (DMKD 2003) in ACM SIGMOD 2003

[5] Expertise identification using email communications, Christopher S. Campbell, Paul P. Maglio, Alex Cozzi & Byron Dom, CIKM 2003

[6] The Enron Corpus: A New Dataset for Email Classification Research, B. Klimt & Y. Yang, ECML 2004

[7] iFile: An Application of Machine Learning to E-Mail Filtering, J. Rennie, In Proc. KDD00 Workshop on Text Mining, Boston, 2000.

[8] Empirical Analysis of Predictive Algorithms for Collaborative Filtering, J. S. Breese, D. Heckerman and C. Kadie, UAI, 1998.

[9] A C/Matlab Toolkit for collaborative filtering, http://www-2.cs.cmu.edu/~lebanon/IR-lab.htm, 2003.

[10] Extracting social networks and contact information from email and the Web, Aron Culotta, Ron Bekkerman & Andrew McCallum, CEAS 2004.

[11] Detecting Action-Items in Email, Paul N. Bennett and Jaime Carbonell, SIGIR 2005.

[12] Enterprise track, 2005. URL: http://www.ins.cwi.nl/projects/trec-ent/wiki/.

[13] Finding Experts and their Details in E-mail Corpora, K. Balog and M. de Rijke. In: 15th International World Wide Web Conference (WWW2006), 2006.