Predicting Email Recipients in the Enron Corpus
Carnegie Mellon University
11-742: Information Retrieval Lab
Student:
Vitor R. Carvalho
Fall 2006
Abstract
Email is the most popular communication tool on the web. To
improve the way we handle email messages, machine learning techniques
have been proposed in different areas, from adaptive spam filtering to
automated message foldering (i.e., predicting the correct folder to
store a message). One of the ideas recently proposed in the area of
email automation is to predict the recipients of an already
composed message, a problem also referred to as the CC Prediction problem.
The goal of this problem is to predict the most likely recipients for a
message, given its current text and given the email addresses already
specified. If successfully automated, this idea can be a valuable
addition to email clients, particularly in large corporations. It
can prevent a user from forgetting to add an important collaborator or
manager as a recipient. It can also be used to identify people in an
organization who are working on a similar topic or project, or to find
people with appropriate expertise or skills. In this work we
investigate this problem using textual content and social network
features from a large real-world collection of email
messages, the Enron Email corpus [1]. Using a classification-based
reranking scheme to combine the two types of features, our
results indicate that we can correctly predict CC-recipients in more
than 56% of the test cases.
Timeline
| Task | Done by | Status |
| Proposal and work plan | Oct. 02 | Complete |
| Literature review of Related Work | Oct. 07 | Complete |
| Parsing of Enron Data | Oct. 15 | Complete |
| Build text-based model and evaluation tools | Oct. 31 | Complete |
| Preliminary results: evaluation of text-based model | Oct. 17 | Complete |
| Integrating non-textual features | Nov. 21 | Complete |
| Final results and analysis | Nov. 30 | Complete |
| Presentation | Dec. 18 | Complete |
| Documentation | Dec. 20 | Complete |
Introduction, Related Work and Proposal
Machine learning techniques have recently been applied successfully to
email communication. Some of the best-known applications are
adaptive spam filtering [3], email foldering [6], email filtering [7],
automatic integration with address books [10] and automatic integration
with to-do lists [11]. In this work we focus on another application to
improve email clients: automatically predicting email recipients,
a.k.a. the CC Prediction problem. The CC Prediction problem is the task
of predicting the recipients (TO, CC or BCC) of an email message
that has already been composed. The prediction is typically based on the
textual contents of the message, as well as on the presence of other
email addresses as recipients [2].
The CC Prediction problem is closely related to the Expert Finding
task in email. Expert finding in email is defined as the task of
finding expertise using only the email messages exchanged inside an
organization, as described by Dom et al. [4], Campbell et al. [5]
and Balog & de Rijke [13]. This area of research has received
considerable attention from the TREC community, and an expert-finding
task has been run under the TREC Enterprise track [12] since 2005.
Another important area of work related to this problem is collaborative
filtering, or recommender systems [8, 9]. The CC Prediction problem can
be seen as a special type of recommender system where email recipients are
recommended based on the current message being composed and the
previous messages exchanged among users.
The CC Prediction problem was initially described by Pal & McCallum
[2], where Naive Bayes and Factor Graph models were used to predict
recipients in a personal collection of emails. In this work we intend
to extend Pal & McCallum's ideas in three ways: we implement different
baseline algorithms for the problem, we test the algorithms on a
considerably larger real data collection (the Enron Email Corpus [1]),
and we evaluate performance using a more comprehensive set of metrics.
The three extensions are described in more detail below.
A number of different algorithms can address the CC Prediction
problem. Some of the baselines we plan to add are: (1) a random
baseline, (2) a simple TF-IDF cosine similarity between users'
communications, (3) a learning algorithm score (for instance, KNN or
Perceptron) between the current message and previous messages and (4) a
reranking-based method using (2) or (3) as the base score in addition to
non-textual features such as the relative frequency of sent messages,
received messages, etc.
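As an illustration of baseline (2), the sketch below ranks candidate recipients by the TF-IDF cosine similarity between a new message and a per-recipient term profile. It is a minimal sketch under stated assumptions, not the implementation used in this project: the `(text, recipients)` message layout, the whitespace tokenizer, the idea of building one profile per recipient, the smoothed IDF formula and the example addresses are all hypothetical choices made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (assumption: no header or punctuation handling)."""
    return text.lower().split()


def build_recipient_profiles(messages):
    """Accumulate term counts over all messages addressed to each recipient.

    `messages` is a list of (text, recipients) pairs; this layout is hypothetical.
    """
    profiles = defaultdict(Counter)
    for text, recipients in messages:
        tokens = tokenize(text)
        for recipient in recipients:
            profiles[recipient].update(tokens)
    return profiles


def idf_weights(profiles):
    """Smoothed inverse document frequency, treating each profile as one document."""
    n = len(profiles)
    df = Counter()
    for counts in profiles.values():
        df.update(counts.keys())
    return {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}


def tfidf_vector(counts, idf):
    # Terms never seen in training have no IDF weight and are ignored.
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_recipients(message_text, profiles, idf, already_specified=()):
    """Return candidates sorted by decreasing similarity, skipping known recipients."""
    query = tfidf_vector(Counter(tokenize(message_text)), idf)
    scores = {
        recipient: cosine(query, tfidf_vector(counts, idf))
        for recipient, counts in profiles.items()
        if recipient not in already_specified
    }
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(scores, key=lambda r: (-scores[r], r))


# Toy training data with made-up addresses.
training = [
    ("gas pipeline contract terms", ["alice@example.com", "bob@example.com"]),
    ("pipeline maintenance schedule", ["bob@example.com"]),
    ("quarterly earnings report draft", ["carol@example.com"]),
]
profiles = build_recipient_profiles(training)
idf = idf_weights(profiles)
ranking = rank_recipients("updated pipeline schedule", profiles, idf)
```

The `already_specified` argument reflects the problem setting, where addresses already on the message are known and should not be suggested again. A reranking method such as (4) could take the cosine score computed here as one input feature alongside non-textual ones.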
The main motivation for attempting this problem on the Enron dataset is
that, since email is noisy and email usage varies considerably
among users, it is important to study this problem on a larger
collection of data. In other words, we expect that the CC Prediction
task will vary considerably among different Enron users. Not only
do different users store different numbers of messages, but they also
use email in different ways (for instance, secretaries typically
exchange email messages with more people than those in technical positions do).
In order to obtain labeled data, a straightforward procedure is to
remove one or more of the "true" CC-recipients from the original message,
and then consider each removed address as a label to be predicted. The
algorithms will output a ranked list of the most likely
recipients for each message in a test collection. To evaluate
performance, we intend to use three different methods: average
precision at rank 1, average rank and precision-recall curves.
In combination, these methods should reveal more about the nature of the task.
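The labeling procedure and two of the three evaluation measures can be sketched as follows. This is a hedged illustration rather than the project's evaluation tool: the function names, the `(text, recipients)` message layout, the choice to hold out exactly one recipient at a time, and the convention of assigning a rank one past the end of the list to a label that was never ranked are all assumptions introduced here. Precision-recall curves are left out for brevity.

```python
def leave_one_out_examples(messages):
    """Hold out each recipient in turn and treat it as the label to predict.

    `messages` is a list of (text, recipients) pairs (hypothetical layout).
    Yields (text, remaining_recipients, held_out_recipient) triples.
    """
    for text, recipients in messages:
        for held_out in recipients:
            remaining = [r for r in recipients if r != held_out]
            yield text, remaining, held_out


def rank_of(label, ranking):
    """1-based rank of `label`; one past the end if it was never ranked (assumption)."""
    if label in ranking:
        return ranking.index(label) + 1
    return len(ranking) + 1


def evaluate(results):
    """Compute precision at rank 1 and average rank.

    `results` is a list of (true_label, ranked_candidates) pairs.
    """
    ranks = [rank_of(label, ranking) for label, ranking in results]
    return {
        "precision_at_1": sum(1 for r in ranks if r == 1) / len(ranks),
        "average_rank": sum(ranks) / len(ranks),
    }


# Toy data with made-up addresses.
messages = [("budget review", ["ann@example.com", "raj@example.com"])]
examples = list(leave_one_out_examples(messages))

results = [
    ("ann@example.com", ["ann@example.com", "raj@example.com", "li@example.com"]),
    ("raj@example.com", ["li@example.com", "raj@example.com", "ann@example.com"]),
    ("li@example.com", ["ann@example.com", "raj@example.com"]),
]
metrics = evaluate(results)
```

In this toy run the true labels land at ranks 1, 2 and 3, so one of three test cases is correct at rank 1 and the average rank is 2. Lower average rank and higher precision at rank 1 both indicate a better ranking.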
Dataset, Parsing, Methods, Results and Analysis can be found in the final report.
Final Report
The final report can be found here.
References
[1] Introducing the Enron Corpus, Bryan Klimt & Yiming Yang, CEAS 2004.
[2] CC Prediction with Graphical Models, Chris Pal & Andrew McCallum, CEAS 2006.
[3] On-line Supervised Spam Filter Evaluation, G. Cormack & T. Lynam, ACM Transactions on Information Systems, to appear, 2006.
[4] Graph-Based Ranking Algorithms for E-mail Expertise Analysis, Byron Dom, Iris Eiron, Alex Cozzi & Yi Zhang, Data Mining and Knowledge Discovery Workshop (DMKD2003) in ACM SIGMOD 2003.
[5] Expertise identification using email communications, Christopher S. Campbell, Paul P. Maglio, Alex Cozzi & Byron Dom, CIKM 2003.
[6] The Enron Corpus: A New Dataset for Email Classification Research, B. Klimt & Y. Yang, ECML 2004.
[7] iFile: An Application of Machine Learning to E-Mail Filtering, J. Rennie, In Proc. KDD00 Workshop on Text Mining, Boston, 2000.
[8] Empirical Analysis of Predictive Algorithms for Collaborative Filtering, J. S. Breese, D. Heckerman & C. Kadie, UAI 1998.
[9] A C/Matlab Toolkit for collaborative filtering, http://www-2.cs.cmu.edu/~lebanon/IR-lab.htm, 2003.
[10] Extracting social networks and contact information from email and the Web, Aron Culotta, Ron Bekkerman & Andrew McCallum, CEAS 2004.
[11] Detecting Action-Items in Email, Paul N. Bennett & Jaime Carbonell, SIGIR 2005.
[12] Enterprise track, 2005. URL: http://www.ins.cwi.nl/projects/trec-ent/wiki/.
[13] Finding Experts and their Details in E-mail Corpora, K. Balog & M. de Rijke, In: 15th International World Wide Web Conference (WWW2006), 2006.