Talk Schedule Question Answering from Personal Email

IR Lab Project 2005
Bryan Klimt

Abstract

For my Information Retrieval Lab, I have designed and implemented an end-to-end system for answering questions related to scheduling information, particularly for information related to where and when academic talks and seminars will be held in a college environment. While Question Answering (QA) has been studied extensively, its use so far has been using large general purpose text corpora to find textbook-like answers. This paper will explore how adaptable the current techniques are to a personal email dataset, for finding answers specific to a particular user. In order to evaluate this system, I have labelled the information related to seminars in a large dataset of my own email (4799 messages), and measured both the quality of the system's answers and the performance of the system's individual modules. For many of the systems' subcomponents, I have evaluated several approaches that could be used (such as statistical vs. NLP), and have provided some observations as to which are more appropriate in certain situations.

My academic goal for this project is to learn more about Question Answering and Information Extraction. I also hope that what I learn about personal schedule information extraction is useful to the overall Radar project.

(Note: There are also presentation slides [672K PPT] which accompany this paper.)

Introduction

Question Answering (QA) is a model for information retrieval whereby the user asks a question in natural language, and the system consults a large corpus in order to provide an answer . Modern QA systems, such as CMU's JAVALIN and IBM's PIQUANT , employ several techniques through modular frameworks in order to find the most suitable answer to each question. These techniques include traditional information retrieval, information extraction, natural language processing, and knowledge representation. Traditional information retrieval can be used to search through the corpus and find the most relevant snippets to a question. However, these systems often fail for QA when the answer is not given specifically in the text, or when the question does not form a good query (Examples of this will be given below.) Information Extraction can be used to improve the quality of IR systems for QA though. For instance, if named entities or numerical quantities have been extracted from a corpus, and we know the answer must belong to one of those categories, then this can be used to make the system more accurate. Also, IE techniques can be used to fill templates, from which answers can be found using special search techniques, such as in .

Natural language processing can be used in conjunction with these techniques, especially for question analysis. Often, a shallow parse of a question is performed in order to map the question into an answer-type hierarchy, so that the type of data being searched for can be inferred. Finally, knowledge representation can be used to "sanity check" the answer retrieved to make sure that it is reasonable. (Knowing, for example, that 0.63 is not reasonable for the population of New York City.) KR systems can also be used to bypass the normal IR pipeline, in the case where they already contain the correct answer.

System Architecture

The system developed for this project implements many of these approaches, in a modular framework somewhat similar to the systems listed above. The overall design on the system is given in figure 1, and details are given in later sections. In order to reduce the scope of this project to be reasonable for a semester course, I have only concentrated on the specific domain of answering questions related to academic lectures being given. Thus, the first stage in my system is to classify incoming emails as to whether or not they contain a lecture announcement or correction. If the email does not contain any such information, it is discarded. Otherwise it is sent down the pipeline.

Emails with seminar information are then sent to a module that performs information extraction routines on them in order to annotate relevant fields, such as speaker name, talk title, and location. Currently, the module implemented to perform this task uses a rule-based approach. However, statistical methods could be used, and both ideas are explored below. After the email is annotated, it is either loaded into an inverted index, or sent to a template-filler module to be indexed in a special way, or both, depending upon user preference. A more advanced system would use both modules, and choose the correct module on a question-by-question basis, using confidence metrics. However, that functionality has not yet been implemented, and the two pipelines are used for a side-by-side comparison.

When the user issues a question, one of the two answering pipelines is invoked. The NLP pipeline uses NLP techniques to parse the question and translate it to a structured query, which can be executed against the seminar database to retrieve the answer. The statistical pipeline uses simple heuristics to determine the correct answer type, and then uses traditional information retrieval techniques to determine the answer. All of these modules are described in greater detail below.

Dataset

One of the most difficult tasks for this project was creating a dataset to use in the development and evaluation of the system. To achieve this, I collected all of my emails from September 2003 to February 2005 (18 months), resulting in 4799 messages. I then hand labelled these emails as either lecture-related or not. In the end, 196 of the emails contained talk information. Unfortunately, due to legal and privacy concerns, I cannot make this corpus available for download.

After selecting these emails, a volunteer and I hand-annotated the corpus to create ground-truth data, using Minorthird tools. We also filled seminar templates by hand. My volunteer (who has no experience with QA or IR) then wrote 478 questions requesting information about upcoming lectures, and supplied the correct answer for each. Since these questions were written by someone without a knowledge of the easy or difficult parts of question answering, they should be a fair test of the system.

Classifier

In order to know from which emails we should extract data, we need to know which emails are talk announcements. For this, we need to apply a statistical email classifier. Many machine learning algorithms have been applied to the problem of email classification. However, our experience has been that a combination of logistic regression classifiers performs best for both foldering task and general topic classification, such as that performed in the Radar Project. To be sure, I tried several classifiers for this classification module, and again found the combination of logistic regression classifiers to be best. Thus, I have used the same classification framework described in , but with logistic regression used in place of SVM.

To evaluate the classifier, we simulate a stream of emails arriving at the user's computer, with the system classifying them, and then giving the user a chance to correct the system's prediction. This way, we get a good idea of how the system will perform in practice, even if it is initially given no training data.

Out of the 4799 emails, the classifier correctly identified 129 of the talk announcements, with 67 misses and 31 false positives. This means it had a precision of 0.81 and recall of 0.66, which is F1=0.72. While this would be good performance for many email classification tasks, it is surprisingly low for this particular task. For instance, found that a bag-of-words Naive Bayes classifier could classify lecture emails with a precision of 0.81 and a recall of 0.96, giving F1=0.88. Furthermore, emails announcing talks contain many words such as "lecture" and "seminar" that should be fairly indicative of the topic. Indeed, even an examination of the discriminative features of the emails does not give a clear picture of why performance is so low. According to chi-squared correlation, the most informative words in this dataset are as given in Table 1. These words are consistent with our idea of the content of talk announcements, and are surprisingly noise-free. However, the difference in performance with the previous research results could be easily explained by the different approaches to train/test splitting, and determining why this classification is so difficult will be an ongoing research question.

abstract, bio, speaker, copeta, multicast, esm, donut, talk, seminar, cmtv, broadcast,
speech, distinguish, ph, lectur, ieee, approach, translat, professor, award
Table 1: Informative Features

Annotators

To answer questions about upcoming talks, we need to know more than which emails discuss those talks; we need to know about the talks themselves. In other words, we need to extract specific pieces of information about each talk, such as speaker name, etc. This is called Information Extraction (IE). IE in formal documents, such as news articles, has been studied extrensively, and two main approaches have been taken. One is hand-writing rules to recognize; the other is using statistical labelling models, such as Hidden Markov Models or Conditional Random Fields. However, extraction from "informal" documents, such as email, has been studied in much less detail. Frequent misspellings, odd grammar, and inconsistent formatting make this task much more difficult.

Extraction of seminar information in particular from emails has only been studied thoroughly in one paper for a course project at Stanford . In that paper, they focused on extracting the date, time, location, title, speaker name, speaker affiliation, and type of each seminar, which will be the same fields used in this paper. They evaluated extracting these fields using both hand-written regular expressions, and using hidden Markov models (HMMs). They found that, in most cases, the hand-written rules were more accurate at recognizing these fields than HMMs. However, more recent works, such as and have shown that more modern sequential classification techniques, such as Conditional Random Fields (CRFs) can perform quite well at information extraction tasks similar to these. In this paper, I have implemented both hand-written rules and CRFs to compare their performance.

Rule-based Annotator

Writing rules to match annotations is a somewhat time-consuming task, but yields good performance because of the opportunity for adding domain knowledge to the system. For instance, since we know that September is a month, we know that "September" following a number generally forms a date. It might take a statistical system many training examples to learn this correlation, but we can easily write a rule to express the relationship. For example, the rule used in this project for matching dates of that form is:

    defSpanType date =: ... [ re('^\d\d?$') ai(dayEnd)? ai(month) ] ...;

In total, it took about 2 working days to write all of the rules used in this project for matching the fields listed above, which is a fairly small investment if the fields can be generalized for use in other domains. The rules were written in Mixup, an information extraction language that is part of CMU's MinorThird toolkit, which combines regular expressions with dictionary lookups. The full list of rules is available here: Annotator.mixup

The easiest fields to annotate were dates and times, which have very consistent format, and use a few small dictionaries (such as for days of the week and month names). These fields should generalize quite well to other domains and settings. Slightly more difficult were locations and names. Matching locations (such as room numbers) should generalize to other office domains, but requires a dictionary of building names specific to a certain setting. Fortunately, setting-specific dictionaries are often small and can often be obtained easily. In fact, the list of CMU's buildings is already encoded in Radar, making this task trivial. The most difficult fields to extract in this project were seminar titles and speaker affiliations, because they lack consistent formatting, and have few words in common.

Typerecallprecisionf1
date0.9970.9970.997
time1.0000.9920.996
location0.9871.0000.994
title0.3141.0000.478
series0.2610.9740.412
name0.2770.7590.406
affiliation0.0000.0000.000
Table 2: Rule-based Annotator Performance

Statistical Annotator

Based on the results in and , I decided to compare these results with those of Conditional Random Fields. The advantage of using CRFs over hand-written rules is obviously that writing rules is a tedious, time-consuming task. The disadvantage though, is that CRFs require labelled training data to learn from, and labelling training data is a tedious, time consuming task. Given the need for training data, the CRF module could not be evaluated over the whole corpus, as the rule-based module was. So, I used a 6-fold split, using each chunk 17% as testing data given a model training on the other 83%. The results are given in table 3.

Typerecallprecisionf1
location0.9390.9930.965
series0.8810.9600.919
time0.9910.8260.901
name0.7470.9350.830
title0.7240.9130.808
affiliation0.7310.8410.782
date0.9320.6340.757
Table 3: CRF Annotator Performance

had found that HMMs performed considerably worse than hand-written rules for every type of data except TIME. However, my results find them to perform quite well. This is most likely due to the fact that I use a much larger dataset, training on ~160 emails, rather than 75. Also, of course, CRFs are known to work better than HMMs, although the difference in scores here is surprising.

My conclusion here is that the best approach would be to use both methods. For dates, times, and locations, it is incredibly easy to write rules that perform nearly perfectly. The other fields, such as names and titles, should be left to machine learning, as the results with the hand-written rules are just unacceptable.

Hybrid Annotators

Another, even better approach would be that given in . We could first apply the rule-based annotator, and then use its labelling as features for the statistical learner, along with other email-specific features. This feature engineering would be a bit time-consuming, but in the end it would produce the best, most generalizable, results.

Template Filler

The next stage in the pipeline is the seminar template filler. This module uses a simple algorithm to take the information extracted in the previous modules and fill in a template for each talk being given. The algorithm just loops the unique items of each type and fills in a new template for each name or title. (To see the specific algorithm, look in SeminarExtractor.java.) In order to evaluate this module, I used the hand-labelled annotations, ran the template filler, and compared the output templates with those created by hand. The detailed results are available here. I measured the recall and precision of the module in two ways. The first row in table 4 gives the metrics when templates only need to be mostly correct to be given credit. The second row requires the entire record to be perfect to get credit.

Typerecallprecisionf1
partial0.9621.0000.981
full0.6371.0000.778
Table 4: Template Filler Performance

With 151 correctly filled templates, 77 partials, 9 misses, and 0 false alarms, the performance of the template filler is quite good, especially considering it was written in just a few hours. With these records filled, modules further down the pipeline can consult them without having to deal with the messy email text.

NL Question Analyzer

With the template filler, we have finished the task of indexing the data. We will know move to the other end of the pipeline, where the user enters his or her question. Once the question has been entered, it must be analyzed and the answer must be chosen. First, I will look at ways of using NLP to analyze the question and select an answer.

The NLP Question Analyzer works by performing a full parse of the question using a unification-based parser. The grammar took 1 day to write using Lisp for the Tomita parser, and can be downloaded here. The grammar was designed to account for terms not found in the lexicon using wildcards, and was carefully constructed such that the resulting f-structure is actually a structured query which can be used on the seminar template database. In essense, this module machine translates a subset of English to a structured query language (similar to SQL). For example, one of the questions, "In which room is the Tuomas Sandholm lecture?" produces the parse:

    ((FIELD LOCATION)
     (FILTER (*OR*
                   (*AND* (NAME "TUOMAS SANDHOLM"))
                   (*AND* (NAME "TUOMAS")(SERIES "SANDHOLM"))
                   (*AND* (SERIES "TUOMAS SANDHOLM")))))
			

NL/KB Answer Extractor

The NLP question analyzer cannot be easily evaluated by itself. However, since the NL Answer Extractor just takes the query produced by the question analyzer and runs it against the seminar database, an evaluation of the answers produced will give us a good idea of how well the question analyzer performed. The full results are available here, but the system had an accuracy of 0.870. In other words, it answered 87% of the questions correctly. It seems this is a high enough quality to be useful, especially when paired with an interface that allows the user to view the reason for the decision and judge the correctness for himself. Analysis of the answers reveals that most of the errors are due to incorrect parsing of questions. Of course, this could be fixed with more engineering, but it is evidence of the lack of generalization with such NLP-based systems. A few more questions were missed because of string mismatches in the structered queries. For instance, the question "Where is the lecture on dolphin language?" would produce the query ((FIELD LOCATION)(FILTER (TITLE "DOLPHIN LANGUAGE"))). Unfortunately, this query fails to find the correct talk, titled "NATURAL HISTORY AND COMMUNICATION OF FREE-RANGING ATLANTICSPOTTED DOLPHIN, STENELLA FRONTALIS, IN THE BAHAMAS".

Statistical Question Analyzer

The Statistical Answer Extractor pipeline works differently, of course, than the NLP pipeline by relying on traditional IR and open domain QA techniques, rather than on heavily-engineered parsing. First, the question is analyzed by an incredibly shallow parse in order to determine the type of answer needed. Then a search engine is used to find the a string of that type near the other words in the question.

The question analysis module for this part of the pipeline is so simple; all it does it look at the beginning of the question to decide the answer type. In fact, in just 15 lines of code, it had 100% recall and precision. Here is the actual code:

        if ( question.matches("^WHEN .*$") ) { return "DATETIME"; }
        if ( question.matches("^WHERE .*$") ) { return "LOCATION"; }
        if ( question.matches("^IS .*$") ) { return "EXIST"; }
        if ( question.matches("^ARE .*$") ) { return "EXIST"; }
        if ( question.matches("^WHO .*$") ) { return "NAME"; }
        if ( question.matches("^WHAT SERIES .*$") ) { return "SERIES"; }
        if ( question.matches("^WHAT TIME .*$") ) { return "TIME"; }
        if ( question.matches("^WHAT DAY .*$") ) { return "DATE"; }
        if ( question.matches("^WHAT DATE.*$") ) { return "DATE"; }
        if ( question.matches("^IN WHICH ROOM .*$") ) { return "LOCATION"; }
        if ( question.matches("^WHAT IS THE DATE .*$") ) { return "DATE"; }
        if ( question.matches("^WHAT IS THE TITLE .*$") ) { return "TITLE"; }
        if ( question.matches("^WHAT IS THE AFFILIATION .*$") ) { return "AFFILIATION"; }
        if ( question.matches("^WHAT IS THE TOPIC .*$") ) { return "TITLE"; }
        if ( question.matches("^WHAT IS .*'S AFFILIATION.*$") ) { return "AFFILIATION"; }
			

Statistical Answer Extractor

Next comes the more difficult part, the answer extraction. In order to perform this task, I adapted the TF-IDF based search engine written for the IR course. It uses term weights as given in equations 1, 2, and 3. below.

The question is supplied as a query, and documents are returned, along with positions of the matched terms, as shown below:

    0.20006221807435057	218.txt	
    	search[580:586]: tuomas
    	search[2671:2677]: tuomas
    	search[587:595]: sandholm
    	search[919:927]: sandholm
    	search[2678:2686]: sandholm
    	search[3482:3490]: sandholm
    0.20006221807435057	335.txt	
    	search[585:591]: tuomas
    	search[2676:2682]: tuomas
    	search[592:600]: sandholm
    	search[924:932]: sandholm
    	search[2683:2691]: sandholm
    	search[3487:3495]: sandholm
    0.1962166285278413	373.txt	
    	search[762:768]: tuomas
    	search[2913:2919]: tuomas
    	search[769:777]: sandholm
    	search[1118:1126]: sandholm
    	search[2920:2928]: sandholm
    	search[3737:3745]: sandholm
			

Then, the documents are scanned to find the annotation of the correct type that appears closest to the terms found by the search engine. These annotations are then returned to the user. The full evaluation of this algorithm is available here, but the system had an overall accuracy of 0.755. Analysis of this data revealed that the questions missed were different from those missed by the NLP approach. In particular, the search engine basically always answered "YES" to yes/no questions, because there was always some word that would match a word in the question. Also, questions with dates were difficult (although they were handled quite well by the NL answerer). This was due to the nature of bag-of-words searching. A search for "November 10" would rank "10 am September 10" higher than "Nov 10", thus finding the wrong email.

On the other hand, the title questions which were difficult for the other answer extractor were handled well by this approach, even finding the correct email for the "dolphin" example given earlier. This indicates that, unsurprisingly, a system which uses both techniques together would outperform either alone.

Implementation Details

The source code for this project is a combination of Java, C++, and Lisp. It has been tested using Linux, Windows XP (with Cygwin), and Mac OS X. The code is available on my website. However, it depends on Minorthird, which is freely available, and the Tomita parser, which is available to CMU students and faculty, but otherwise restricted. The classifier used is available in binary form from my Radar website.

Conclusions

While this project does not propose any exciting new approaches or breakthrough experimental results, I have demonstrated how many language technologies, such as Information Retrieval, Information Extraction, Natural Language Processing, Text Classification, Sequence Classification, and even Machine Translation can be used to tackle the difficult IR problem of email Question Answering, at least for a specific domain. I have also implemented a practical system, which could hopefully someday be adapted for real use.

There are some conclusions which can be drawn about which modules perform better in which circumstances as well. For highly structured fields, such as dates, times, and locations, hand-written rules are clearly the way to go for an automatic annotator. They are easy to write and have nearly perfect performance. For less well-structured fields, such as names, titles, affiliations, a conditional random fields learner will perform much better, replacing the difficult rule-writing task with the easier job of labelling a subset of data.

Likewise, different questions are handled better by different approaches. Yes/No questions and questions containing dates and times, which are handled clumsily by traditional IR systems, are much better handled by deep parsing and searching filled templates. However questions containing loosely structured text such as titles are better handled by TF-IDF searching. Also, the search-based approach takes much less time to implement, so it should be chosen if time is a major factor.

In all, the overall performance of this system is good enough that it could be put to real use, especially when part of an interface that easily allows the user to explore the data further and verify the correctness of answers produced.

Future Work

Future work mainly consists of finding new ways to improve each of the modules. For one, a planner should be introduced which can decide which pipeline could better handle any particular question. Also, work needs to be done to further tune each module. Providing the classifier with a feedback loop from the automatic annotator could improve classification performance. Also, different TF-IDF weights could be tried to improve the performance of the search-based pipeline.

References