Classification, Language Analysis and Information Retrieval
  Language Technologies Institute, School of Computer Science  
 

Carnegie Mellon University

 


The current projects of the CLAIR group are listed below with brief descriptions.
The Principal Investigator for all CLAIR projects is Professor Yiming Yang.

1. GALE (Global Autonomous Language Exploitation) - Distillation

The goal of the GALE program is to develop and apply computer software technologies to absorb, analyze and interpret huge volumes of speech and text in multiple languages. As part of GALE's Distillation Engine, the CLAIR group has created CAFE - CMU's Adaptive Filtering Engine, which effectively combines adaptive filtering, passage retrieval, and novelty detection for utility-based information distillation. CAFE learns long-lasting information needs of users from fine-grained user feedback over a temporal sequence of ranked passages. Users can choose to suppress information they already know about, or information already presented to them by the system in the past. CAFE has received several positive reviews from experienced intelligence analysts in recent evaluations.
http://www.darpa.mil/ipto/programs/gale/
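
As a rough illustration of the novelty-detection idea described above (suppressing passages the user has already been shown), the following minimal sketch filters a ranked list of passages by cosine similarity against previously shown ones. It is not CAFE's actual implementation: the bag-of-words representation, the function names, and the 0.8 threshold are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts for a passage (assumed representation)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_novel(ranked_passages, seen, threshold=0.8):
    """Return passages, in rank order, that are not too similar to `seen`.

    `seen` is a list of word-count vectors for passages already shown.
    It is updated in place, so later calls also suppress what is shown here.
    The threshold is a hypothetical value, not one taken from CAFE.
    """
    shown = []
    for passage in ranked_passages:
        vec = bag_of_words(passage)
        if all(cosine(vec, old) < threshold for old in seen):
            shown.append(passage)
            seen.append(vec)
    return shown
```

Calling `filter_novel` repeatedly with the same `seen` list mimics a temporal sequence of ranked passages: anything presented in an earlier batch is suppressed in later ones.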

    Project Members: Abhimanyu Lad, Abhay Harpale, Bryan Kisiel


2. RADAR (Reflective Agents with Distributed Adaptive Reasoning)
 
The RADAR project aims to help its human master with tasks such as scheduling meetings, allocating resources, creating coherent reports from snippets of information, and managing email by grouping related messages, flagging high-priority requests, and automatically proposing answers to routine messages. It applies learning technology to email, calendars, web sites, etc. to create a proactive assistant whose goal is to improve human productivity and, over the duration of the project, to develop a system that can both save time for its user and improve the quality of decisions. Our contribution to the project is building a robust classifier (an adaptive logistic regression classifier) that assigns one of nine predefined categories to each email.
http://radar.cs.cmu.edu/

      Project Members: Shinjae Yoo, Sachin Agarwal, Ni Lao

3. Machine Learning Approaches to Proteomic Problems

Tandem mass spectrometry (MS-MS) enables powerful new proteomics approaches to discover biomarkers of disease, therapeutic responses and toxicity. The objective of this project is to create a comprehensive proteomics platform for proteome characterization and biomarker discovery. We adopt a three-stage approach to provide a high degree of interpretability to the results. The goals of the three stages are, respectively:

1) to develop algorithms and software for high recall identification of peptide sequences from multiplexed MS-MS spectra by database searching,
2) to effectively combine evidence from the previous step to build reliable protein identification systems and
3) to develop techniques for multivariate comparisons of complex proteome data sets representing different biological states.

The last step will employ a three-layered classification approach, which integrates multiplexed MS-MS spectra, peptide identifications, and protein identifications.

       Project Member: Subramaniam Ganapathy

4. Graph Learning

We have developed novel probabilistic generative models and learning algorithms for the discovery of graph structures and relational links among objects and classes of objects (genes, documents, words, topical categories, etc.). We impose a global hierarchical Bayesian prior (ARD-style Wishart prior) on the precision matrix of the Graphical Gaussian Model (GGM) for learning sparse graphs, and we extend lasso regression to enable global optimization with quadratic time complexity. Main applications of this new framework include the learning of gene regulatory networks from microarray gene expression data, and the discovery of topic networks based on the class labels of documents, the word-level similarities among documents, and possibly link-associated features (anchor text or link labels). Our experiments show significant improvements in prediction accuracy when using the induced networks to analyze the functions of genes and to classify documents. Our approach scales to very large datasets on which previous solutions have major scaling issues. For example, we have successfully learned genome regulatory networks with about 20,000 nodes.
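
To make the link between a sparse precision matrix and a graph concrete, here is a minimal sketch of how edges can be read off an already estimated precision matrix. In a Graphical Gaussian Model, a zero off-diagonal entry means the two variables are conditionally independent given the rest, so only nonzero entries become edges. The sketch does not implement the Wishart prior or the lasso extension described above; the function names and the tolerance are assumptions for illustration.

```python
import math


def partial_correlations(precision):
    """Partial correlations rho_ij = -P_ij / sqrt(P_ii * P_jj) for i != j."""
    n = len(precision)
    rho = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                scale = math.sqrt(precision[i][i] * precision[j][j])
                rho[i][j] = -precision[i][j] / scale
    return rho


def graph_edges(precision, tol=1e-8):
    """List (i, j, partial correlation) for every nonzero off-diagonal entry.

    `tol` is a hypothetical cutoff that treats tiny values as exact zeros.
    """
    rho = partial_correlations(precision)
    n = len(precision)
    return [
        (i, j, rho[i][j])
        for i in range(n)
        for j in range(i + 1, n)
        if abs(rho[i][j]) > tol
    ]
```

For a tridiagonal precision matrix over three variables, for instance, this yields a chain graph: variables 0 and 2 are linked only through variable 1.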

    Project Member:  Fan Li

5. F/A-18 Automatic Maintenance Environment

The goal of our current F/A-18 project is to help users fill out a complex form in order to increase their productivity. We are building a recommendation system that would help military personnel reduce the cost and effort of F/A-18 aircraft maintenance tasks. Potentially, it could also help troubleshoot hard problems, or solve simple problems without asking engineers for help. We have done some work using logistic regression to predict some fields in the form from the user's problem description. In the next stage we aim to use the EVSM model to solve the problem more flexibly: given information of any type (for example, only the author's name), it would suggest values for fields of any other type (for example, the title text of past cases, or the priority).

    Project Members:  Sachin Agarwal, Ni Lao, Shinjae Yoo

6. Email Prioritization

This project aims to develop a system that learns to organize a user’s emails by their priority. The system will learn the user’s priorities by observing his or her interaction with the email client (which emails get read or replied to first, which get deleted without being read, which are left in the mailbox for a long time, and so on), in addition to sporadic, explicit feedback from the user. Using this information, the system will learn to predict the priority of an incoming message from its contact.
The main risks to subjects from participation in this study are embarrassment and the loss of employment or social standing as a result of the accidental disclosure of confidential information contained in one or more email messages. Every effort will be made to protect the privacy of subjects participating in this study, including the anonymization of all personally identifying information (names, email addresses, etc.), but some risk of privacy loss will always remain. Personally identifying information will be anonymized automatically by our system, and users will be able to anonymize any further information in their emails that they consider necessary and that our system has not already handled. The system will then learn from such additions and anonymize similar text in the future without bothering the user.
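
The anonymization workflow described above (automatic masking plus user-supplied additions that are remembered for future messages) could look roughly like the following minimal sketch. It is not the project's actual system: the class name, the placeholder tokens, and the simple email pattern are assumptions, and real name detection would need more than a regular expression.

```python
import re

# Deliberately simple email pattern, assumed for this sketch only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class Anonymizer:
    """Masks email addresses and any extra terms the user has flagged."""

    def __init__(self):
        self.user_terms = set()

    def add_term(self, term):
        """Remember a user-flagged string so future text is masked too."""
        if term:
            self.user_terms.add(term)

    def anonymize(self, text):
        text = EMAIL_RE.sub("[EMAIL]", text)
        # Mask longer terms first so a shorter term cannot split a longer one.
        for term in sorted(self.user_terms, key=len, reverse=True):
            text = re.sub(re.escape(term), "[REDACTED]", text, flags=re.IGNORECASE)
        return text
```

Once a user flags a phrase with `add_term`, every later call to `anonymize` masks it (case-insensitively) without further input, which mirrors the "learn once, apply in future" behavior described in the text.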

     Project Members: Shinjae Yoo, Sachin Agarwal (partly)


Classification, Language Analysis and Information Retrieval (CLAIR)
Language Technologies Institute (LTI), School of Computer Science (SCS)
Carnegie Mellon University (CMU), Pittsburgh, PA 15213, USA