The current projects in the CLAIR group are listed below along with
brief descriptions. The Principal Investigator for all projects under
the CLAIR group is Professor Yiming Yang.
1. GALE (Global Autonomous Language Exploitation) - Distillation
The goal of the GALE program is to develop and apply computer software
technologies to absorb, analyze and interpret huge volumes of speech
and text in multiple languages. As part of GALE's Distillation Engine,
the CLAIR group has created CAFE - CMU's Adaptive Filtering Engine,
which effectively combines adaptive filtering, passage retrieval, and
novelty detection for utility-based information distillation. CAFE
learns long-lasting information needs of users from fine-grained user
feedback over a temporal sequence of ranked passages. Users can choose
to suppress information they already know about, or information already
presented to them by the system in the past. CAFE has received several
positive reviews from experienced intelligence analysts in recent
evaluations.
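The novelty-suppression idea described above can be illustrated with a
minimal sketch. This is not CAFE's implementation: the bag-of-words
representation, the cosine similarity measure, the 0.8 threshold, and
the function names are all assumptions chosen for illustration. A
passage is dropped when it is too similar to anything the user already
knows or has already been shown.

```python
import re
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Lowercase word counts for a passage."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter word vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def filter_novel(ranked_passages, history, threshold=0.8):
    """Keep passages whose similarity to everything already seen is below threshold.

    ranked_passages: passages in relevance order (best first).
    history: passages the user has already seen or marked as known.
    The threshold value is an illustrative assumption, not a tuned setting.
    """
    seen = [bag_of_words(p) for p in history]
    kept = []
    for passage in ranked_passages:
        vec = bag_of_words(passage)
        if all(cosine(vec, old) < threshold for old in seen):
            kept.append(passage)
        seen.append(vec)  # shown or suppressed, it now counts as seen
    return kept
```

Every passage is added to the seen list whether or not it is kept, so
a near-duplicate later in the same ranked list is also suppressed.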
2. RADAR (Reflective Agents with Distributed Adaptive Reasoning)
The RADAR project is aimed at helping its human master with tasks like
scheduling meetings, allocating resources, creating coherent reports
from snippets of information, and managing email by grouping related
messages, flagging high-priority requests, and automatically proposing
answers to routine messages. It applies learning technology to email,
calendars, web sites, etc. to create a proactive assistant whose goal
is to improve human productivity. The project aims to develop a system
that can both save time for its user and improve the quality of
decisions. Our contribution to the project is building a robust
classifier (an adaptive logistic regression classifier) that assigns
one of nine predefined categories to each email.
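As a rough illustration of what an adaptive logistic regression
classifier for email might look like, the sketch below trains a
multiclass (softmax) logistic regression model one message at a time.
It assumes that "adaptive" means incremental updates from each newly
labelled email, which the description above does not spell out. The
class name, the whitespace tokenizer, and the learning rate are
hypothetical.

```python
import math
from collections import defaultdict


class OnlineSoftmaxClassifier:
    """Multiclass logistic regression trained one example at a time with SGD."""

    def __init__(self, labels, learning_rate=0.5):
        self.labels = list(labels)
        self.learning_rate = learning_rate  # illustrative value, not tuned
        # One sparse weight vector per label, keyed by feature name.
        self.weights = {label: defaultdict(float) for label in self.labels}

    def _features(self, text):
        """Bag-of-words counts plus a constant bias feature."""
        feats = defaultdict(float)
        for token in text.lower().split():
            feats[token] += 1.0
        feats["__bias__"] = 1.0
        return feats

    def predict_proba(self, text):
        feats = self._features(text)
        scores = {
            label: sum(self.weights[label].get(f, 0.0) * v for f, v in feats.items())
            for label in self.labels
        }
        top = max(scores.values())  # subtract the max for numerical stability
        exps = {label: math.exp(s - top) for label, s in scores.items()}
        total = sum(exps.values())
        return {label: e / total for label, e in exps.items()}

    def predict(self, text):
        probs = self.predict_proba(text)
        return max(probs, key=probs.get)

    def update(self, text, true_label):
        """One gradient step on the log loss for a single labelled email."""
        feats = self._features(text)
        probs = self.predict_proba(text)
        for label in self.labels:
            error = (1.0 if label == true_label else 0.0) - probs[label]
            for f, v in feats.items():
                self.weights[label][f] += self.learning_rate * error * v
```

In use, the classifier would be constructed with the nine category
names and `update` would be called whenever a user confirms or corrects
a predicted category, so the model keeps adapting to new mail.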
3. Machine Learning Approaches to Proteomic Problems
Tandem mass spectrometry (MS-MS) enables powerful new proteomics
approaches to discover biomarkers of disease, therapeutic responses and
toxicity. The objective of this project is to create a comprehensive
proteomics platform for proteome characterization and biomarker
discovery. We adopt a three-stage approach to provide a high degree of
interpretability to the results. The goals of the stages are,
respectively:
1) to develop algorithms and software for high-recall identification of
peptide sequences from multiplexed MS-MS spectra by database searching,
2) to combine evidence from the previous step effectively to build
reliable protein identification systems, and
3) to develop techniques for multivariate comparisons of complex
proteome data sets representing different biological states.
The last stage will employ a three-layered classification approach,
which integrates multiplexed MS-MS spectra, peptide identifications,
and protein identifications.
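To make the second stage concrete, the sketch below shows one simple
way to combine peptide-level evidence into a protein-level score. It
is an assumption for illustration only, not the project's method: it
treats peptide identifications as independent and scores a protein as
the probability that at least one of its peptide matches is correct.
The function name, the input layout, and the peptide sequences in any
example data are hypothetical, and peptides shared between proteins
are not handled specially.

```python
from collections import defaultdict


def protein_probabilities(peptide_hits):
    """Combine peptide-level probabilities into one probability per protein.

    peptide_hits: iterable of (protein_id, peptide_sequence, probability).
    Assumes peptide matches are independent, which real data may violate.
    """
    # Keep only the best-scoring identification of each peptide per protein.
    best = defaultdict(dict)
    for protein, peptide, prob in peptide_hits:
        best[protein][peptide] = max(prob, best[protein].get(peptide, 0.0))

    result = {}
    for protein, peptides in best.items():
        miss = 1.0
        for prob in peptides.values():
            miss *= 1.0 - prob  # probability that every peptide match is wrong
        result[protein] = 1.0 - miss
    return result
```

A protein supported by several moderately confident peptides can thus
score higher than one supported by a single peptide, which is the
basic intuition behind evidence combination at this stage.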
We have developed novel probabilistic generative models and learning
algorithms for the discovery of graph structures and relational links
among objects and classes of objects (genes, documents, words, topical
categories, etc.). We impose a global hierarchical Bayesian prior (an
ARD-style Wishart prior) on the precision matrix of the Graphical
Gaussian Model (GGM) for learning sparse graphs, and we extend lasso
regression to enable global optimization with quadratic time
complexity. The main applications of this new framework include the
learning of gene regulatory networks from microarray gene expression
data, and the discovery of topic networks based on the class labels of
documents, the word-level similarities among documents, and possibly
link-associated features (anchor text or link labels). Our experiments
show significant improvements in prediction accuracy when using the
induced networks to analyze the functions of genes and to classify
documents. Our approach scales to very large datasets on which
previous solutions have major scaling issues. For example, we have
successfully learned genome regulatory networks with about 20,000
nodes.
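For readers unfamiliar with lasso-based graph learning, the sketch
below shows the standard neighborhood-selection baseline: each variable
is regressed on all the others with an L1 penalty, and an edge is drawn
wherever a coefficient is nonzero. This is a generic technique swapped
in for illustration, not the hierarchical Bayesian model or the global
optimization described above. The penalty value, the function names,
and the synthetic demo data are assumptions.

```python
import random


def standardize(columns):
    """Scale each column to zero mean and unit variance."""
    out = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        std = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
        out.append([(v - mean) / std for v in col])
    return out


def soft_threshold(value, lam):
    """Shrink value toward zero by lam, clipping at zero."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso(target, predictors, lam, sweeps=100):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam*||b||_1 on standardized columns."""
    n = len(target)
    coefs = [0.0] * len(predictors)
    residual = list(target)
    for _ in range(sweeps):
        for j, col in enumerate(predictors):
            # Correlation of column j with the residual excluding its own contribution.
            rho = sum(c * (r + c * coefs[j]) for c, r in zip(col, residual)) / n
            new = soft_threshold(rho, lam)
            if new != coefs[j]:
                delta = new - coefs[j]
                residual = [r - c * delta for c, r in zip(col, residual)]
                coefs[j] = new
    return coefs


def learn_graph(columns, lam):
    """Return undirected edges (i, j), i < j, selected by either node's lasso regression."""
    cols = standardize(columns)
    edges = set()
    for i, target in enumerate(cols):
        others = [j for j in range(len(cols)) if j != i]
        coefs = lasso(target, [cols[j] for j in others], lam)
        for j, b in zip(others, coefs):
            if b != 0.0:
                edges.add((min(i, j), max(i, j)))
    return edges


def make_demo_data(n=500, seed=0):
    """Synthetic data: variable 1 depends on variable 0; variable 2 is independent."""
    rng = random.Random(seed)
    x0 = [rng.gauss(0, 1) for _ in range(n)]
    x1 = [a + 0.3 * rng.gauss(0, 1) for a in x0]
    x2 = [rng.gauss(0, 1) for _ in range(n)]
    return [x0, x1, x2]
```

Calling `learn_graph(make_demo_data(), lam=0.3)` should link only the
two dependent variables. This per-node baseline solves one regression
per variable independently, which is part of what makes scaling to
tens of thousands of nodes difficult without a more global formulation.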
The goal of our current F/A-18 project is to help users fill out a
complex form in order to increase their productivity. We are building a
recommendation system that would help military personnel reduce the
cost and effort of F/A-18 aircraft maintenance tasks. Potentially we
can also help troubleshoot hard problems, or solve simple problems
without asking engineers for help. We have done some work using
logistic regression to predict some fields in the form from the user's
problem description. In the next stage we aim to use the EVSM model to
solve the problem more flexibly: given information of any type (for
example, only the author's name), suggest values of any other type
(for example, the title text of a past case, or a priority).
This project is aimed at developing a system that learns to organize a
user’s emails by their priority. The system will learn the user’s
priorities by observing his or her interaction with the email client
(which email gets read or replied to first, which gets deleted without
reading, which gets left in the mailbox for a long time, and so on), in
addition to sporadic, explicit feedback from the user. Using this
information, the system will learn to predict the priority of an
incoming message from its contact.
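One way to picture how observed behavior could become training signal
is sketched below. The action names, their ordering, and the rule that
explicit feedback overrides observed behavior are all hypothetical
choices for illustration; the project description does not specify
them. The function turns a log of interactions into a priority
ordering that a learner could then try to reproduce.

```python
from dataclasses import dataclass

# Hypothetical action names; a real email client would expose its own events.
# A lower rank means higher inferred priority.
ACTION_RANK = {"replied": 0, "read": 1, "left_unread": 2, "deleted_unread": 3}


@dataclass
class Interaction:
    message_id: str
    action: str
    seconds_after_arrival: float


def observed_priority_order(interactions, explicit=None):
    """Order message ids from highest to lowest inferred priority.

    explicit: optional {message_id: rank} overrides from direct user
    feedback, on the same scale as ACTION_RANK (lower is more important).
    Ties within a rank are broken by how quickly the user acted.
    """
    explicit = explicit or {}

    def key(item):
        rank = explicit.get(item.message_id, ACTION_RANK[item.action])
        return (rank, item.seconds_after_arrival)

    return [item.message_id for item in sorted(interactions, key=key)]
```

The resulting ordering could serve as the target for a ranking or
classification model that scores incoming messages.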
The main risks to the subject from participation in this study are
embarrassment and the loss of employment or social standing as a result
of the accidental disclosure of confidential information contained in
one or more email messages. Every effort will be made to protect the
privacy of subjects participating in this study, including the
anonymization of all personally identifying information (names, email
addresses, etc.), but some risk of privacy loss will always remain. The
anonymization of personally identifying information will be done
automatically by our system, and the user will be able to anonymize any
further information in his or her emails that he or she feels should be
anonymized and that our system has missed. The system will then learn
from such additional information and anonymize similar text in the
future without bothering the user.
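The two-part anonymization workflow described above (automatic masking
plus user-supplied additions that are remembered) can be sketched as
follows. This is a simplified illustration under stated assumptions:
only email addresses are detected automatically, "learning" is reduced
to remembering exact strings the user masked by hand, and the class
name and placeholder tokens are hypothetical. Detecting personal names
or generalizing to similar text would need considerably more than this.

```python
import re

# Deliberately simple pattern for email addresses; it is an approximation
# and will not match every valid address.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class Anonymizer:
    def __init__(self):
        self.learned_terms = set()

    def learn(self, term):
        """Remember a string the user anonymized by hand."""
        term = term.strip()
        if term:
            self.learned_terms.add(term)

    def anonymize(self, text):
        """Mask email addresses and every previously learned term."""
        text = EMAIL_PATTERN.sub("[EMAIL]", text)
        # Longest terms first so a full name is masked before any shorter term inside it.
        for term in sorted(self.learned_terms, key=len, reverse=True):
            pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
            text = pattern.sub("[REDACTED]", text)
        return text
```

After the user masks a phrase once and it is passed to `learn`, later
messages containing the same phrase are masked without further input.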
Classification, Language Analysis and Information Retrieval (CLAIR)
Language Technologies Institute (LTI), School of Computer Science (SCS)
Carnegie Mellon University (CMU), Pittsburgh, PA 15213, USA