Multi-field Hierarchical Discovery and Tracking

An NSF-funded project (1216282)

Project Personnel

PI at Carnegie Mellon University: Yiming Yang
Students involved: Siddharth Gopal, Hanxiao Liu, Yuexin Wu

Project Goals and Objectives

Our goal is to address the open challenge of Multi-factorial Topic Detection and Tracking (mf-TDT), which comprises detecting and tracking evolving topics, such as scientific trends or political developments, at different levels of granularity. This project achieves this goal via a Multi-factorial Representation (MFR) for representing co-occurring entities, features (words), links, impact indicators and other meta-level features in publication records or news-story documents. MFR, combined with algorithms for efficient indexing, link analysis, impact indicator extraction and multi-aspect query expansion, enables our system to detect and track multi-factorial evidence of topics and trends. Our model is a unified probabilistic framework built on a new family of Bayesian graphical models, namely multi-field Hierarchical Correlated Topic Modeling (mf-HCTM), for simultaneously discovering multi-type hierarchies (one per field) of topics and relations. By training on temporally sliced data chunks and by supporting query-driven trend analysis, mf-HCTM addresses fundamental limitations of existing state-of-the-art Bayesian graphical models. For evaluation, we prepared four large benchmark datasets: TDT4 (27K documents over a span of 4 months), TDT5 (270K documents over a span of 6 months), Citeseer (700K articles over a span of 27 years) and Arxiv (200K documents over a span of 6 years). These datasets enable us to thoroughly evaluate our proposed methods, and will substantially help the research community to compare topic models.

Topic detection and tracking based on multi-factorial evidence in scientific and technical literature is extremely important and not yet addressed by the machine learning and information retrieval communities. The productivity of researchers depends heavily on the availability of up-to-date information about related work and a global picture of what is happening in related fields. Strategic plans and funding decisions by government agencies (such as NSF, NIH, DARPA and IARPA) also depend on informative overviews of scientific emergence and co-emergence within and across many fields of research, along with evidence of their impact. Industries (both large companies such as Google, Microsoft and Yahoo! and small ones such as startups) need mf-TDT techniques in order to effectively assess and predict the impact of new technologies and to dynamically adjust their investment strategies. Furthermore, education in universities requires instructors and students to have a comprehensive and up-to-date understanding of how science and technologies are evolving over time, how multiple fields relate to each other, and which technologies trigger rapid developments of other technologies. The proposed techniques provide principled and effective solutions for mf-TDT, with broad future impact in the applications listed above and beyond.




Point of Contact

For further information, please contact Yiming Yang.

Last updated

Mon Dec 12 02:08:11 EST 2016