Our goal is to address the open challenge of Multifactorial Topic Detection and Tracking (mfTDT), which comprises detecting and tracking evolving topics, such as scientific trends or political developments, at different levels of granularity. This project pursues this goal via a Multifactorial Representation (MFR) for representing co-occurring entities, features (words), links, impact indicators and other meta-level features in publication records or news-story documents. MFR, combined with algorithms for efficient indexing, link analysis, impact indicator extraction and multi-aspect query expansion, enables our system to detect and track multifactorial evidence of topics and trends. Our model is a unified probabilistic framework built on a new family of Bayesian graphical models, namely multi-field Hierarchical Correlated Topic Modeling (mfHCTM), for simultaneously discovering multi-type hierarchies (one per field) of topics and relations. By training on temporally sliced data chunks and by supporting query-driven trend analysis, mfHCTM addresses fundamental limitations of existing state-of-the-art Bayesian graphical models. For evaluation we prepared four large benchmark datasets: TDT4 (27K documents over a span of 4 months), TDT5 (270K documents over a span of 6 months), Citeseer (700K articles over a span of 27 years) and Arxiv (200K documents over a span of 6 years). These datasets enable us to thoroughly evaluate our proposed methods, and will substantially help the research community to compare topic models.
Topic detection and tracking based on multifactorial evidence in scientific and technical literature is extremely important and has not yet been addressed by the machine learning and information retrieval communities. The productivity of researchers depends heavily on the availability of up-to-date information about related work and on a global picture of what is going on in related fields. Strategic plans and funding decisions by government agencies (such as NSF, NIH, DARPA and IARPA) also depend on informative overviews of scientific emergence and co-emergence within and across many fields of research, along with evidence of their impact. Companies (both large ones such as Google, Microsoft and Yahoo! and small ones such as startups) need mfTDT techniques in order to effectively assess and predict the impact of new technologies and to dynamically adjust their investment strategies. Furthermore, university education requires instructors and students to have a comprehensive and up-to-date understanding of how science and technologies are evolving over time, how multiple fields relate to each other, and which technologies trigger rapid developments in other technologies. The proposed techniques provide principled and effective solutions for mfTDT, with a broad future impact in the applications listed above and beyond.
Clustering has emerged as an important tool for giving end users a structured view of their data. Current clustering models make strong assumptions about the instance representation; for example, LDA assumes that instances are discrete feature counts. However, decades of research in information retrieval have established that a normalized Tf-Idf representation gives significantly better performance than count-based features. To bridge this gap, we developed a Bayesian clustering framework based on the von Mises-Fisher (vMF) distribution that not only models such normalized Tf-Idf representations but also retains the flexibility of graphical models. Our framework is well suited for text data and discovers more intuitive clusters than existing approaches. Our experiments on six datasets provide strong empirical support in favour of Bayesian vMF-based clustering models over other popular tools such as K-means, Multinomial Mixtures and Latent Dirichlet Allocation. Figure 1 shows the better performance of our model (Bayesian vMF) in terms of normalized mutual information (NMI), a popular clustering metric, on the version of the TDT4 news-story dataset that we preprocessed last year, with 622 documents and 34 ground-truth clusters.
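To make the role of the normalized representation concrete, the following is a minimal sketch in Python. It is not the Bayesian vMF model described above: it uses spherical k-means, which can be viewed as the hard-assignment limit of a vMF mixture with a shared concentration parameter, and it is shown only to illustrate clustering of unit-length Tf-Idf vectors by cosine similarity. The function names, the smoothed idf variant and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tfidf_unit_vectors(docs):
    """Map tokenized documents to L2-normalized Tf-Idf vectors (sparse dicts)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Assumed idf variant: log(N / df) + 1 keeps every weight positive.
        vec = {t: c * (math.log(n_docs / doc_freq[t]) + 1.0)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    """Dot product of two sparse unit vectors, i.e. their cosine similarity."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def spherical_kmeans(vectors, init_indices, n_iters=20):
    """Hard-assignment clustering on the unit sphere (a stand-in, not Bayesian vMF)."""
    centers = [dict(vectors[i]) for i in init_indices]
    labels = [0] * len(vectors)
    for _ in range(n_iters):
        # Assignment step: choose the mean direction with the highest cosine.
        labels = [max(range(len(centers)), key=lambda k: cosine(x, centers[k]))
                  for x in vectors]
        # Update step: each center becomes the normalized sum of its members.
        for k in range(len(centers)):
            total = defaultdict(float)
            for x, label in zip(vectors, labels):
                if label == k:
                    for t, w in x.items():
                        total[t] += w
            norm = math.sqrt(sum(w * w for w in total.values()))
            if norm > 0:
                centers[k] = {t: w / norm for t, w in total.items()}
    return labels


# Toy corpus (hypothetical): two politics stories and two astronomy stories.
docs = [
    ["election", "vote", "senate"],
    ["vote", "senate", "ballot"],
    ["galaxy", "telescope", "orbit"],
    ["orbit", "telescope", "planet"],
]
vectors = tfidf_unit_vectors(docs)
labels = spherical_kmeans(vectors, init_indices=[0, 2])
print(labels)
```

On this toy corpus the two politics documents fall into one cluster and the two astronomy documents into the other. A full vMF mixture would additionally estimate a concentration parameter per cluster, and the Bayesian variant would place priors on the mean directions and concentrations.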
Often the data that the user wants to analyze is large, and manually inspecting a flat layer of many clusters is hard. For such cases, we developed a hierarchical vMF model that supports hierarchical nesting and organizes the data into increasing levels of granularity as defined by the input hierarchy. Our experiments showed a significant improvement in likelihood for the hierarchical vMF model over the flat vMF model. The figure below shows the higher likelihood achieved by the hierarchical model over the flat vMF model on the widely used 20 Newsgroups dataset; the x-axis denotes different 3-level-deep hierarchies with different branching factors.
There is a great need in practical applications for analyzing and maintaining data collections in which the data consists of multiple fields with different but interrelated information. Existing models either cannot handle such multiple fields or simply concatenate all the fields into one, both of which are ineffective. We extended our Bayesian vMF-based clustering model to seamlessly handle multi-field representations of instances. Our model inherently leverages the correlation between the different fields while clustering the data, which leads to improved clustering. We also extended our flat Bayesian multi-field vMF clustering model to generate multiple field-specific hierarchies that show the distributions of features in each field at various levels of granularity. The figure below shows the higher likelihood achieved by our hierarchical multi-field model over the flat multi-field model on the full Citeseer dataset, which contains 716,772 documents and spans a decade of research articles from 1994 to 2004. The x-axis denotes three different hierarchies, each 3 levels deep, with different branching factors.
A full multi-field hierarchical browser of the entire Citeseer corpus can be found here. Each node in the hierarchy branches off into 10 child nodes with higher granularity. The top representative features from each field are also shown.
We further enhanced the predictive power of our system by developing a novel graph-based transductive learning component, namely Transductive Learning over Product Graph (TOP), which simultaneously extracts multi-type associations from different sources of data, maps heterogeneous types of objects and relations onto a unified product graph, and performs joint inference about the topic labels of documents via transductive label propagation over the product graph. This approach is particularly effective in transductive learning scenarios where labeled documents are very sparse and unlabeled documents are massively available, and where the manifold structures are highly informative but vary across different fields of co-occurrence data. In our experiments with a subset of DBLP publication records (34K users, 11K papers and 22 venues) and an Enzyme multi-source dataset (445 compounds, 664 proteins), CGRL successfully scaled to the large cross-graph inference problem, and significantly outperformed other representative approaches (H Liu and Y Yang, ICML 2016).
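The idea of propagating labels over a product graph can be sketched as follows. This is a simplified illustration under stated assumptions, not the method evaluated above: it assumes a Kronecker (tensor) product of two row-normalized graphs and a generic iterative propagation rule, F ← αAFBᵀ + (1 − α)Y, where Y holds the observed scores for pairs of nodes. Because (A ⊗ B) applied to the row-major flattening of F equals the flattening of AFBᵀ, the product graph never has to be materialized. All function names, the toy graphs and the value of α are hypothetical.

```python
def row_normalize(adj):
    """Divide each row of an adjacency matrix by its sum (zero rows stay zero)."""
    out = []
    for row in adj:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def propagate_on_product_graph(adj_g, adj_h, seeds, alpha=0.8, n_iters=50):
    """Score every pair (i in G, j in H) by propagation over the product of G and H.

    Assumption: the product graph is the Kronecker product of the two
    row-normalized graphs, applied implicitly as A @ F @ B^T.
    """
    a = row_normalize(adj_g)
    b_t = transpose(row_normalize(adj_h))
    scores = [row[:] for row in seeds]
    for _ in range(n_iters):
        smoothed = matmul(matmul(a, scores), b_t)
        # Blend propagated scores with the observed seed scores.
        scores = [[alpha * smoothed[i][j] + (1.0 - alpha) * seeds[i][j]
                   for j in range(len(seeds[0]))]
                  for i in range(len(seeds))]
    return scores


# Toy graph G (hypothetical): nodes 0-1 are linked, nodes 2-3 are linked.
adj_g = [[0, 1, 0, 0],
         [1, 0, 0, 0],
         [0, 0, 0, 1],
         [0, 0, 1, 0]]
# Toy graph H (hypothetical): nodes 0 and 1 are similar, node 2 is isolated.
adj_h = [[1, 1, 0],
         [1, 1, 0],
         [0, 0, 1]]
# A single observed association: G-node 0 with H-node 0.
seeds = [[1.0, 0.0, 0.0],
         [0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0]]
scores = propagate_on_product_graph(adj_g, adj_h, seeds)
for row in scores:
    print([round(v, 3) for v in row])
```

In this toy setting the single observed pair spreads evidence to pairs whose members are neighbours in both graphs, such as (1, 1), while pairs involving the disconnected nodes keep a score of zero. With α < 1 and row-normalized graphs the iteration converges, since the implicit product matrix is row-stochastic.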