Large-scale Transdutive Learning from Heterogeneous Data Sources

A NSF funded project 1546329


PI: Yiming Yang, Carnegie Mellon University

Students involved : Hanxiao Liu, Yuexin Wu, Wei-Cheng Chang, Jingzhou Liu, Ruochen Xu

Goals and Objectives

Important problems in the big-data era involve predictions based on heterogeneous sources of information and the dependency structures in data. In recommendation systems, for example, predictions need to be made not only based on observed user ratings over items (movies, books, music, shopping products, etc.), but also based on information such as demographical data of users and textual descriptions of items. In event detection from textual data (news stories, tweets, maintenance reports, legal documents, etc.), joint inference must be based on who (agents), what (event types or topics), where (locations) and when (dates), and also based on the connections among agents (in social networks), topics (in an event-type ontology), locations (in a map) and temporal co-occurrences. The fundamental research questions therefore include: 1) how to develop a unified optimization framework for predictions based on heterogeneous information and dependency structures in various kinds of tasks, 2) how to make the inference computationally tractable when the combined space of model parameters is extremely large, and 3) how to significantly enhance the prediction power of the system by leveraging massively available unlabeled data in addition to human-annotated training data which are often sparse.

This project will address the three challenges via the following approaches:

The proposed work, if successful, will offer principled solutions for enhancing the prediction power of systems in a broad range of tasks, whenever recommendation, classification and regression are involved. Technical impacts of the proposed work are expected in multiple research fields.


We developed a novel graph-based transductive learning framework, namely Transductive Learning over Product Graph (TOP), which simultaneously extracts multi-type associations from different sources of data, maps heterogeneous types of objects and relations onto a unified product graph, and performs joint inference about topic labels of documents via transductive label propagation over the product graph. This approach is particularly effective in transductive learning scenario where labeled documents are very sparse and unlabeled documents are massively available, and when the manifold structures are highly informative but varying in different fields of co-occurrence data.

In our experiments with an Enzyme multi-source dataset (445 compounds, 664 proteins) and a subset of DBLP publication records (34K users, 11K papers and 22 venues) (Figure 1), CGRL successfully scaled to the large cross-graph inference problem, and outperformed other representative approaches significantly (Figure 2) (Hanxiao Liu and Yiming Yang, ICML 2016).


Fig 1. Prediction of associations among heterogeneous graphs on the Enzyme (left) and DBLP (right) datasets. The blue edges represent the within-graph relations and the red edges represent the cross-graph interactions.

As a complementary direction, we also developed a novel nonparametric framework (Hanxiao Liu and Yiming Yang, AISTATS 2016) for semi-supervised learning and for optimizing the Laplacian spectrum of the data manifold simultaneously. The new technique can be incorporated into any homogeneous transductive learning algorithm, including our own works in ICML’16 (over the product graphs). The new formulation leads to a convex optimization problem that can be efficiently solved via the bundle method, and can be interpreted as to asymptotically minimize the generalization error bound of semi-supervised learning with respect to the graph spectrum. Experiments over benchmark datasets in various domains (text, image, audio) show advantageous performance of the proposed method over existing graph-based semi-supervised learning algorithms.


Fig 2. The results of TOP (our method), LTKM (low-rank tensor kernel machine), NN (nearest neighbor), RSVM (ranking SVM), TF (tensor factorization) and GRTF (graph-regularized tensor factorization) on benchmark data.


Last Modified

Thu Nov 3 14:35:42 EDT 2016