Large-scale Transductive Learning from Heterogeneous Data Sources

An NSF-funded project (Award 1546329)


PI: Yiming Yang, Carnegie Mellon University

Students involved: Hanxiao Liu, Yuexin Wu, Wei-Cheng Chang, Jingzhou Liu, Ruochen Xu

Goals and Objectives

Important problems in the big-data era involve making predictions based on heterogeneous sources of information and on the dependency structures in data. In recommendation systems, for example, predictions must be made not only from observed user ratings over items (movies, books, music, shopping products, etc.), but also from information such as demographic data of users and textual descriptions of items. In event detection from textual data (news stories, tweets, maintenance reports, legal documents, etc.), joint inference must be based on who (agents), what (event types or topics), where (locations) and when (dates), and also on the connections among agents (in social networks), topics (in an event-type ontology), locations (on a map) and temporal co-occurrences. The fundamental research questions therefore include: 1) how to develop a unified optimization framework for predictions based on heterogeneous information and dependency structures across various kinds of tasks, 2) how to make the inference computationally tractable when the combined space of model parameters is extremely large, and 3) how to significantly enhance the predictive power of the system by leveraging massively available unlabeled data in addition to human-annotated training data, which are often sparse.

This project addresses these three challenges through the approaches described in the sections below.

The work has yielded significant impacts on both machine learning algorithms and real-world applications in multiple fields, as illustrated in the following.


Transductive Learning over Graphs

We developed a novel graph-based transductive learning framework, namely Transductive Learning over Product Graph (TOP), which simultaneously extracts multi-type associations from different sources of data, maps heterogeneous types of objects and relations onto a unified product graph, and performs joint inference about topic labels of documents via transductive label propagation over the product graph. This approach is particularly effective in transductive learning scenarios where labeled documents are very sparse while unlabeled documents are massively available, and where the manifold structures are highly informative but vary across different fields of the co-occurrence data.
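The core idea of propagation over a product graph can be illustrated with a toy sketch. Here two small within-graph adjacency matrices are combined via a Kronecker product so that each product-graph node is a cross-graph pair, and labels diffuse with the standard iteration F <- alpha*S*F + (1-alpha)*Y. The graphs, sizes, and the single labeled pair are illustrative, not the paper's actual formulation or scalable solver:

```python
import numpy as np

# Toy within-graph relations for two heterogeneous graphs (illustrative).
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
B = np.array([[0, 1],
              [1, 0]], dtype=float)

# Product graph: each node is a pair (node of A, node of B).
W = np.kron(A, B)
S = W / W.sum(axis=1, keepdims=True)     # row-normalized transition matrix

Y = np.zeros(W.shape[0])
Y[0] = 1.0                               # one labeled pair; the rest unlabeled

# Transductive label propagation: F <- alpha*S*F + (1-alpha)*Y.
alpha, F = 0.8, Y.copy()
for _ in range(100):
    F = alpha * S @ F + (1 - alpha) * Y

ranking = np.argsort(-F)                 # cross-graph pairs ranked by score
```

In practice the Kronecker product is never materialized; exploiting its structure is what makes inference over the product graph tractable at scale.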

In our experiments with an Enzyme multi-source dataset (445 compounds, 664 proteins) and a subset of DBLP publication records (34K users, 11K papers and 22 venues) (Figure 1), TOP successfully scaled to the large cross-graph inference problem and significantly outperformed other representative approaches (Figure 2) (Hanxiao Liu and Yiming Yang, ICML 2016).


Fig 1. Prediction of associations among heterogeneous graphs on the Enzyme (left) and DBLP (right) datasets. The blue edges represent the within-graph relations and the red edges represent the cross-graph interactions.

As a complementary direction, we also developed a novel nonparametric framework (Hanxiao Liu and Yiming Yang, AISTATS 2016) that performs semi-supervised learning while simultaneously optimizing the Laplacian spectrum of the data manifold. The new technique can be incorporated into any homogeneous transductive learning algorithm, including our own ICML'16 work over product graphs. The formulation leads to a convex optimization problem that can be efficiently solved via the bundle method, and can be interpreted as asymptotically minimizing the generalization error bound of semi-supervised learning with respect to the graph spectrum. Experiments over benchmark datasets in various domains (text, image, audio) show advantageous performance of the proposed method over existing graph-based semi-supervised learning algorithms.
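The role of the graph spectrum can be sketched as follows: a kernel is built by transforming the Laplacian eigenvalues, and labeled nodes then drive a kernel-based prediction for all nodes. This toy uses one fixed transform sigma(lambda) = 1/(1+lambda), whereas the AISTATS'16 framework learns the spectral transformation; the graph and labels are illustrative:

```python
import numpy as np

# A 4-node chain graph and its combinatorial Laplacian (illustrative).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W

# Build a graph kernel by reshaping the Laplacian spectrum.
lam, U = np.linalg.eigh(L)
sigma = 1.0 / (1.0 + lam)                # fixed transform here; learned in the paper
K = U @ np.diag(sigma) @ U.T

# Kernel regression on the two labeled endpoints of the chain.
labeled = [0, 3]
y = np.array([+1.0, -1.0])
coef = np.linalg.solve(K[np.ix_(labeled, labeled)] + 1e-6 * np.eye(2), y)
f = K[:, labeled] @ coef                 # predictions for all nodes
```

Smoother spectral transforms suppress high-frequency eigenvectors, which is why the choice of sigma directly controls the generalization behavior of graph-based SSL.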


Fig 2. The results of TOP (our method), LTKM (low-rank tensor kernel machine), NN (nearest neighbor), RSVM (ranking SVM), TF (tensor factorization) and GRTF (graph-regularized tensor factorization) on benchmark data.

Analogical Learning for Multi-relational Learning

For knowledge base completion we developed a novel framework that explicitly imposes analogical structures in multi-relational embedding (Figure 3). Our model enjoys both theoretical power and computational scalability, and significantly outperformed a large number of representative baseline methods on benchmark datasets. It also offers a unified view, subsuming several representative multi-relational learning methods recently developed in machine learning (ICML 2017).
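The analogical constraint can be sketched concretely: each relation is embedded as a linear map, and analogy requires these maps to be normal and mutually commuting. One family satisfying this is block-diagonal matrices of 2x2 rotation-scaling blocks, scored bilinearly; the parameter values and dimensions below are illustrative, not trained embeddings:

```python
import numpy as np

def relation_matrix(params):
    """Block-diagonal relation map built from 2x2 blocks [[a, -b], [b, a]].
    All matrices of this form are normal and commute with one another,
    which is the analogical-structure constraint."""
    d = 2 * len(params)
    W = np.zeros((d, d))
    for i, (a, b) in enumerate(params):
        W[2*i:2*i+2, 2*i:2*i+2] = [[a, -b], [b, a]]
    return W

def score(s, W_r, o):
    """Bilinear triple score s^T W_r o for (subject, relation, object)."""
    return s @ W_r @ o

# Two illustrative relations; their maps commute by construction.
W1 = relation_matrix([(1.0, 0.5), (0.3, -0.2)])
W2 = relation_matrix([(0.7, -0.1), (0.4, 0.9)])
```

Commutativity is what lets a chain of relations be applied in either order, mirroring the commutative diagram in Figure 3.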


Fig 3. Commutative diagram for the analogy between the Solar System (red) and the Rutherford-Bohr atom model (blue). The new relation (nucleus attracts charge) is inferred by analogy from the existing mirror structures.

We also enhanced the power of knowledge transfer through graph-based kernel induction (AAAI 2017). Our new framework does not require a shared feature space; instead, it uses a parallel corpus to calibrate domain-specific kernels into a unified kernel for label propagation across languages and domains, enabling semi-supervised learning from both labeled and unlabeled data. Our experiments on benchmark datasets showed advantageous performance of the proposed method over other state-of-the-art transfer learning methods (Figure 4).
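The bridging idea can be sketched with a toy example: two domain-specific similarity matrices are stitched into one unified graph through a few parallel (aligned) pairs, and labels then propagate from the source domain to the unlabeled target domain. This is a simplified illustration, not KerTL's actual kernel-calibration procedure; all matrices are toy values:

```python
import numpy as np

# Within-domain similarities (toy): 3 source items, 2 target items.
K_src = np.eye(3) + 0.5 * np.ones((3, 3))
K_tgt = np.eye(2) + 0.5 * np.ones((2, 2))

# Parallel corpus: source item 0 aligns with target 0, source 1 with target 1.
C = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

# Unified graph over both domains, bridged by the parallel pairs.
W = np.block([[K_src, C],
              [C.T, K_tgt]])
np.fill_diagonal(W, 0.0)
S = W / W.sum(axis=1, keepdims=True)

Y = np.array([1.0, -1.0, 0.0, 0.0, 0.0])   # labels exist only in the source domain
alpha, F = 0.9, Y.copy()
for _ in range(200):
    F = alpha * S @ F + (1 - alpha) * Y

target_pred = F[3:]                         # induced labels for target-domain items
```

Without the bridge C the two blocks would be disconnected and no label information could reach the target domain, which is exactly the gap the parallel corpus closes.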


Fig 4. The results of KerTL (our method) and other state-of-the-art methods on the APR and MNIST benchmark datasets.

Other accomplishments supported in part by this NSF grant include the development of a deep learning framework for extreme multi-label text classification (SIGIR 2017), a large-scale kernel approximation algorithm (IJCAI 2017), and a cross-lingual distillation framework for text classification (ACL 2017).


Last Modified

Thu Apr 19 14:26:49 EDT 2018