A Brief Literature Review of Semi-supervised Learning
Yanjun Qi
Large amounts of unlabeled data are available in many real-life data-mining tasks. Consequently, semi-supervised learning has become a topic of significant recent interests. Many research works are proposed within this area these years. Here we aim to give a brief review about the related literature and also try to give a framework for related research efforts.
Semi-supervised learning combines both labeled and unlabeled data and is applicable to both classification and clustering. In supervised classification, there is a known, fixed set of categories and category-labeled training data is used to induce a classification. In semi-supervised classification, training also exploits additional unlabeled data, frequently resulting more accurate classification function [1-21]. At most of times, semi-supervised learning represents this case. But in recent years, there are some researchers successfully use labeled or labeled style data to help the unsupervised clustering, which could be called semi-supervised clustering [25-35]. We briefly overview these two categories’ work separately.
In the terminology of supervised learning, a semi-supervised classifier trains on both the training set and the unlabeled examples. Its desire is to employ both labeled and unlabeled data to improve learning accuracy. In [5][10][22], the authors tried to give some overview about the existing semi-supervised classification methods. [10] pointed out that generally speaking, unlabeled data does not help classification when the unlabeled data distribution does not relate to the classification task, or in a form that the classifier is not able to represent or exploit. The general assumption for unlabeled example to be helpful is that P(x) tells something about P(y|x) relevant to the decision boundary. We assume that labeled data and unlabeled data should be drawn IID from the same distribution. One assumption used by many existing algorithm is that conditionally distribution P(y|x) should not change much when the marginal density P(x) is high (which also termed as cluster assumption in [22]). Theoretical insights into semi-supervised classification are rather limited. [5] referred that labeled data is exponentially more useful than unlabeled data under certain assumptions.
We roughly follow [5][10]’s framework to classify the existing semi-supervised classifiers and make an overview for each kind of them.
In addition, Zhu 2003 Thesis
Proposal : http://www-2.cs.cmu.edu/~zhuxj/pub/sslproposal.pdf
made
a summary about the semi-supervised methods. He proposed that there are five
representative semi-supervised classifiers in the existing works: (a) mixture
models; (b). Transductive SVM; (c). Graph based method; (d). Metric based
method to reduce overfitting by unlabeled data; (e). co-training. Actually this
categorization style is roughly corresponding with mine.
Generative models that jointly account for both x and y can be naturally take advantage of the unlabeled examples. Consider a joint model P(x,y| theta) , unlabeled examples can be used to estimate parameter theta, for instance, by maximizing the joint likelihood. To maximize the joint log-likelihood in this case, the maximization is on the sum of the joint likelihood from labeled data and the marginal likelihood from the unlabeled data (because no y available for unlabeled). The expectation-maximization (EM) provides an efficient way of solving this optimization problem.
This topic was an old one in the statistics community. [5] mentioned that most their works were about iterative algorithms for building maximum likelihood classifiers from labeled and unlabeled data, primarily for mixtures of normal distributions. In mid 90’s, machine learning community began to put efforts ([4], [7], et al.) on using likelihood maximization of mixture models for semi-supervised classification.
A lot of work did the above job in a parametric way, which used a family of joint distributions to model the available information. For examples, David 96 [4] use the Gaussian mixture clusters to model the data. Jordan & Jacobs 1994 [5] proposed the hierarchical mixture-of-experts model, which is similar with mixture models, and their parameters are set with EM. Nigam 98 [5] used the multinomial naïve Bayes model to mode the text documents. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumption of the model.
But these assumptions are always violated in practice and were reported that the unlabeled data in this case may even hurt the performance of the mixture models by [23, 24]. This is probably costed by the joint estimation approach may focus too much attention on modeling aspects P(x) that we do not care about, and not pay enough attention to P(y|x). Then Nigam [4,5] extends their work to include a weighting factor to modulate the contribution the unlabeled data. They also changed to use a multiple mixture component model per class. [6] pointed out that the weighting factor proposed from [4] is too heuristic and they presented to use homotopy continuation approach to find the weighting between labeled contribution and unlabeled contribution for the task. The method is a general regularization approach useful for combining heterogeneous sources and extendable to the case of an incorrect model family.
There are many semi-supervised classifiers that could be categorized as discriminative style. The key question here is: how to relate the conditional probability P(y|x) to the marginal distribution P(x). Based on different assumptions, these classifiers are divided primarily as followings:
Joachims 99 [21] attempts to use transduction with support vector machine to maximize the classification margin on both labeled and unlabeled data, while classifying the labeled data as correctly as possible. The discriminative method imposed fewer restrictions on the data model than do Gaussian mixture models. However, finding the optimal decision boundary requires searching over possible labels of unlabeled points, which is NP-complete. It seems that the intuition behind transductive SVMs is that they assume decision boundaries lie between classes in low-density regions of instance space, and that unlabeled examples help find these areas. However, then there were research work that argue both theoretically and experimentally that transductive SVMs are likely to be helpful for classification in general [5].
Kristin 2002 [20] proposed to change use an ensemble (booting) algorithm to handle semi-supervised problems. It alternates between assigning “pseudo-classes” to the unlabeled data using the existing ensemble and also constructing the next base classifier using both the labeled and unlabeled data. Mathematically, this intuitive algorithm corresponds to maximizing the classification margin in hypothesis space as measured on both the labeled and unlabeled data. And actually this ensemble method using decision trees won the NIPS 2001 unlabeled data competition.
The maximum entropy framework proposed by Jaakkola 1999, also try to optimize the margin based on labeled and unlabeled data in a discriminative framework. It finds the maximum entropy (or minimum relative entropy) distribution over classifier parameters, subject to the soft constraints that each labeled examples is at least a fixed margin from the decision boundary. Instead of searching over the labeling directly, it searches over the distribution of labeling for both labeled and unlabeled. Unlabeled data is used to add a distribution over the unknown class labels that are jointly estimated by maximum entropy.
Within semi-supervised learning problems, there is one big kind algorithms learning with only positive and unlabeled examples. This always happens for some real applications, like fraud detection, anomaly detection, and so on. [19] presented an approach to learn from positive and unlabeled examples by weighted logistic regression. It proves that: for noisy observed label y’ and actual label y, P (f(x) | y’=-1) + P (f(x) =-1|y’=1) is linear related to P (f(x) =1|y=-1) + P (f(x) =-1|y=1). The algorithm multiplied examples that are labeled positive / negative by weight P [y’=-1] or P (y’ =1). They are called weighted examples. After weighting, if it set threshold to be 0.5, we would get correct classification. It tried to minimize an upper bond to the sum of false positive and false negative frequency. The paper brought out a new performance measure: precision*recall/P(y=1). It is proportional to the square of geometric mean of precision & recall (similar with F measure). Finally it claimed that this weighted logistic regression method seems effective when linear function is good classifier and F-Score is a good performance measure.
From probability aspect, the key question of using unlabeled data is: how to relate the conditional probability P(y|x) to the marginal distribution P(x). Szummer [12] gave the condition probability the following form:
P( y|x ) = Sum i = 1 to N { P( y|i ) * P( i | x) } / N;
It means that the conditional probability of
specific input x , is related with all the input labeled and unlabeled data.
Then we could use different ways to model the marginal P( i | x) just as followings.
l Non-Parametric modeling of Marginal Density P(x):
Szummer and Jaakkola 2000 [10, 12] used unlabeled data in classification through non-parametric kernel expansion. In essence, the features of each labeled data point include the kernel densities between it and all other examples, labeled and unlabeled. Using this general-purpose kernel density model, the actual classification step can be done by maximizing the conditional likelihood, or by using margin based regularizers, or by doing maximum entropy regularization. The intuition is that the unlabeled data are used to weight the relative importance of the labeled data based on the instance integrity. The weighting informs the discriminative training about which examples are more important, and which are less so. It is very essential to carefully select the form and width of the kernel to accurately model the underlying instance distribution.
l Graph (neighborhood) Structure Assumption of Marginal Density P(x)
The kernel expansion does not take advantage of any special structure in the data input space. [11-12, 13 – 18] is a promising family of techniques that exploit the “manifold structure” of the data; such methods are generally based up an assumption that similar unlabeled examples should be given the same classification. Most of this family of methods would first place the data points on to a graph based on the distance relationships between examples and then use the known labels to perform some type of graph partitioning. Also these kinds of methods have close relationship with diffusion process or spectral clustering.
Szummer and Jaakkola 2000 [11, 12] proposed to use the Markov random walk representation to model the marginal, which try to exploit the manifold structure within data. The data was first transformed onto a graph with edges weighted by some metric distance between data points. Then follow the data manifold by performing a random walk on the graph and construct the representation from the Markov random walk probability (conditional probability). And actually the random walk could take either generative style, or classification style. In generative style, it infer the label by averaging over the label distribution of the starting points. In classification style, its forward-walk model start from some point, and calculate the ending point conditional probabilities.
Zhu 2003 [13] proposed a semi-Supervised learning using Gaussian Fields and Harmonic Functions. The algorithm is also working on a graph data. Labeled and unlabeled data are represented as vertices in the weighted graph, with edge weights encoding the similarity between instances. The problem is formulated in terms of a Gaussian random field on the graph. The mean of the GRF is characterized in terms of harmonic functions. This method can be viewed as a form of nearest neighbor approach, where the nearest labeled examples are computed in terms of a random walk on graph. The paper made a very good relationship discussion about its method with random walk, electric networks and spectral graph theory. Moreover, the paper made use of class mass normalization (CMN) to adjust the class distributions to match the priors. The paper used “average label entropy” and gradient decent to learn the parameters of its weight model.
Zhou 2003 [15-17] brought out the concept of consistency assumption: a classifier should be sufficiently smooth with respect to the structure revealed by these known labeled / unlabeled points. And it proposed the key assumption of a semi-supervised learning should be local consistency plus global consistency. They first based on pair-wise distance to change the vectorial data set into a graph data set. Then, they use an idea very similar with random work to spread labels to the whole graph. The algorithm is a combination of spectral clustering and spreading activation network. They verified the methods using a physical spin model. The paper provides the quantified way to select two of the model parameters, one based on local consistency, the other derived on global consistency assumption. It also applied this very similar framework in [17], where the learning process is trained on only positive and unlabeled examples.
Blum 2001 used the minimum cut to partition the weighted graph constructed from the data instances with agrees with the labeled points, which can be thought of as giving the most probable label assignment if one views labels as generated according to a Markov random field on the graph. Blum 2004 [14] made an extension by adding randomness to the graph structure and gave theoretical justification from both a Markov random field perspective and from sample complexity considerations.
Chapelle 2003 [18] introduced a framework to design “cluster kernels” so that the distance between two points is smaller when they are in the same cluster.
l Other Structure Assumption of Marginal Density P(x)
Besides of methods taking advantage of “manifolds” structure to model the marginal, for different application domains, there are many other structure assumptions researcher could make in case of different domain. For instance, trees prominently arise in both natural and human-generated domains (e.g. in biology and language). Griffiths 2003 [9] proposed an approach to semi-supervised learning based on embedding the data in an ultrametric space – a rotted tree with the data located at leaf notes equidistance from teh roote. The tree (or a distribution over trees) may be infered using the unlabeled data. Then they did the non-parametric Bayesian classification. A prior over concepts generated by a mutation process on the inferered trees allows efficient computation of the optimal Bayesian classification from the labeled examples.
Co-Training [1-3] classifiers data that exhibits multiple views, specifically when data exhibits multiple views, specifically when data attributes can be partitioned into groups (views) that are individually sufficient for learning but conditionally independent given the class. It works by seeding classifications made by one learner as examples for the other and vice-verca. However, in practice the individual learners are noisy, and then co-training easily goes down the wrong path. Blum and Mitchell (1998) [1] show that under certain theoretical assumptions, a weak learner can be arbitrarily improved given sufficient unlabeled examples.
Muslea 2002 [2] gave a summary comparison between Mixture model with EM, Co-Training, Co-EM and their proposed active learning Co-Testing plus Co-EM. They claimed that active learning improved the performance a lot and robust at the same time.
Denis 2003 [3] explores the similar idea but to the specific problem of learning from a pool of positive data, without any negative data but with the help of unlabeled data. It presented two algorithms: PNB and PNCT. PNB is a naïve Bayes algorithm from positive and unlabeled examples, with underlying generative model assumption. Then it designed a co-training algorithm PCNT assuming that the datasets have a natural separation of their features into two sets and the available positive data is very small. It incrementally builds classifiers over each of the two views with a PNNB algorithm. The PNNB algorithm estimates negative word probabilities by combination of set of negative examples and set of positive examples plus unlabelled data.
Although often not explicit stated, selected sampling is another way of integrating unlabeled data into supervised learning. Selective sampling is a form of active learning where one must select an existing example for labeling. One approach for selecting examples is to try to maximally reduce the variance component of classification error. Another is query-by-committee, whereby a committee of classifier variants is used to select an example that has high classification variance itself. There is a large research community working on this topic. Here we would not cover more detailed review.
Semi-supervised clustering uses a small amount of labeled data to aid and bias the clustering of unlabeled data. [25-28] are some recent work about semi-semi-supervised clustering. Generally speaking, semi-supervised clustering falls into two general approaches: search-based and similarity-based.
In search-based methods, the clustering algorithm itself is modified so that user-provided labels or constraints are used to bias the search for appropriate partition. [25] (2002) explores the use of the labeled data to generate the initial seed clusters, as well as the use of constraints generated from labeled points to guide the clustering. Here the constraints are provided in the form of “must-link” or “cannot-link”. The clustering model actually is a probabilistic style K-Means, where the labeled data provides prior information about the conditional distribution of hidden category labels. The search-based method could also be done by modifying the clustering’s objective function to satisfy those constraints.
In similarity-based approaches, an existing clustering algorithm that uses a similarity metric is employed; however, the similarity metric is first trained to satisfy the labels or constraints in the supervised data. We would cover this topic in the following section 3 in details.
[26, 28] presents a unified approach to incorporate both of the search-based and similarity-based techniques and proved to be better than either of the individual approaches. It not only formulates the goal of clustering in the pairwise constraints as minimizing a combined objective clustering function, but also further parameterized Euclidean distance with a symmetric positive-definitely weight matrix. But in their real implemented method, this matrix is assumed into too simple diagonal format trained from labeled data.
[27] won the best paper award of KDD 2004 and formalize the above semi-supervised learning into the probabilistic framework. The semi-supervised clustering is on Hidden Markov Random Field (HMRF) and is a generalization version for the method [28] we talked in the last paragraph, that combines constraints and Euclidean distance learning. This method allows the using of a broad range of clustering distortion measures, including Bregman divergence (e.g. Euclidean distance and I-divergence) and directional similarity measure (e.g. cosine similarity). [27] also proposed a way that performs partitional semi-supervised clustering of data by minimizing an objective function derived from the posterior energy of the HMRF model.
Many supervised and unsupervised learning algorithms are very sensitive to the choice of an appropriate distance metric. Some commonly used algorithms are nearest neighbor classifiers, radial basis function networks, and support vector machines for classification tasks and the k-means algorithm for clustering tasks. The performance of these methods often depends critically on the choice of an appropriate metric. Instead of choosing the metric manually, a promising approach is to learn the metric from the data automatically. Most works related in this section can be also correspondingly put into the above sections, but here our emphasis is more put on the distance metric learned.
This idea could be dated back to some early work on optimizing the metric for kNN density estimation. More rent research along this line continued to develop various locally adaptive metric for nearest neighbor classifiers. Besides kNN, there are other methods that also perform metric learning based on nearest neighbors, e.g., radial basis function network and variants. The information used in these works could be called neighbor information and essentially they take into account the manifold structure of the feature space by preserving local metric information in the input space.
Besides getting metric only from features, researchers recently also try to adjust the distance metric by the labeling within classification framework. [34-35] are some recent works about learning distance metric from the help of label information.
Under the classification setting, Zhang 2003 [35] proposed to learn the metric from data via kernel methods and multidimensional scaling (MDS) techniques. [35] define discriminant kernels on the joint space of input features and output label spaces. In essence, the kernel is just a weighted sum of a kernel on feature and a kernel on the label.
Under semi-supervised setting, unlabeled data can also be used to adapt kernels and distance metric. For example, in the Fisher kernel, unlabeled data is used to train a generative model from which a kernel is derived. The kernel can then be used in any discriminative kernel method, such a support vector machine or a nearest neighbor classifier.
The right distance metric is task-specific and should be learned. McCallum and Minka 1999 addressed this issue by using the limited labeled data and large quantities of unlabeled data to bootstrap the learning of a distance metric for a multinomial naïve Bayes classifier. In this classifier, the metric is represented with Polya trees (Dirichlet trees) as prior over multinomial mixture components. This is generative style. Alternatively, [34] (2004) seek a discriminative model that also learns a distance metric as an integral part of its training. They outlined a model for conditionally-trained, semi-supervised learning, where parameters learned from maximum likelihood estimation. The approach is based on Markov random fields with tie parameter, which affect classification of individual instances based on their own features and also capture the similarity between pairs of instances. Unlabeled data can be used in both training and classification. Further due to intractable exact inference, it used a graph partitioning algorithm to approximate inference.
While class label information is available for metric learning in classification tasks, such information is generally unavailable in conventional clustering tasks. To adapt the metric appropriately to improve the clustering results, some additional background knowledge or supervisory information could be made available. This information usually is called pairwise (dis)similarity side information, and less structured then the traditional, homogeneous training sets expected by them. [29-33] are some recent works about learning distance metric from the side information.
Eric Xing (2002) [29] proposed using pairwise side information in a novel way to learn a global Mahalanobis metric before performing clustering with constraints. They provided a systematic way and posed the metric learning as a convex optimization problem. The criterion is simply to minimize the sum of distance between similar points and constrained on the such of distance of dis-similar points or general points. The optimization is done by a iterative process.
Instead of the iterative optimization process in [29], Bar-Hillel [31] proposed the metric learning method called Relevant Component Analysis (RCA) algorithm, which is a simple and efficient algorithm for learning a full ranked Mahalanobis metric. It first used equivalent constraints to get small groups of similar data points called “chunklet”, and then used the sum of these chunklets’ convariance matrix to make a global linear transformation for original data. The method aims to assign large weights to “relavant dimensions” and low weights to “irrelevant dimensions”, where the “relevant dimensions” are estimated using “chunklets”. The RCA transformation is intended to reduce clutter, so that in the new feature space, the inherent structure of the data can be more easily unraveled. In [31], the author extends the RCA by incorporating both pair similarity and dissimilarity constraints into EM algorithm for modeling based clustering based on Gaussian mixture models.
[30] proposed to learn metric by performing nonlinear transformation globally but linear transformation locally. It also only consider pairwise similarity constraints and formulate the learning problem as a optimization problem. We could see it as a method unifying neighbor information and the side information.
Reference:
1) Avrim Blum, Tom Mitchell (1998), Combining Labeled and Unlabeled Data with Co-Training, COLT: Proceedings of the Workshop on Computational Learning Theory
2) Ion Muslea, et al (2002), Active + semi-supervised learning = robust multi-view learning, ICML-2002
3) F. Denis, etc (2003) Text classification and Co-training from positive and unlabeled examples. ICML-2003 workshop: the Continuum from labeled data to unlabeled data in Machine Learning and Data Mining
4) K. Nigam, etc, T. Mitchell. (1999) Text classification from labeled and unlabeled documents using EM. Machine Learning, 1999
5) K. Nigam (2001) Using Unlabeled Data to Improve Classification. CMU-CS-01-126 Technical Report 2001.
6) A. Corduneanu, T. Jaakkola, (2002) Continuation Methods for Mixing Heterogeneous Sources. UAI 2002
7) David J. Miller, et al.(1996). A Mixture of Experts Classifier with Learning Based on Both Labelled and Unlabelled Data. NIPS 1996
8) Evgenia Dimitriadou, et al. (2002) A Mixed Ensemble Approach for the Semi-supervised Problem. ICANN 2002: 571-576
9) C., Griffiths, T. L., et al. (2003) Semi-Supervised Learning with Trees. NIPS 2003
10) M.
Szummer and T. Jaakkola. Kernel expansions with
unlabeled examples. NIPS 2000.
11) M. Szummer and T. Jaakkola. Partially labeled classification with markov random walks, NIPS 2001
12) M. Szummer (2002) Learning from Partially labeled data, Doctor Thesis, MIT, 2002
13) X. Zhu, et al. (2003) Semi-Supervised learning using Gaussian Fields and Harmonic Functions. ICML 2003
14) A. Blum et al. (2004) Semi-supervised learning using randomized Mincuts. ICML 2004
15) D. Zhou, et al. (2003) Learning with local and global consistency. NIPS 2003
16) D. Zhou et al. (2004) Semi-supervised learning by maximizing smoothness, (submitted) Journal of Machine Learning research X (2004)
17) D. Zhou, etc. (2003) Ranking on Data Manifolds. NIPS 2003
18) Chapelle, O., J. Weston and B. Schölkopf: (2003) Cluster Kernels for Semi-Supervised Learning. NIPS 2003.
19) W. Lee (2003) Learning with Positive and Unlabeled Examples using Weighted Logistic Regression. ICML 2003
20) Bennett, K. P., et al. (2002). Exploiting Unlabeled Data in Ensemble Methods. In Proceedings of the Eighth ACM SIGKDD 2002
21) T. Joachims.(1999). Transductive inference for text classification using support vector machines. Proceedings of the 16th International Conference on Machine Learning, pages, 1999.
22) Seeger,
M. (2001). Learning with Labeled and unlabeled data. (Technical Report).
23) Cozman, Fabio G.; Cohen, Ira (2001). Unlabeled Data Can Degrade Classification Performance of Generative Classifiers. HP Technical Report HPL-2001-234
24) Fabio Gagliardi Cozman, Ira Cohen (2003), Semi-Supervised Learning of Mixture Models, (ICML-2003)
25) Sugato Basu, Arindam Banerjee, and Raymond J. Mooney, (2002), Semi-supervised Clustering by Seeding, (ICML-2002)
26) Sugato Basu, et al (2003 ),Comparing and Unifying Search-Based and Similarity-Based Approaches to Semi-Supervised Clustering, ICML-2003 Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining
27) Sugato Basu, et al, A Probabilistic Framework for Semi-Supervised Clustering, KDD-2004, (Best Research Paper Award)
28) Mikhail Bilenko, Sugato Basu, and Raymond J. Mooney, (2004) Integrating Constraints and Metric Learning in Semi-Supervised Clustering, (ICML-2004)
29) Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart Russell (2002), Distance Metric Learning, with application to Clustering with side-information, NIPS 2002
30) Hong Chang, Dit-Yan Yeung (2004), Locally Linear Metric Adaptation for Semi-Supervised Clustering, (ICML-2004)
31) Aharon
Bar-Hillel, et al. (2003). Learning a Mahalanobis Metric with Side
Information.
32) Noam
Shental, et al. (2003) Computing Gaussian Mixture Models with EM using
Equivalence Constraints.
33) Tomer Hertz, et al. (2004) Boosting Margin-Based Distance Functions for Clustering. In Proceedings of 21st International Conference on Machine Learning (ICML- 2004 )
34) Wei
Li and Andrew McCallum (2004). A Note on
Semi-supervised Learning using Markov Random Fields. Technical
Note,
35) Zhihua Zhang (2003). Learning Metrics via Discriminant Kernels and Multidimensional Scaling: Toward to Expected Euclidean Representation. (ICML'03)