Language Technologies Institute | Information Retrieval Lab, 11-742
Guidelines:
This project is a comparative study of several semi-supervised classification algorithms proposed in the recent literature. We focus on binary classification problems whose class distribution is highly unbalanced. We present an empirical performance comparison from several angles and, where possible, offer some theoretical analysis of the results. In addition to the comparison, we provide the implemented methods and supporting functions for download.
Task | Done By | Status |
Proposal & Work Plan | August 20 | Complete |
Literature Review of Imbalanced Set Classification | September 10 | Complete |
Literature Review of Semi-supervised Classification & Data Collecting | October 10 | Complete |
Transductive SVM Train-Test | October 15 | Complete |
Manifolds Graph-based Semi-supervised Classifier Implementation & Train-Test | October 20 | Complete |
MLE-EM Approach Implementation & Train-Test | November 15 | Complete |
Comparison & Performance Analysis | November 25 | Complete |
Presentation | December 07 | Complete |
Add Data Set Analysis | December 07 | Complete |
One key difficulty with supervised learning approaches is that they require a large, often prohibitive, number of labeled training examples to learn accurately. Labeling must typically be done by a person, which is a painfully time-consuming process. This need for large quantities of expensive labeled examples motivates learning classifiers from a combination of labeled and unlabeled data, since unlabeled examples are generally much cheaper and easier to come by than labeled ones. Many algorithms built on this idea have appeared in recent years. They are called semi-supervised classifiers.
The goal of this project is to implement several algorithms proposed in the recent literature and to compare them empirically in one specific situation: problem sets that contain only two classes and have a highly unbalanced class distribution. Moreover, the asymmetric misclassification costs are not given explicitly in the problem. By asymmetric misclassification cost we mean that one of the two classes is the target class we want to predict, and that we prefer a false positive to a false negative, because missing a target example costs more than raising a false alarm.
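To make the effect of asymmetric costs concrete, the following is a minimal sketch in Python. The cost values (`cost_fp=1.0`, `cost_fn=10.0`), the function names, and the toy data are assumptions chosen for illustration only; as noted above, the costs are not given explicitly in the problems we study. The sketch shows that on a skewed data set a classifier with higher accuracy can still incur a much higher total cost.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of examples classified correctly."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def total_cost(y_true, y_pred, cost_fp=1.0, cost_fn=10.0):
    """Total misclassification cost; the two cost values are hypothetical."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return cost_fp * fp + cost_fn * fn


# Toy data: 5 positive (target) examples out of 100.
y_true = [1] * 5 + [0] * 95

# Classifier A predicts everything as negative and misses every target.
all_negative = [0] * 100

# Classifier B finds all 5 targets but also raises 10 false alarms.
eager = [1] * 15 + [0] * 85

acc_a, cost_a = accuracy(y_true, all_negative), total_cost(y_true, all_negative)
acc_b, cost_b = accuracy(y_true, eager), total_cost(y_true, eager)
print(f"A: accuracy={acc_a:.2f}, cost={cost_a:.1f}")
print(f"B: accuracy={acc_b:.2f}, cost={cost_b:.1f}")
```

Under these assumed costs, classifier A is more accurate but far more expensive than classifier B, which is why accuracy alone is a poor yardstick for the unbalanced setting considered here.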
This study is motivated by the fact that in some application areas, such as image classification and protein interaction networks, the positive and negative classes occur in a highly skewed ratio in real data. Obtaining enough labeled data for classification is very time-consuming or even impossible, whereas large amounts of unlabeled examples can be collected quickly and cheaply. In this situation, applying semi-supervised classification is a natural choice. We hope this project demonstrates the differences and similarities among some recent semi-supervised classifiers in this specific case.
To better understand the related previous research, we carried out two comprehensive literature reviews, one on imbalanced set classification and the other on semi-supervised learning. Our review of semi-supervised learning in fact covers more than the semi-supervised classification emphasized in this lab. Based on these reviews, we chose three semi-supervised classifiers and tested their performance in our project:
· T. Joachims (1999). Transductive inference for text classification using support vector machines. Proceedings of the 16th International Conference on Machine Learning, 1999.
  Semi-supervised learning by maximizing the margin with the help of unlabeled data.
· X. Zhu, et al. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. ICML 2003.
  Semi-supervised learning using Gaussian fields and harmonic functions on a graph over the labeled and unlabeled data.
· David J. Miller, et al. (1996). A Mixture of Experts Classifier with Learning Based on Both Labelled and Unlabelled Data. NIPS 1996.
  Semi-supervised learning of mixture models, focusing on maximum-likelihood estimators and generative models.
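As an illustration of the graph-based approach in the list above, here is a minimal sketch of the harmonic property that underlies Zhu et al.'s method: the score of each unlabeled node equals the weighted average of its neighbors' scores, while labeled nodes stay clamped to their labels. This is not the implementation used in the project. The function name `harmonic_scores`, the toy chain graph, and the iterative averaging scheme are assumptions made to keep the example dependency-free; the same scores can also be obtained in closed form by solving a linear system on the graph Laplacian.

```python
def harmonic_scores(weights, labels, n_iter=1000, tol=1e-10):
    """Approximate harmonic scores on a weighted graph by repeated averaging.

    weights: symmetric n x n list of lists of non-negative edge weights,
             assumed to have a zero diagonal (no self-loops).
    labels:  dict mapping a labeled node index to its label (0.0 or 1.0).
    Returns a list of n scores; labeled nodes keep their given label.
    """
    n = len(weights)
    # Labeled nodes are clamped; unlabeled nodes start at a neutral 0.5.
    scores = [labels.get(i, 0.5) for i in range(n)]
    unlabeled = [i for i in range(n) if i not in labels]

    for _ in range(n_iter):
        max_change = 0.0
        for i in unlabeled:
            degree = sum(weights[i])
            if degree == 0:
                continue  # isolated node: nothing to average over
            # Replace the score by the weighted average of the neighbors.
            new_score = sum(w * scores[j] for j, w in enumerate(weights[i])) / degree
            max_change = max(max_change, abs(new_score - scores[i]))
            scores[i] = new_score
        if max_change < tol:
            break
    return scores


# Toy chain graph 0 - 1 - 2 - 3 with unit weights (hypothetical data).
# Node 0 is labeled positive, node 3 is labeled negative.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = harmonic_scores(chain, {0: 1.0, 3: 0.0})
print([round(s, 3) for s in scores])
```

On this chain the unlabeled scores fall off linearly between the two labeled ends. Turning scores into class decisions needs a threshold; cutting at 0.5 is only one simple choice and may not suit a highly unbalanced class distribution.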