IR-Lab Project of Yanjun Qi


Title:  A Case Study of Semi-supervised Classification Methods for Imbalanced Classification Situation

 

Language Technologies Institute
Carnegie Mellon University
Fall 2004

Information Retrieval Lab, 11-742
Lab instructor: Yiming Yang

 

 


Guidelines:

 


Abstract:

This project aims to provide a comparative study of several semi-supervised classification algorithms proposed in the recent literature. We focus on binary classification problems whose class distribution is highly unbalanced. We present an empirical performance comparison from several angles and, where possible, offer some theoretical analysis of the results. In addition to this comparison, we make the implemented methods and supporting functions available for download.

 

 

Task | Done By | Status
Proposal & Work Plan | August 20 | Complete
Literature Review of Imbalanced set classification | September 10 | Complete
Literature Review of Semi-supervised classification & Data Collecting | October 10 | Complete
Transductive SVM Train-Test | October 15 | Complete
Manifolds graph-based semi-supervised classifier Implementation & Train-Test | October 20 | Complete
MLE-EM approach Implementation & Train-Test | November 15 | Complete
Comparison & Performance Analysis | November 25 | Complete
Presentation | December 07 | Complete
Add Data Set analysis | December 07 | Complete

 

 

 


Introduction:

            One key difficulty with supervised learning approaches is that they require a large, often prohibitive, number of labeled training examples to learn accurately. Labeling must typically be done by a person, which is a painfully time-consuming process. This need for large quantities of expensive labeled examples motivates learning classifiers from a combination of labeled and unlabeled data. In general, unlabeled examples are much less expensive and easier to come by than labeled examples. Many algorithms built on this idea have been proposed in recent years; they are called semi-supervised classifiers.

 

            The goal of this project is to implement several algorithms proposed in the recent literature and to compare them empirically in one specific setting: problem sets that contain only two classes and have a highly unbalanced class distribution. Moreover, the asymmetric misclassification costs are not given explicitly in the problem. By asymmetric misclassification cost we mean that one of the class values is the target class for which we want predictions, and that we prefer a false positive to a false negative.
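To make the effect of class imbalance concrete, the following minimal sketch compares plain accuracy with precision and recall on the target (positive) class. It is only an illustration: the function names `confusion_counts` and `imbalance_report`, the 5-to-95 class ratio, and the choice of metrics are assumptions made for this example, not the evaluation protocol of this project.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for the target class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive and t != positive:
            fp += 1
        elif p != positive and t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def imbalance_report(y_true, y_pred, positive=1):
    """Return accuracy plus precision, recall and F1 of the target class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical data: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95

# A classifier that always predicts the majority class.
always_negative = [0] * 100

# A classifier that finds 4 of the 5 positives at the price of 10 false alarms.
recall_oriented = [1, 1, 1, 1, 0] + [1] * 10 + [0] * 85

print(imbalance_report(y_true, always_negative))
print(imbalance_report(y_true, recall_oriented))
```

In this toy setting the always-negative classifier has the higher accuracy even though it never finds a target example, while the second classifier trades some false positives for most of the positives. That is the trade-off preferred when a false negative is considered worse than a false positive.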

 

            This study is motivated by the fact that in some application areas, such as image classification and protein interaction networks, the positive and negative classes occur in a highly skewed ratio in real data. Obtaining enough labeled data for classification is very time-consuming or even impossible, whereas large amounts of unlabeled examples can be collected quickly and efficiently. In this situation, applying semi-supervised classification seems very natural. We hope this project demonstrates some of the differences and similarities among recent semi-supervised classifiers in this specific case.

 

            To better understand the related previous research, we carried out two comprehensive literature reviews, one on imbalanced set classification and the other on semi-supervised learning. Our review of semi-supervised learning in fact covers more than the semi-supervised classification emphasized in this lab. Based on these reviews, we chose three semi-supervised classifiers and tested their performance in our project:

 

  1. Transductive Support Vector Machine

·        T. Joachims (1999). Transductive inference for text classification using support vector machines. Proceedings of the 16th International Conference on Machine Learning, 1999

·        Semi-supervised learning that maximizes the margin with the help of unlabeled data

 

  2. Semi-supervised learning from manifold neighborhood information

·        Semi-supervised learning using Gaussian Fields and Harmonic Functions

·        X. Zhu, et al. (2003). Semi-Supervised Learning using Gaussian Fields and Harmonic Functions. ICML 2003

 

  3. MLE-EM for learning from labeled and unlabeled data

·        David J. Miller, et al. (1996). A Mixture of Experts Classifier with Learning Based on Both Labelled and Unlabelled Data. NIPS 1996

·        Semi-supervised learning of mixture models by focusing on maximum-likelihood estimators and generative models
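
As an illustration of the second method in the list above, the closed-form harmonic solution of Zhu et al. (2003) can be sketched in a few lines. With graph weights W, degree matrix D, and Laplacian L = D - W, the scores of the unlabeled points are f_u = L_uu^{-1} W_ul f_l. This is a minimal sketch, not the implementation used in this project: the function name `harmonic_labels`, the fully connected RBF graph, the bandwidth `sigma`, and the 0.5 decision threshold are all assumptions made for the example.

```python
import numpy as np


def harmonic_labels(X, y_labeled, n_labeled, sigma=1.0):
    """Harmonic-function scores for the unlabeled rows of X.

    Assumes the first `n_labeled` rows of X are the labeled points and
    `y_labeled` holds their 0/1 labels in the same order.
    """
    # Pairwise squared Euclidean distances between all points.
    diff = X[:, None, :] - X[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)

    # Fully connected graph with RBF edge weights and no self-loops.
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Graph Laplacian L = D - W.
    D = np.diag(W.sum(axis=1))
    L = D - W

    # Partition into labeled (l) and unlabeled (u) blocks and solve
    # L_uu f_u = W_ul f_l for the unlabeled scores.
    L_uu = L[n_labeled:, n_labeled:]
    W_ul = W[n_labeled:, :n_labeled]
    return np.linalg.solve(L_uu, W_ul @ y_labeled)


# Hypothetical toy data: one labeled point per class, four unlabeled points.
X = np.array([
    [0.0, 0.0],   # labeled, positive
    [5.0, 5.0],   # labeled, negative
    [0.1, 0.0],
    [0.0, 0.2],
    [5.1, 5.0],
    [4.9, 5.2],
])
y_labeled = np.array([1.0, 0.0])

scores = harmonic_labels(X, y_labeled, n_labeled=2)
predictions = (scores > 0.5).astype(int)  # simple threshold, an assumption
print(scores, predictions)
```

On a highly unbalanced problem a fixed 0.5 threshold may be a poor choice, so the cut-off would likely need to be adjusted to the expected class proportions.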

 

 

 


Yanjun Qi, 2004-09-05