IR Seminar project by Yan Liu

Basic Information


 


Abstract:

Support Vector Machines, a kernel-based method, have been successfully applied in many areas, including text classification and protein classification. Much effort has gone into designing kernels for specific applications. However, there has been little study of how those kernels behave under different conditions, such as rare classes, multiple classes, and noisy data, all of which are common in real applications. In this project, I focus on several standard kernels and study their behavior under these conditions.

The project involves three steps: 1. a literature overview; 2. experiments on simulated data; 3. experiments on real application data.
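For reference, here is a minimal sketch of three kernels that are commonly called "standard": linear, polynomial, and Gaussian RBF. The abstract does not say which kernels the project settles on, so the selection, the function names, and the parameter defaults (degree, coef0, gamma) are assumptions made purely for illustration.

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    # k(x, y) = <x, y>
    return dot(x, y)

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    # k(x, y) = (<x, y> + coef0) ** degree; defaults are illustrative
    return (dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=0.5):
    # k(x, y) = exp(-gamma * ||x - y||^2); gamma is illustrative
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def gram_matrix(kernel, points):
    """Kernel matrix K[i][j] = kernel(points[i], points[j])."""
    return [[kernel(p, q) for q in points] for p in points]

# Toy points, not taken from either dataset
points = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
K_rbf = gram_matrix(rbf_kernel, points)
```

An SVM only sees the data through such a kernel matrix, which is why the choice of kernel can matter so much when some classes contribute only a handful of rows.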

 

Task                                                                          Done By      Status
Project proposal                                                              Sep. 4th     Completed
Proposal presentation                                                         Sep. 25th    Completed
Read papers on kernel methods and decide which kernels to study further       Sep. 30th    Completed
Generate synthetic data for different conditions and study kernel behavior    Oct. 4th     Completed
Mid-course presentation for projects                                          Oct. 10th    Completed

Coming Soon...

   

 


 

Data Analysis

Statistics: The dataset consists of 1356 sequences in 13 classes; each sequence has exactly one label. The distribution of sequences over classes is shown below. These statistics show that this is a multi-class classification problem with highly unbalanced data.

Family Name                     Num of Sequences
Class A                         1081
Class B                         83
Class C                         28
Class D                         11
Class E                         4
Class F                         45
OrphanA                         35
OrphanB                         2
Drosophila_Odorant_Receptors    31
Bacterial_Rhodopsin             23
Nematode_Chemoreceptors         1
Plant_Mlo_Receptors             10
Ocular_Albinism_Proteins        2
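To make the imbalance concrete, the following sketch recomputes a few summary numbers from the class counts listed above. The cutoff of 5 sequences for calling a class "rare" is an arbitrary assumption chosen for illustration, not a threshold used in the project.

```python
# Class counts copied from the table above
class_counts = {
    "Class A": 1081,
    "Class B": 83,
    "Class C": 28,
    "Class D": 11,
    "Class E": 4,
    "Class F": 45,
    "OrphanA": 35,
    "OrphanB": 2,
    "Drosophila_Odorant_Receptors": 31,
    "Bacterial_Rhodopsin": 23,
    "Nematode_Chemoreceptors": 1,
    "Plant_Mlo_Receptors": 10,
    "Ocular_Albinism_Proteins": 2,
}

total = sum(class_counts.values())
majority_name = max(class_counts, key=class_counts.get)
majority_share = class_counts[majority_name] / total

# Hypothetical cutoff: classes with fewer than 5 sequences count as rare
RARE_CUTOFF = 5
rare_classes = sorted(n for n, c in class_counts.items() if c < RARE_CUTOFF)

print(f"total sequences: {total}")
print(f"largest class: {majority_name} ({majority_share:.1%})")
print(f"classes with fewer than {RARE_CUTOFF} sequences: {rare_classes}")
```

The counts sum to the stated 1356 sequences, and Class A alone accounts for roughly 80% of them, so a classifier that always predicts Class A would already look accurate while missing every rare class.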

   

Statistics: The dataset consists of 7769 training documents and 3019 test documents, mapped to 90 categories. Each document has at least one label. The average cat-to-doc ratio is 1.27 for the training set. The category-document distribution is shown below:

Since each document can have multiple labels, the table below summarizes the number of labels per document in the training data. This analysis shows that this is a multi-label classification problem with unbalanced data.

Labels per document    Training data
1                      6577
2                      865
3                      192
4                      59
5                      37
6                      22
7                      5
8                      5
9                      3
10                     2
11                     1
12                     0
13                     0
14                     0
15                     1
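As a quick check on the table above, this sketch recomputes the number of training documents, the mean number of labels per document, and the share of documents with more than one label. The variable names are illustrative. Note that the mean obtained from these counts is about 1.23, slightly below the 1.27 ratio quoted earlier; the quoted figure may have been computed in a different way, and the source does not say how.

```python
# Number of training documents for each label count, copied from the table above
docs_by_label_count = {
    1: 6577, 2: 865, 3: 192, 4: 59, 5: 37,
    6: 22, 7: 5, 8: 5, 9: 3, 10: 2,
    11: 1, 12: 0, 13: 0, 14: 0, 15: 1,
}

total_docs = sum(docs_by_label_count.values())
total_labels = sum(k * n for k, n in docs_by_label_count.items())

mean_labels_per_doc = total_labels / total_docs
multi_label_share = (total_docs - docs_by_label_count[1]) / total_docs

print(f"training documents: {total_docs}")
print(f"label assignments: {total_labels}")
print(f"mean labels per document: {mean_labels_per_doc:.2f}")
print(f"documents with more than one label: {multi_label_share:.1%}")
```

The document count matches the stated 7769 training documents, and about 15% of them carry more than one label.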

 




by Yan Liu last updated at 10/20/2003 11:46 PM -0400