IR-Lab Project of Yanjun Qi


 

Some further analysis of each data set

 

            The classification performance on the 7 selected UCI data sets, which have different class ratios, suggests that performance is not mainly determined by the imbalance ratio. To make this point clearer, we investigated the distribution of each set: we used SVD to reduce the feature dimensionality and visualized the 7 sets on their first 3 principal components.
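The projection step can be sketched roughly as follows. This is a minimal illustration and not the code used for the figures: it assumes the features are mean-centered before the SVD (the text does not say whether they were), and the function name `project_top3` and the random stand-in data are hypothetical.

```python
import numpy as np

def project_top3(X):
    """Project the rows of X onto the first three principal components via SVD."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # assumption: center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:3].T                         # shape (n_samples, 3)

# Hypothetical stand-in data: 200 samples with 16 integer features
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 16, size=(200, 16))
Z = project_top3(X_demo)
print(Z.shape)
```

The three columns of `Z` can then be passed to any 3-D scatter plot, with one marker style per class, to inspect how the minority and majority classes are distributed.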

 

            From the following figures, together with the classification performance summary, we still keep our conclusion that:

 

 

1. Data Set 1

Dataset: Letter-a
% Minority Examples: 3.9
Dataset Size: 20000
Feature / Class Situation: 16 numeric (integer) features; 17 classes
Class Used: Letter “A” against all other letters
Unlabeled Data Size in Each Experimental Run: 2000
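The one-against-all labeling used here (letter “A” against all other letters) and the resulting minority percentage can be illustrated with a small sketch. The helper names `one_vs_rest` and `minority_percent` are hypothetical, and the labels below are toy values, not the real Letter data.

```python
def one_vs_rest(labels, positive):
    """Return 1 for the positive class and 0 for every other class."""
    return [1 if y == positive else 0 for y in labels]

def minority_percent(binary_labels):
    """Percentage of examples that fall in the smaller of the two classes."""
    n_pos = sum(binary_labels)
    n_neg = len(binary_labels) - n_pos
    return 100.0 * min(n_pos, n_neg) / len(binary_labels)

# Toy labels for illustration only
toy = list("ABCABDEFGA")
y = one_vs_rest(toy, "A")
print(y, minority_percent(y))
```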

 

 

 

 

 

 

2. Data Set 2

Dataset: Pendigits
% Minority Examples: 8.3
Dataset Size: 7494
Feature / Class Situation: 16 attributes (all input attributes are integers 0..100); 10 classes
Class Used: Digit “0” against all other digits
Unlabeled Data Size in Each Experimental Run: 2000

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

3. Data Set 3

Dataset: Letter-a-subset
% Minority Examples: 17.0
Dataset Size: 4639
Feature / Class Situation: 16 numeric (integer) features; 17 classes
Class Used: Letter “A” against letters “BCDEF”
Unlabeled Data Size in Each Experimental Run: 2000
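Building a subset such as letter “A” against letters “BCDEF” amounts to dropping all other classes before binarizing the labels. The sketch below shows one way this could be done; the function name `select_one_vs_group` is hypothetical and the data are toy values, not the real Letter set.

```python
def select_one_vs_group(rows, labels, positive, negatives):
    """Keep only examples from the named classes and binarize the labels.

    Returns (kept_rows, binary_labels), with 1 for `positive` and 0 for
    any class listed in `negatives`; all other classes are dropped.
    """
    keep = set(negatives) | {positive}
    kept_rows, binary = [], []
    for row, label in zip(rows, labels):
        if label in keep:
            kept_rows.append(row)
            binary.append(1 if label == positive else 0)
    return kept_rows, binary

# Toy data for illustration only
toy_labels = list("AZBQCAF")
toy_rows = [[i] for i in range(len(toy_labels))]
rows, y = select_one_vs_group(toy_rows, toy_labels, "A", "BCDEF")
print(rows, y)
```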

 

 

 

 

 

 

 

4. Data Set 5

Dataset: Yeast
% Minority Examples: 28.9
Dataset Size: 1484
Feature / Class Situation: 8 attributes (numerical); 10 classes
Class Used: “NUC” against all the other localizations (429 positive)
Unlabeled Data Size in Each Experimental Run: 1350
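How the unlabeled data are drawn in each experimental run is not described here. As a rough sketch, assuming the unlabeled pool is sampled uniformly at random without replacement, the split for the Yeast sizes could look like this. The function name `draw_unlabeled_pool` and the seed are hypothetical, and how the remaining examples are used is likewise an assumption.

```python
import random

def draw_unlabeled_pool(n_total, n_unlabeled, seed=0):
    """Randomly pick indices for an unlabeled pool; return (pool, rest)."""
    rng = random.Random(seed)          # fixed seed so a run can be repeated
    idx = list(range(n_total))
    rng.shuffle(idx)
    return sorted(idx[:n_unlabeled]), sorted(idx[n_unlabeled:])

# Sizes taken from the Yeast entry: 1484 examples, 1350 unlabeled per run
pool, rest = draw_unlabeled_pool(1484, 1350)
print(len(pool), len(rest))  # 1350 134
```

Using a different seed per run would give a different pool for each experimental run while keeping the sizes fixed.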

 

 

 

 

 

 

 

 

5. Data Set 6

Dataset: Pima
% Minority Examples: 34.7
Dataset Size: 768
Feature / Class Situation: 8 attributes (numerical); 2 classes
Class Used: (268 positive)
Unlabeled Data Size in Each Experimental Run: 650

 

 

 

 

 

 

6. Data Set 7

Dataset: Bupa
% Minority Examples: 42.0
Dataset Size: 345
Feature / Class Situation: 6 attributes (numerical); 2 classes
Class Used: (145 positive)
Unlabeled Data Size in Each Experimental Run: 240

 

 

 

 

 

 

 

 

7. Data Set 8

Dataset: Pendigits-Subset
% Minority Examples: 50.0
Dataset Size: 1438
Feature / Class Situation: 16 numeric (integer) features; 17 classes
Class Used: Digit “3” against digit “9” (719 positive)
Unlabeled Data Size in Each Experimental Run: 1300