From the classification performance on the 7 selected UCI data sets with different class ratios, it appears that performance is not mainly determined by the imbalance ratio. To examine this point, we investigate the distribution of each set: we use SVD to reduce the feature dimensionality and visualize the 7 sets on their first 3 principal components.
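The projection step can be sketched as follows. This is a minimal illustration, not the procedure actually used for the figures: it assumes the features are mean-centered before the SVD (the text does not say whether they were), and the function name `project_top3` and the synthetic demo data are hypothetical.

```python
import numpy as np

def project_top3(X, n_components=3):
    """Project the rows of X onto the leading right singular vectors.

    Assumption: features are mean-centered first, so the singular
    vectors coincide with the principal components of X.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = min(n_components, Vt.shape[0])           # cannot exceed the rank bound
    coords = Xc @ Vt[:k].T                       # coordinates in the reduced space
    return coords, s[:k]

# Synthetic stand-in for a 16-feature data set (not real UCI data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 16))
y_demo = (rng.random(200) < 0.1).astype(int)     # hypothetical minority labels

coords, sing = project_top3(X_demo)
minority_coords = coords[y_demo == 1]            # points to draw in one color
majority_coords = coords[y_demo == 0]            # points to draw in another
```

The two coordinate arrays can then be passed to any 3D scatter plot to inspect how strongly the minority and majority classes overlap.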
Based on the following figures, together with the classification performance summary, we maintain our conclusion that:
| Dataset | % Minority Examples | Dataset Size | Features / Classes | Classes Used | Unlabeled Data Size in Each Experimental Run |
|---|---|---|---|---|---|
| Letter-a | 3.9 | 20000 | 16 numeric (integer) features, 26 classes | Letter “A” against all other letters | 2000 |
| Pendigits | 8.3 | 7494 | 16 attributes (all input attributes are integers 0..100), 10 classes | Digit “0” against all other digits | 2000 |
| Letter-a-subset | 17.0 | 4639 | 16 numeric (integer) features, 26 classes | Letter “A” against letters “B”, “C”, “D”, “E”, “F” | 2000 |
| Yeast | 28.9 | 1484 | 8 attributes (numerical), 10 classes | “NUC” against all the other localizations (429 positive) | 1350 |
| Pima | 34.7 | 768 | 8 attributes (numerical), 2 classes | (268 positive) | 650 |
| Bupa | 42.0 | 345 | 6 attributes (numerical), 2 classes | (145 positive) | 240 |
| Pendigits-Subset | 50.0 | 1438 | 16 attributes (all input attributes are integers 0..100), 10 classes | Digit “3” against digit “9” (719 positive) | 1300 |
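As a quick consistency check, the minority percentage can be recomputed from the positive counts and data set sizes listed for the four sets that report both. The sketch below uses only those listed numbers; the helper name `minority_percent` is hypothetical. Note that the Pima counts give about 34.9% rather than the listed 34.7%, while the other three match.

```python
# (positive examples, data set size) as listed in the table
datasets = {
    "Yeast": (429, 1484),
    "Pima": (268, 768),
    "Bupa": (145, 345),
    "Pendigits-Subset": (719, 1438),
}

def minority_percent(positives, total):
    """Share of minority (positive) examples, in percent, one decimal."""
    return round(100.0 * positives / total, 1)

percentages = {name: minority_percent(p, n) for name, (p, n) in datasets.items()}

for name, pct in percentages.items():
    print(f"{name}: {pct}% minority examples")
```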
