IR-Lab Project of Yanjun Qi


 

Data Sets

 

            We downloaded our data from the UCI data repository; 7 of the data sets are from the UCI Machine Learning Repository. The data sets are listed in the following table and differ in their degree of class imbalance. For each data set, several labeled set sizes are tested: {5, 10, 20, 30, 40, 60, 80, 100}.
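As a rough illustration of how one experimental run could be drawn for each labeled set size, the sketch below picks a labeled set and a disjoint unlabeled pool by uniform random sampling. The sampling scheme, the function name `split_run` and the seeds are assumptions made for this example; the page does not state how the sets were actually drawn. The figures 345 and 240 are the Bupa data set size and unlabeled size from Table 1.

```python
import random

# Labeled set sizes listed in the text above.
LABELED_SIZES = [5, 10, 20, 30, 40, 60, 80, 100]


def split_run(n_examples, n_labeled, n_unlabeled, seed=0):
    """Return (labeled_idx, unlabeled_idx) as two disjoint lists of indices.

    Hypothetical helper: uniform sampling without replacement is an
    assumption, not something the source specifies.
    """
    if n_labeled + n_unlabeled > n_examples:
        raise ValueError("not enough examples for this split")
    rng = random.Random(seed)
    picked = rng.sample(range(n_examples), n_labeled + n_unlabeled)
    return picked[:n_labeled], picked[n_labeled:]


# Example: Bupa has 345 examples and 240 unlabeled examples per run.
splits = {size: split_run(345, size, 240, seed=size) for size in LABELED_SIZES}
```

With very small labeled sets and a strongly imbalanced data set, a uniform draw may contain no minority examples at all, so a real experiment would probably need to handle that case explicitly.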

 

Table 1. Data description

| # | Dataset | % Minority Examples | Dataset Size | Feature / Class Situation | Class Used | Unlabeled Data Size in Each Experimental Run |
|---|---------|---------------------|--------------|---------------------------|------------|----------------------------------------------|
| 1 | Letter-a | 3.9 | 20000 | 16 numeric (integer) features; 26 classes | Letter “A” against all other letters | 2000 |
| 2 | Pendigits | 8.3 | 7494 | 16 attributes (all input attributes are integers 0..100); 10 classes | Digit “0” against all other digits | 2000 |
| 3 | Letter-a-subset | 17.0 | 4639 | 16 numeric (integer) features; 26 classes | Letter “A” against letters “BCDEF” | 2000 |
| 5 | Yeast | 28.9 | 1484 | 8 attributes (numerical); 10 classes | “NUC” against all the other localizations (429 positive) | 1350 |
| 6 | Pima | 34.7 | 768 | 8 attributes (numerical); 2 classes | (268 positive) | 650 |
| 7 | Bupa | 42.0 | 345 | 6 attributes (numerical); 2 classes | (145 positive) | 240 |
| 8 | Pendigits-Subset | 50.0 | 1438 | 16 numeric (integer) features; 10 classes | Digit “3” against digit “9” (719 positive) | 1300 |
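
The "class used" entries in the table turn a multi-class data set into a two-class task, either one class against all others or one class against a chosen subset. A minimal sketch of that relabeling is given below; the function names `one_vs_rest` and `minority_percent` and the toy labels are hypothetical and are not taken from the project's code or data.

```python
def one_vs_rest(labels, positive, negatives=None):
    """Return (kept_indices, binary_labels) for a one-against-others task.

    positive  -- the class treated as the positive class (label 1)
    negatives -- optional collection of classes kept as negatives (label 0);
                 None means "all other classes". Examples whose class is
                 in neither group are dropped.
    """
    kept, binary = [], []
    for i, label in enumerate(labels):
        if label == positive:
            kept.append(i)
            binary.append(1)
        elif negatives is None or label in negatives:
            kept.append(i)
            binary.append(0)
    return kept, binary


def minority_percent(binary):
    """Percentage of examples that belong to the smaller of the two classes."""
    positives = sum(binary)
    minority = min(positives, len(binary) - positives)
    return 100.0 * minority / len(binary)


# Toy labels for illustration only (not drawn from any of the data sets).
toy = list("AABCDEFGAZ")
_, all_rest = one_vs_rest(toy, "A")                          # "A" against all others
idx, subset = one_vs_rest(toy, "A", negatives=set("BCDEF"))  # "A" against "BCDEF"
```

Applying the same ratio to the Yeast counts in the table (429 positives out of 1484 examples) gives about 28.9%.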

 

 

 

 


 

Summary of each data set above:

 

1.    Letter Recognition Database

  • From David Slate
  • Based on various fonts
  • 20,000 instances (712565 bytes) (.Z available)
  • 17 attributes: 1 class (letter category) and 16 numeric (integer)
  • No missing attribute values
  • Ftp Access
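
A record with 1 class attribute and 16 integer attributes could be read as sketched below. The comma-separated layout with the class letter in the first field is an assumption about the downloaded file, and `parse_record` is a hypothetical helper; the sample record is made up for illustration.

```python
def parse_record(line, n_features=16):
    """Split one comma-separated record into (label, features).

    Assumes the class label is the first field and the remaining fields
    are integer features; adjust if the file is laid out differently.
    """
    fields = line.strip().split(",")
    if len(fields) != n_features + 1:
        raise ValueError(f"expected {n_features + 1} fields, got {len(fields)}")
    label = fields[0]
    features = [int(value) for value in fields[1:]]
    return label, features


# Made-up record in the assumed layout (not copied from the data file).
label, features = parse_record("A,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,0\n")
```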

 

2.    Pen-Based Recognition of Handwritten Digits

  • From E. Alpaydin, Fevzi Alimoglu
  • 10 classes
  • 7494 training cases, 3498 test cases
  • 16 attributes (All input attributes are integers 0..100)
  • Ftp Access

 

 

3.    Yeast Database

  • Donated by Paul Horton (see also: Ecoli database)
  • Predicting the Cellular Localization Sites of Proteins
  • Documentation: On everything
  • 1484 instances, 8 attributes (one nominal)
  • No missing attribute values
  • Ftp Access

 

 

 

 

4.    Pima Indians Diabetes Database

  • From National Institute of Diabetes and Digestive and Kidney Diseases
  • Binary classes (tested positive or negative for diabetes)
  • All 8 attributes are numeric-valued
  • 768 instances
  • Includes cost data (donated by Peter Turney)
  • Ftp Access

 

 

 

 

 

5.    Liver-disorders Database

  • BUPA Medical Research Ltd. database donated by Richard S. Forsyth
  • 7 numeric-valued attributes
  • 345 instances (male patients)
  • Includes cost data (donated by Peter Turney)
  • Ftp Access

 

 

 


 

Reference: UCI data repository

    • http://kdd.ics.uci.edu/
    • KDD Archive: Hettich, S. and Bay, S. D. (1999). The UCI KDD Archive [http://kdd.ics.uci.edu]. Irvine, CA: University of California, Department of Information and Computer Science.
    • Machine Learning Archive: Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.