Guideline:
Abstract Data Analysis Reference
Each project involves three steps: 1. an overview of the literature; 2. experiments on simulated data; 3. experiments on real application data.
| Task | Done By | Status |
|---|---|---|
| Project Proposal | Sep. 4th | Completed |
| Proposal Presentation | Sep. 25th | Completed |
| | Sep. 30th | Completed |
| Generate synthetic data for different conditions and study the behaviors | Oct. 4th | Completed |
| Mid-course presentation for projects | Oct. 10th | Completed |

Statistics: The dataset consists of 1356 sequences in 13 classes; each sequence has exactly one label. The class-sequence distribution is shown below. From these statistics, we can see that this is a multi-class classification problem with unbalanced data.

| Family Name | Num of Sequences |
|---|---|
| Class A | 1081 |
| Class B | 83 |
| Class C | 28 |
| Class D | 11 |
| Class E | 4 |
| Class F | 45 |
| OrphanA | 35 |
| OrphanB | 2 |
| Drosophila_Odorant_Receptors | 31 |
| Bacterial_Rhodopsin | 23 |
| Nematode_Chemoreceptors | 1 |
| Plant_Mlo_Receptors | 10 |
| Ocular_Albinism_Proteins | 2 |

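
To make the degree of imbalance concrete, the following is a minimal sketch that recomputes the totals from the class counts listed above and derives inverse-frequency class weights. The weighting formula `total / (n_classes * count)` is an assumption for illustration (it matches the common "balanced" heuristic), not a method stated in this project; the variable names are hypothetical.

```python
# Sequence counts per family, copied from the class-sequence distribution.
class_counts = {
    "Class A": 1081,
    "Class B": 83,
    "Class C": 28,
    "Class D": 11,
    "Class E": 4,
    "Class F": 45,
    "OrphanA": 35,
    "OrphanB": 2,
    "Drosophila_Odorant_Receptors": 31,
    "Bacterial_Rhodopsin": 23,
    "Nematode_Chemoreceptors": 1,
    "Plant_Mlo_Receptors": 10,
    "Ocular_Albinism_Proteins": 2,
}

total = sum(class_counts.values())
n_classes = len(class_counts)

# Largest class and its share of all sequences.
majority = max(class_counts, key=class_counts.get)
majority_share = class_counts[majority] / total

# Ratio between the largest and the smallest class.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

# Assumed "balanced" heuristic: rare classes receive larger weights.
class_weights = {
    name: total / (n_classes * count) for name, count in class_counts.items()
}

print(f"{total} sequences in {n_classes} classes")
print(f"majority class: {majority} ({majority_share:.1%})")
print(f"largest/smallest class ratio: {imbalance_ratio:.0f}")
for name, weight in sorted(class_weights.items(), key=lambda kv: kv[1]):
    print(f"{name:30s} weight = {weight:.3f}")
```

With weights like these, a classifier that always predicts the majority class no longer looks deceptively accurate, which is one reason plain accuracy is a weak metric for this kind of data.
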
Statistics: The dataset consists of 7769 training documents and 3019 test documents, mapped to 90 categories. Each document has at least one label. The average category-to-document ratio is 1.27 for the training set. The category-document distribution is shown below:

Since each document can have multiple labels, the table below summarizes the distribution of labels per document. From this analysis, we can see that this is a multi-label classification problem with unbalanced data.

| Labels per document | Training data |
|---|---|
| 1 | 6577 |
| 2 | 865 |
| 3 | 192 |
| 4 | 59 |
| 5 | 37 |
| 6 | 22 |
| 7 | 5 |
| 8 | 5 |
| 9 | 3 |
| 10 | 2 |
| 11 | 1 |
| 12 | 0 |
| 13 | 0 |
| 14 | 0 |
| 15 | 1 |

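
The labels-per-document distribution can be summarized with a few lines of arithmetic. The sketch below is illustrative only: it recomputes the number of training documents, the total number of label assignments, and the share of multi-label documents from the counts listed above. The variable names are hypothetical, and the mean computed this way need not match the ratio of 1.27 quoted earlier if that figure was derived under a different definition.

```python
# Number of training documents with exactly k labels, copied from the table.
docs_by_label_count = {
    1: 6577, 2: 865, 3: 192, 4: 59, 5: 37,
    6: 22, 7: 5, 8: 5, 9: 3, 10: 2,
    11: 1, 12: 0, 13: 0, 14: 0, 15: 1,
}

# Total number of training documents.
total_docs = sum(docs_by_label_count.values())

# Total number of (document, label) assignments.
total_assignments = sum(k * n for k, n in docs_by_label_count.items())

# Mean number of labels per document.
mean_labels_per_doc = total_assignments / total_docs

# Share of documents carrying more than one label.
multi_label_share = 1 - docs_by_label_count[1] / total_docs

print(f"training documents: {total_docs}")
print(f"label assignments:  {total_assignments}")
print(f"mean labels per document: {mean_labels_per_doc:.4f}")
print(f"share of multi-label documents: {multi_label_share:.1%}")
```
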
Useful Links: