User-centric, Adaptive and Collaborative Information Filtering

(An NSF-funded Collaborative Project # III-COR 0704628 & 0704689)

Quarterly Report: 2nd Year, 2nd Quarter

Analyzing the citation-graph of academic research articles in Computer Science

Summary: We analyzed the citation graph of academic articles in the Citeulike dataset using PageRank, a popular link-analysis algorithm used to estimate the relative importance of web pages. This analysis identified the most well-received papers in Computer Science overall. We also found that the citation structure of academic articles closely follows a power-law distribution, as the link structure of web pages does: a few well-received papers receive a large share of all citations, while most papers are cited rarely. To discover papers that are well received within a particular topic, we classified the papers into sub-domains of Computer Science such as Theoretical Computer Science, Machine Learning/AI, and Web/Information Retrieval. Because a paper can belong to more than one sub-domain, we used linear Support Vector Machines with the SCut thresholding scheme to produce a multi-label classification. We built the training set automatically by extracting class labels for academic articles from user-submitted tags on the Citeulike website and from the Citeseer classification hierarchy. We then ran the Topic-Sensitive PageRank algorithm on the categorized dataset to identify authoritative papers in each sub-domain of Computer Science. All software, including the link-analysis algorithms, the crawler, and the extractors, was implemented in C++. Data structures such as inverted indices were used to speed up PageRank computation on large citation graphs.
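As an illustration of the link-analysis step, the following is a minimal C++ sketch of PageRank by power iteration over a citation graph. It is not the project's code, which this report does not reproduce. The names (Graph, pagerank, uniform_teleport, topic_teleport), the damping factor of 0.85, the convergence tolerance, and the treatment of papers that cite nothing are all assumptions made for the example. A uniform teleport vector gives ordinary PageRank; concentrating the teleport vector on the papers of one sub-domain gives a topic-biased ranking in the spirit of Topic-Sensitive PageRank. The sketch uses plain adjacency lists rather than the inverted indices mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical representation: out_links[i] lists the papers cited by paper i.
using Graph = std::vector<std::vector<std::size_t>>;

// Teleport distribution for ordinary PageRank: every paper equally likely.
std::vector<double> uniform_teleport(std::size_t n) {
    return std::vector<double>(n, 1.0 / static_cast<double>(n));
}

// Teleport distribution biased toward one topic: mass is spread evenly over
// the papers labeled with that topic and is zero everywhere else.
std::vector<double> topic_teleport(std::size_t n,
                                   const std::vector<std::size_t>& topic_papers) {
    assert(!topic_papers.empty());
    std::vector<double> t(n, 0.0);
    for (std::size_t p : topic_papers) {
        t[p] += 1.0 / static_cast<double>(topic_papers.size());
    }
    return t;
}

// Power iteration. `teleport` must have one entry per paper and sum to 1.
// Assumption: rank held by papers with no outgoing citations is redistributed
// according to the teleport distribution, so the ranks always sum to 1.
std::vector<double> pagerank(const Graph& out_links,
                             const std::vector<double>& teleport,
                             double damping = 0.85,
                             int max_iters = 200,
                             double tol = 1e-12) {
    const std::size_t n = out_links.size();
    assert(teleport.size() == n);
    std::vector<double> rank(teleport);
    std::vector<double> next(n, 0.0);
    for (int iter = 0; iter < max_iters; ++iter) {
        std::fill(next.begin(), next.end(), 0.0);
        double dangling = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            if (out_links[i].empty()) {
                dangling += rank[i];
                continue;
            }
            // Each paper splits its rank evenly among the papers it cites.
            const double share = rank[i] / static_cast<double>(out_links[i].size());
            for (std::size_t j : out_links[i]) {
                next[j] += share;
            }
        }
        double delta = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            next[i] = damping * (next[i] + dangling * teleport[i]) +
                      (1.0 - damping) * teleport[i];
            delta += std::fabs(next[i] - rank[i]);
        }
        rank.swap(next);
        if (delta < tol) {
            break;  // ranks changed by less than the tolerance
        }
    }
    return rank;
}
```

For example, calling pagerank(g, uniform_teleport(g.size())) ranks all papers in a graph g, while pagerank(g, topic_teleport(g.size(), ml_papers)) ranks them relative to a hypothetical list ml_papers of papers classified as Machine Learning/AI.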

These experiments are the first step toward user-centric, adaptive, and collaborative filtering. Our users, and the research community in general, will benefit from search results that are both personalized and authoritative. In the next step, we will focus on personalizing the experiments described above.