Analyzing the citation graph
of academic research articles in Computer Science
Summary: We
analyzed the citation graph of academic articles in the Citeulike dataset using
PageRank, a popular link-analysis approach
used to analyze the relative importance of web pages. In the process, we identified
the most well-received papers in Computer Science overall. We also found
that the citation structure of academic articles follows a power-law
distribution similar to that of web pages in general. Intuitively, a few well-received
papers receive a large share of all citations. To discover papers that are
well received within a topic, we classified the papers into sub-domains of Computer
Science such as Theoretical Computer Science, Machine Learning/AI, and Web/Information
Retrieval. We used linear Support Vector Machines with the SCut thresholding
scheme to produce this multi-label classification. We created the training set
for classification by automatically extracting class information of various
academic articles from user-submitted tags on the Citeulike website, and from
the Citeseer classification hierarchy. We analyzed
the categorized dataset using the popular Topic-Sensitive PageRank
algorithm and identified authoritative papers in each of the sub-domains of
Computer Science. All software, including the link-analysis algorithms, crawler, and
extractors, was implemented in C++. Special data structures such as inverted
indices were used to speed up the PageRank computations
on large citation graphs.
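
The PageRank and Topic-Sensitive PageRank computations described above can be sketched in C++ roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the names CitationGraph, build_graph, and pagerank are hypothetical, the damping factor of 0.85 is a conventional default rather than a value reported here, and the "inverted index" is assumed to mean a per-paper list of citing papers. A uniform teleport vector gives ordinary PageRank; concentrating the teleport mass on the papers of one sub-domain gives a topic-sensitive ranking.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Citation graph stored as an inverted index (assumed layout):
// in_links[p] lists the papers that cite paper p, and
// out_degree[q] is the number of citations made by paper q.
struct CitationGraph {
    std::vector<std::vector<std::size_t>> in_links;
    std::vector<std::size_t> out_degree;
};

// Build the graph from (citing, cited) pairs over papers 0..n-1.
CitationGraph build_graph(
    std::size_t n,
    const std::vector<std::pair<std::size_t, std::size_t>>& citations) {
    CitationGraph g;
    g.in_links.assign(n, {});
    g.out_degree.assign(n, 0);
    for (const auto& c : citations) {
        assert(c.first < n && c.second < n);
        g.in_links[c.second].push_back(c.first);
        ++g.out_degree[c.first];
    }
    return g;
}

// Power iteration. `teleport` must be a probability distribution over
// papers (non-negative entries summing to 1).
std::vector<double> pagerank(const CitationGraph& g,
                             const std::vector<double>& teleport,
                             double damping = 0.85,  // assumed default
                             int max_iters = 100,
                             double tol = 1e-10) {
    const std::size_t n = g.in_links.size();
    assert(teleport.size() == n);
    std::vector<double> rank(teleport), next(n);
    for (int it = 0; it < max_iters; ++it) {
        // Rank held by papers that cite nothing is redistributed
        // according to the teleport distribution.
        double dangling = 0.0;
        for (std::size_t q = 0; q < n; ++q)
            if (g.out_degree[q] == 0) dangling += rank[q];

        double delta = 0.0;
        for (std::size_t p = 0; p < n; ++p) {
            double incoming = 0.0;
            for (std::size_t q : g.in_links[p])
                incoming += rank[q] / static_cast<double>(g.out_degree[q]);
            next[p] = (1.0 - damping) * teleport[p] +
                      damping * (incoming + dangling * teleport[p]);
            delta += std::fabs(next[p] - rank[p]);
        }
        rank.swap(next);
        if (delta < tol) break;  // L1 change below tolerance: converged
    }
    return rank;
}
```

Storing incoming citations per paper lets each iteration compute a paper's new score in a single pass over its citing papers. To rank papers within one sub-domain, one would call pagerank with a teleport vector that is zero everywhere except on the papers the classifier assigned to that sub-domain.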
These experiments are a first step toward user-centric adaptive and collaborative filtering. Our users (and the research community in general) will benefit from search results that are both personalized and authoritative. In the next step, we will focus on personalizing the experiments described above.