Citeulike
A new dataset for evaluating User-centric Adaptive and Collaborative Filtering
Summary: Our project is a novel combination of Collaborative Filtering (CF) and Adaptive Filtering (AF) approaches. An ideal benchmark dataset for evaluating such a system should lend itself to evaluating both approaches. We identified that such a dataset should have the following characteristics: documents with textual content, availability of user ratings, public availability, educational importance, time-based ordering of documents, focused tasks, and content stability. After comparing numerous candidate datasets, including the Netflix dataset, the del.icio.us dataset, RCV1, TDT4, and the TREC datasets, we decided to create a custom version of the Citeulike dataset, as it satisfied most of the required criteria. The objective of our approaches is to recommend academic papers to users (researchers) based on their personal preferences. The dataset was created in three phases.
In the first phase, the team at CMU analyzed the publicly available Citeulike dataset and identified several issues with using it directly, such as spam, robot-users (software programs that automatically submit links/articles), and the unavailability of user query logs. We also analyzed the distribution of users, articles, and article tags to ensure that we would have a sufficient number of users who have rated at least a minimum number of articles, and of articles that have been rated by at least a minimum number of users. For this phase, analytical and text-processing tools were developed in Perl.
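The minimum-count requirement described above can be illustrated with a small sketch. This is not the project's Perl tooling; it is a minimal Python illustration that assumes the postings are available as (user, article) pairs. The function name filter_min_counts and both threshold parameters are hypothetical, and the iterative pruning strategy is an assumption about how such a check might be carried out.

```python
from collections import Counter


def filter_min_counts(postings, min_articles_per_user, min_users_per_article):
    """Keep only (user, article) pairs whose user and article both meet
    the minimum counts. Hypothetical helper, not the original Perl tool.

    Removing a sparse user can push an article below its threshold (and
    vice versa), so the filter repeats until nothing changes.
    """
    pairs = set(postings)
    while True:
        user_counts = Counter(user for user, _ in pairs)
        article_counts = Counter(article for _, article in pairs)
        kept = {
            (user, article)
            for user, article in pairs
            if user_counts[user] >= min_articles_per_user
            and article_counts[article] >= min_users_per_article
        }
        if kept == pairs:
            return kept
        pairs = kept


# Toy data (invented for illustration): u3 posted only one article.
toy_postings = [
    ("u1", "a1"), ("u1", "a2"),
    ("u2", "a1"), ("u2", "a2"),
    ("u3", "a1"),
]
dense = filter_min_counts(toy_postings, 2, 2)
```

With the toy data, the single posting by u3 is dropped and the four remaining pairs all satisfy both thresholds. The thresholds actually used for the dataset are not stated here.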
In the second phase, the team at the University of Pittsburgh designed user studies and invited volunteers to provide queries and corresponding relevance judgments for the articles in the Citeulike dataset. The volunteers were graduate and PhD students from various disciplines of Computer Science and Information Systems.
In the third phase, the team at CMU supplemented the dataset with citation information for each paper from the Citeseer Open Archives Project and constructed the complete citation graph of academic articles in Computer Science. We implemented a crawler and citation extractor in C++ for this purpose. The team at CMU then analyzed the quality of the queries and relevance judgments from the second phase using Indri, a popular open-source academic retrieval engine. The evaluation software was written in Perl.
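As a rough illustration of what a citation graph looks like as a data structure, the following Python sketch builds one from (citing, cited) pairs. It is not the C++ crawler or extractor described above; the input format, the function name build_citation_graph, and the choice to skip self-citations are all assumptions made for the example.

```python
from collections import defaultdict


def build_citation_graph(citations):
    """Build forward and reverse adjacency sets from (citing, cited) pairs.

    Hypothetical helper: the pair format and the decision to ignore
    self-citations are assumptions, not details of the original tools.
    """
    cites = defaultdict(set)      # paper -> papers it cites
    cited_by = defaultdict(set)   # paper -> papers that cite it
    for citing, cited in citations:
        if citing == cited:
            continue  # ignore self-loops in this sketch
        cites[citing].add(cited)
        cited_by[cited].add(citing)
    return cites, cited_by


# Toy edges (invented for illustration).
toy_edges = [("p1", "p2"), ("p1", "p3"), ("p2", "p3"), ("p3", "p3")]
cites, cited_by = build_citation_graph(toy_edges)
in_degree_p3 = len(cited_by["p3"])
```

Keeping both directions makes it cheap to ask either "what does this paper cite?" or "who cites this paper?", which is one common way to store such a graph.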