
11-443/11-643: Scalable Analytics

Description: 
This is a full-semester, lecture-oriented course (12 units), intended for students in professional master's programs and for undergraduates who meet the prerequisites. Replacing the second half of 11-641/11-441, Search Engines and Web Mining, this new course offers a blend of core theory, implementation and application of scalable data-analytic techniques. Specifically, it covers high-dimensional data representation, dimensionality reduction, clustering, collaborative filtering, large-scale classification, learning to rank, link analysis, temporal information distillation, and statistical significance tests. Six homework assignments give students hands-on experience through implementing representative algorithms, conducting empirical evaluations, and exercising the main concepts taught in the course. 
Prerequisites: 
· 15-213, Introduction to Computer Systems (required) 
· 21-241, Matrix Algebra, or 21-341, Linear Algebra (required) 
· 21-325, Probability (required) 
· 15-451, Algorithm Design and Analysis (not required but helpful) 
· 10-601 or 10-701, Machine Learning (not required but helpful) 
For CMU CS undergraduates, all of the required courses need to be completed before or during the junior year; for MS students, equivalent background is required. 
Time & Location: 
MW, 3:00–4:20pm, GHC 4211 
Instructor(s): 

Teaching Assistants: 
Andrew Hsi (andrew.hsi90[at]gmail.com) 
Instructional Materials: 
· Primary: Introduction to Information Retrieval (IR), Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Cambridge University Press, 2008 
· Reference: Pattern Recognition and Machine Learning (ML), Christopher M. Bishop, Springer, 2006 
The textbooks can be purchased at the CMU book store. There are selected additional readings, which will be made available online. Online access to some materials is restricted to the .cmu.edu domain. CMU people can get access from outside .cmu.edu (e.g., from home) using CMU's WebVPN Service. 
Homework: 
· HW0 (0%): A problem-solving set for self-assessment by the students, and for the instructor to check their background with respect to the prerequisite skills for this course. The answers will be used for reference only, not for grading. 
· HW1 (10%): A modified version of the HW0 problem-solving set, to strengthen the students' background in related matrix algebra, calculus and probability. 
· HW2 (10%): A programming assignment on bipartite clustering: using k-means to discover the latent clusters of words and the latent clusters of documents in a mutually reinforcing manner, and evaluating the results using both purity-based metrics and human-annotated clusters. 
· HW3 (10%): A programming assignment on collaborative filtering: implementing memory-based and model-based methods to predict the ratings of movies, and evaluating them on a subset of the Netflix Prize dataset. This assignment has been used in 11-441/11-641 and 11-741. 
· HW4 (10%): A programming assignment on statistical classification: implementing regularized logistic regression (RLR) with gradient ascent, and testing it in comparison with Support Vector Machines (existing software available) on a subset of the RCV1 benchmark corpus of news stories. This assignment has been used in 11-441/11-641. 
· HW5 (10%): A programming assignment on learning to rank (LETOR): implementing regularized logistic regression (RLR) with gradient ascent, and experimenting with the adapted RLR and SVM for LETOR on a dataset from MSRA. This assignment has been used in 11-441/11-641. 
· HW6 (10%): A programming assignment for hands-on exercise with statistical significance tests, including the sign test, t-test, proportion test, signed-rank test and rank-sum test (and the permutation test if the workload is not already too heavy), comparing classifiers on the RCV1 subset from HW4. 
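Regularized logistic regression trained by gradient ascent is the common core of HW4 and HW5. As a rough illustration of that technique only (this is not the assignment's required interface; the function names, hyperparameters, and toy data below are made up for the sketch), the update maximizes the L2-penalized log-likelihood by stepping in the direction of its gradient:

```python
import numpy as np

def train_rlr(X, y, lam=0.1, lr=0.1, n_iters=500):
    """L2-regularized logistic regression via gradient ascent.

    Maximizes sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)] - (lam/2) ||w||^2,
    where p_i = sigmoid(w . x_i) and labels y are 0/1.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities
        grad = X.T @ (y - p) - lam * w       # gradient of penalized log-likelihood
        w += lr * grad / len(y)              # ascend (not descend) the gradient
    return w

def predict(w, X):
    """Label 1 when the linear score is positive, else 0."""
    return (X @ w > 0).astype(int)

# Tiny illustrative run: first column is a bias term, second is a feature
# that separates the classes.
X = np.array([[1.0, -2.0], [1.0, -1.5], [1.0, 1.5], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
w = train_rlr(X, y)
```

In the actual assignments the same update would be applied to sparse, high-dimensional document vectors (e.g., RCV1 features), where the regularization term `lam` keeps the weights from overfitting rare terms.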
Exams: 
Midterm exam (15%) will be closed-book. Final exam (25%) will be open-book, in the form of a Capstone Project Proposal (CPP) with a classroom presentation (CPP guidelines) and a peer-review evaluation (CPP Evaluation Form). Candidate topics include, but are not limited to, the following: 
a) Large-scale Hierarchical Text Categorization Challenges (2010 – 2014) http://lib.iit.demokritos.gr/ (LSHTC 2010) 
b) Scalable Classification with Feature Selection, Feature Hashing and Multi-labeled Compressed Sensing (Yang & Pedersen, ICML 1997; Shi et al., JMLR 2010; Hsu et al., NIPS 2009) 
c) Large-scale Collaborative Filtering (Yu et al., SIGIR 2009; Linden et al., IEEE 2003) 
d) Personalized Active Learning for Collaborative Filtering (Harpale & Yang, SIGIR 2008; Jin & Si, UAI 2004) 
e) Semi-supervised Clustering and Metric Learning (Bilenko et al., ICML 2004; Xing et al., NIPS 2002) 
f) Novelty Detection & Distillation over Temporal Streams (Allan et al., TDT Workshop 2001; Yang et al., SIGIR 2007) 
g) Learning/Optimization for Computational Advertisement 
Grading: 
60% homework (6 programming assignments), 15% midterm, 25% final. 
Course policies: 

Syllabus (Tentative): 
Slides will be made available as the lectures proceed. 
HW0 out
HW1 out (Problem Solving Set)
1/20, No class (Martin Luther King Day)
HW0 Due
HW1 Due
HW2 out (bipartite clustering on news stories)
HW2 Due; HW3 out
HW3 Due; HW4 out (SVM and LR)
3/5, Midterm exam
3/10 and 3/12, Spring Break
HW4 Due; HW5 out (LETOR); CPP Initial Ideas Due
HW5 Due; HW6 out (significance tests)
HW6 Due
CPP First Version Due

