Description: Description: Description: Description: Description: Description: Description: Description: Description: Description: Description: Description: Description: Description: Description: Description: Description: CMU logo

11-443/11-643: Scalable Analytics
Spring 2014



This is a full-semester lecture-oriented course (12 units), intended for students in professional master programs and undergraduates who meet the pre-requisites.  Replacing the 2nd half of 11-641/11-441, Search Engines and Web Mining, this new course offers a blend of core theory, implementation and application of scalable data analytic techniques.  Specifically, it covers high-dimensional data representation, dimensionality reduction, clustering, collaborative filtering, large scale classification, learning to rank, link analysis, temporal information distillation, and statistical significance tests. Homework assignments (6) give hands-on experiences to students by implementing representative algorithms, conducting empirical evaluations, and exercising the main concepts taught in the course.



·         15-213, Introduction to Computer Systems (required)

·         21-241, Matrix Algebra or 21-341, Linear Algebra (required)

·         21-325, Probability (required)

·         15-451, Algorithm Design and Analysis (not required but helpful)

·         10-601 or 10-701, Machine Learning (not required but helpful)


For CMU CS undergraduates, all of the required courses need to be completed before or during the junior year; for MS students, equivalent background is required.

Time & Location:

MW, 3:00 - 4:20pm, GHC 4211


Yiming Yang

Teaching Assistants:

Andrew Hsi (andrew.hsi90[at]

Instructional Materials:

·         Primary: Introduction to Information Retrieval (IR), Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze, Cambridge University Press. 2008

·         Reference: Pattern Recognition and Machine Learning (ML), Christopher M. Bishop, Springer 2006

The textbooks can be purchased at the CMU book store. There are selected additional readings, which will be made available online.  Online access to some materials is restricted to the domain. CMU people can get access from outside (e.g., from home) using CMU's WebVPN Service.


·         HW0 (0%): A problem-solving set for self-assessment by students as well as for checking related background by the instructor with respect to the pre-requisite skills for this course.  The answers will be used for reference only, not for grading the students.

·         HW1 (10%):  A modified version of the HW0 problem-solving set, to improve the students’ background in related matrix algebra, calculus and probabilities.

·         HW2 (10%):  A programming assignment for bipartite clustering, especially using k-means to discover the latent clusters of words and the latent clusters of documents in a mutually reinforcing manner,  Evaluate the results using both purity-based metrics and human annotated clusters.

·         HW3 (10%): A programming assignment for collaborative filtering, especially implementing memory-based method and model-based methods to predict the ratings of movies, and evaluating them on a subset of the Netflix Prize dataset. This assignment has been used in 11-441 / 11-641 and 11-741.

HW4 (10%): A programming assignment for statistical classification, especially implementing regularized logistic regression (RLR) with gradient accent, testing it in comparison with  Support Vector Machines (existing software available) on a subset of the RCV1 benchmark corpus of news stories. This assignment has been used in 11-441 / 11-641.

·         HW5 (10%): A programming assignment for learning to rank (LETOR), including the implementation of regularized logistic regression (RLR) with gradient accent, the experiments of the adapted RLR and SVM for LETOR, on a dataset from MSRA.

This assignment has been used in 11-441 / 11-641.

·         HW6 (10%):  A programming assignment for hand-on exercise with statistical significance tests, including sign test, t-test, proportion test, signed-rank test and rank-sum (and permutation test if not too heavy already),  and compare classifiers on the RCV1 subset in HW3.


Midterm exam (15%) will be closed-book.  Final exam (25%) will be open-book, in the form of Capstone Project Proposal (CPP) with a class-room presentation (CPP guidelines) and a peer-review evaluation (CPP Evaluation Form).  Candidate topics include but not limited to the follows:

a)      Large-scale Hierarchical Text Categorization Challenges (2010 – 2014),  (LSHTC 2010)

b)      Scalable Classification with Feature Selection, Feature Hashing and Multi-labeled Compressed Sensing (Yang & Pedersen, ICML 1997 ; Shi et al, JMLR 2010 ; Hsu et al, NIPS 2009 )

c)      Large-scale Collaborative Filtering (Yu et al . SIGIR 2009 ; Linden et al., IEEE 2003)

d)     Personalized Active Learning for Collaborative Filtering (Harpale & Yang, SIGIR 2008) (Jin & Si, UAI 2004)

e)      Simi-supervised Clustering and Metric Learning (Bilenko et al.,  ICML 2004) (Xing et al., NIPS 2002)

f)       Novelty Detection & Distillation over Temporal Streams (Allan et al., TDT Workshop 2001) (Yang et al., SIGIR 2007 )

g)      Learning/Optimization for Computational Advertisement



60% homework (6 programming assignments), 15% midterm, 25% final.

Course policies:

Late homework , Cheating , Laptops

Syllabus (Tentative):

Slides will be available at the time when the lectures proceed.


  1. 1/13, Course overview and introduction (slides)

             HW0 out

  1. 1/15, High-dimensional data & scalable indexing (slides) (IR: Ch 1.1, 1.2, 6.3) 

             HW1 out: Problem Solving Set

1/20, No class (Martin Luther King Day)

             HW0 Due

  1. 1/22, Planning Capstone Project Proposal (CPP) (CPP guidelines; CPP candidate topics)
  2. 1/27, Clustering: flat clustering  (slides) (IR: Ch 16)

             HW1 Due

  1. 1/29, Clustering: hierarchical, and bipartite reinforcement   (slides) (IR: Ch 17)

             HW2 out (bipartite clustering on news stories)

  1. 2/3,   Clustering: evaluation  (slides) (IR: Ch 16.3)
  2. 2/5,   Clustering: probabilistic mixture models & EM  (slides) (ML: Ch9; IR: Ch 16.5)
  3. 2/10, Collaborative Filtering: memory-based, model-based (slides) (Su & Khoshgoftaar, AAI 2009 )
  4. 2/12, Collaborative Filtering: dimension reduced (slides)

             HW2 Due; HW3 out

  1. 2/17, Classification: Logistic Regression (gradient, Newton-Raphson, convexity analysis) (ML: Ch4)  (slides)
  2. 2/19, Classification: Logistic Regression (prediction, decision boundaries, regularization, softmax)
  3. 2/24, Classification: Support Vector Machines (loss functions, kernels) (IR: Ch 15) (slides)
  4. 2/26, Classification: Optimization methods for SVM

             HW3 Due; HW4 out (SVM and LR)

  1. 3/3,   LETOR (Learning to Rank): Pairwise Constrained optimization (Rank SVM, logistic regression, etc.) (Joachims, SIGIR 2002) (slides)

3/5,   Midterm exam

3/10, Spring Break

3/12, Spring Break

  1. 3/17, LETOR: Rank-based evaluation metrics, exercise with MSRA data  (slides) (IR: Ch 8)

         HW4 Due;  HW5 out (LETOR); CPP Initial Ideas Due

  1. 3/19, LETOR: List-wise constrained optimization (MAP, SVM-MAP, ListNet) (slides) (Yue, SIGIR 2007)
  2.  3/24, LETOR: Multi-label classification with LETOR (slides) (Yang & Gopal, MLJ 2012)
  3. 3/26, Significance Testing: sign tests, t-tests, proportion tests (Yang & Liu, SIGIR 1999 )
  4. 3/31,  Significance Testing: signed rank, rank sum, permutation tests (slides)

          HW5 Due; HW6 out (sig tests)

  1.  4/2,  Principle Component Analysis and Singular Value Decomposition   (Deerwester, JASIS 1990)  (slides)
  2.  4/7,  Link Analysis: HITS and PageRank (IR: Ch 21) (slides)
  3.  4/9,   Link Analysis: Personalized PageRank, Topic-sensitive PageRank, Query Classification  (Haveliwala, WWW 2002) (slides)
  4. 4/14, Schedule of CPP Presentations and Peer Review Form

             HW6 Due

  1. 4/16, No Class (preparing CPP)

            CPP First Version Due

  1. 4/21, No Class (preparing CPP)
  2. 4/23, No Class (preparing CPP)
  3. 4/25, CPP Presentation & Peer Review
  4. 4/30, CPP Presentation & Peer Review
  5. 5/5, 1-4pm, CPP Presentation & Peer Review


Updated on Feb 17, 2014

Yiming Yang