INSTRUCTOR: Prof. Yiming Yang, yiming [at] cs.cmu.edu
TIME AND LOCATION: TR, 12:00-1:20pm, HH B131
This is a full-semester, lecture-oriented course (12 units) for PhD-level,
MS-level and undergraduate students who meet the prerequisites. It offers a
blend of core theory, algorithms, evaluation methodologies and applications of
scalable data analytic techniques. Specifically, the covered topics include:
· Link Analysis
· Collaborative Filtering
· Social-media Analysis
· Web-scale Text Classification
· Learning to Rank for Information Retrieval
· Deep Learning for Text Analysis
· Matrix factorization (with SVD, nonnegative and probabilistic matrix completion)
· Stochastic gradient descent
· Statistical significance tests
PREREQUISITES:
· CS courses on data structures, algorithms and programming (e.g., 15-213),
  linear algebra (e.g., 21-241 or 21-341) and intro probability (e.g., 21-325)
· Intro Machine Learning (e.g., 10-701 or 10-601) and Algorithm Design and
  Analysis (e.g., 15-451) are not required but helpful
TEXTBOOKS (SOME CHAPTERS):
Introduction to Information Retrieval (IR), Christopher D. Manning, Prabhakar
Raghavan, and Hinrich Schütze, Cambridge University Press, 2008. The textbook
can be purchased at the CMU bookstore.
ONLINE READING AND LECTURE SLIDES: Selected additional readings are available
online, restricted to the .cmu.edu domain. CMU people can get access from
outside .cmu.edu (e.g., from home) using CMU's WebVPN Service. Access to the
slides has the same requirement.
TEACHING ASSISTANTS AND OFFICE HOURS
· Guokun Lai, guokun [at] cs.cmu.edu
  Office Hour: Friday 13:30-14:30 at GHC 5417
· Yuexin Wu, yuexinw [at] cs.cmu.edu
  Office Hour: Thursday 16:00-17:00 at GHC 5417
EXAMS: There are no midterm or final exams. Instead, we have quizzes between
the lectures, and the Capstone Project Proposal (CPP) presentations in the
later part of the course.
COURSE POLICIES: Late Homework, Laptops, Cheating (form to sign)
GRADING POLICIES:
Although the lectures are the same for all students, the required work for
homework assignments and the CPP differs by course, because 11-741 is a
12-unit PhD-level course, 11-641 is a 12-unit Master-level course, and 11-441
is a 9-unit undergraduate course. Graduate students can choose either 11-741 or
11-641; undergraduate students should take 11-441. Exceptions are possible if
approved by the instructor. The table below lists the credits (in %) of course
work, including quizzes, CPP and HW assignments, respectively. Quizzes are
required for all students, amounting to 15% of the total credits for each
student. CPP oral presentations are required only for 11-741/641 students, but
peer reviews are required for all students. For HW assignments, 11-741 students
are required to do all of them; 11-641 students may choose a subset whose
percentages sum to 70%; 11-441 students may also choose a subset whose
percentages sum to 75%. If the chosen subset has a total percentage exceeding
the required total (70% for 11-641 and 75% for 11-441), we will discard the
score(s) of the student's worst-performing HW assignment(s). For example, if an
11-641 student's k best-scored HW assignments have a percentage total of 65%,
then the remaining 5% will come from the (k+1)-th HW assignment in the list
sorted by the student's actual HW scores.
Course work | 11-741 (PhD Level) | 11-641 (MS Level) | 11-441 (UG Level)
Quizzes (Mandatory) | 15% | 15% | 15%
CPP (Mandatory) | 15% Team Presentation + Peer Review | 15% Team Presentation + Peer Review | 10% Peer Review Only
HW0 | 0% | 0% | 0%
HW1 | 7% | 8.5% | 10%
HW2 | 10% | 12% | 14%
HW3 | 13% | 15% | 18%
HW4 | 7% | 8% | 10%
HW5 | 13% | 15% | 18%
HW6 | 7% | 8% | 10%
HW7 | 13% | 15% | 18%
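The rule for 11-641 and 11-441 students (count the best-scored assignments up to the required total, with the last one counted only partially) can be illustrated with a short Python sketch. This is one reading of the policy above, not the official grading script: the function name `homework_credit`, the representation of scores as fractions in [0, 1], and the example scores are all assumptions made for illustration.

```python
def homework_credit(weights, scores, required_total):
    """Return the homework credit (in percentage points) earned by one student.

    weights: dict mapping HW name -> percentage weight from the grading table
    scores: dict mapping submitted HW name -> score as a fraction in [0, 1]
    required_total: required sum of HW percentages (e.g. 70 for 11-641)
    """
    # Rank the submitted assignments from best to worst score.
    ranked = sorted(scores, key=lambda hw: scores[hw], reverse=True)
    remaining = required_total
    credit = 0.0
    for hw in ranked:
        if remaining <= 0:
            break  # worse-scored assignments beyond the required total are discarded
        counted = min(weights[hw], remaining)  # the last assignment may count partially
        credit += counted * scores[hw]
        remaining -= counted
    return credit


# 11-641 weights from the table; the scores below are made up for illustration.
weights_641 = {"HW1": 8.5, "HW2": 12, "HW3": 15, "HW4": 8,
               "HW5": 15, "HW6": 8, "HW7": 15}
scores = {"HW4": 1.0, "HW5": 0.95, "HW3": 0.9, "HW7": 0.85, "HW2": 0.8, "HW6": 0.5}
credit = homework_credit(weights_641, scores, required_total=70)
print(round(credit, 2))
```

In this made-up case the five best-scored assignments cover 65% of the credit, so only 5 of the 8 percentage points of HW6 are counted, which mirrors the example in the policy text.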
1) QUIZZES:
We will have 5 to 6 quizzes over the course, each of which takes about 10 to 20
minutes and focuses on the contents of the most recent few lectures. The
quizzes should be relatively easy; students who actively participate in the
lectures are expected to do well on them.
2) CAPSTONE PROJECT PROPOSAL (CPP)
Graduate students (in 11-641 or 11-741) will be organized into teams. Each team
will be assigned one of the topics listed below (together with the assigned
papers per topic) and will give a classroom oral presentation during the
semester. The teaming and topic assignments will be done by the TAs/instructor,
based on the preferences submitted by individual students (each person may
submit 3 topics in order of preference; see the CPP guidelines). The oral
presentations and the written reports will be peer-reviewed.
· Topic 1. Semi-supervised Learning: Kingma et al., NIPS 2014; Salimans et al., NIPS 2016; Miyato et al., ICLR 2017
· Topic 2. Link Prediction in Citation Networks: Chang & Blei, AISTATS 2009; Backstrom & Leskovec, WSDM 2011; Liu & Yang, ICML 2015
· Topic 3. Collaborative Filtering: Karatozoglou, RecSys 2010; Zhou et al., SIGIR 2011; Zheng et al., ICML 2016
· Topic 4. Knowledge-base Completion: Liu et al., ICML 2017; Trouillon et al., ICML 2016; Nickel et al., AAAI 2016; Yang et al., ICLR 2015; Bordes et al., NIPS 2013
· Topic 5. Graph Embedding: Perozzi et al., KDD 2014; Tang et al., WWW 2015; Cao et al., CIKM 2015; Grover & Leskovec, KDD 2016
· Topic 6. Question Answering and Reading Comprehension: Rajpurkar et al., EMNLP 2015; Chen et al., ACL 2016; Jia et al., EMNLP 2017
· Topic 7. Seq2Seq Models: Bahdanau et al., ICLR 2015; Kim et al., ICLR 2017; Vaswani et al., NIPS 2017
· Topic 8. Deep Learning for Text Classification: Kim, EMNLP 2014; Zhang et al., NIPS 2015; Johnson & Zhang, ICML 2016; Yang et al., NAACL 2016
· Topic 9. Deep Learning for Sentiment Detection: Maria et al., SemEval 2016; Wang et al., COLING 2016; Santos & Gatti, COLING 2014; Wang et al., EMNLP 2016
3) HOMEWORK ASSIGNMENTS: Hands-on experience with clustering, recommender
systems, link analysis, classification, learning-to-rank and significance
testing, etc.
· HW0: A problem-solving set for self-assessment by students as well as for
  checking the related background. The answers are for reference only, not for
  grading.
· HW1: A problem-solving set using a modified version of HW0, to improve the
  related background in matrix algebra, calculus and probability.
· HW2: A programming assignment for link analysis with PageRank, Personalized
  PageRank and Query-Sensitive PageRank, and evaluating the retrieval results
  of these methods on the CiteEval dataset.
· HW3: A programming assignment for collaborative filtering, focusing on
  implementing memory-based and model-based methods to predict the ratings of
  movies, and evaluating them on a subset of the Netflix Prize dataset.
· HW4: A (lighter) programming assignment for text classification with Support
  Vector Machines (SVMs) and a stochastic gradient descent (SGD) training
  algorithm on processed data.
· HW5: A programming assignment for text classification (rating prediction) on
  a large dataset of Yelp reviews. The implementation of a multi-class logistic
  regression (softmax) method is required, while existing software (LIBLINEAR)
  can be used for SVM.
· HW6: A problem-solving set for hands-on exercise with statistical
  significance tests, including the sign test, t-test, proportion test,
  signed-rank test and rank-sum test.
· HW7: A programming assignment for text classification on the same Yelp review
  dataset with deep learning, including word embedding, convolutional neural
  net (CNN) and recurrent neural net (RNN) components. Existing software like
  TensorFlow or Keras can be used.
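To give a flavor of the link-analysis methods named in HW2, here is a minimal pure-Python sketch of PageRank by power iteration, with an optional teleport distribution that turns it into Personalized PageRank. It is not the HW2 specification: the function name `pagerank`, the damping factor of 0.85, the handling of dangling nodes (their mass is redistributed according to the teleport distribution) and the toy graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, teleport=None, iters=100, tol=1e-10):
    """Power-iteration PageRank on a small graph.

    links: dict mapping node -> list of nodes it links to
    teleport: optional dict node -> weight; None means uniform (plain PageRank),
              a non-uniform dict gives Personalized PageRank
    """
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(nodes)
    if teleport is None:
        teleport = {u: 1.0 for u in nodes}
    total = sum(teleport.get(u, 0.0) for u in nodes)
    tele = {u: teleport.get(u, 0.0) / total for u in nodes}  # normalized distribution

    rank = {u: 1.0 / n for u in nodes}  # start from the uniform distribution
    for _ in range(iters):
        # Mass sitting on nodes without out-links (assumed to follow the teleport vector).
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new = {u: (1.0 - damping) * tele[u] + damping * dangling * tele[u] for u in nodes}
        for u in nodes:
            outs = links.get(u, [])
            for v in outs:
                new[v] += damping * rank[u] / len(outs)  # split rank over out-links
        change = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if change < tol:
            break
    return rank


# Toy 3-node cycle (hypothetical data): a -> b -> c -> a
graph = {"a": ["b"], "b": ["c"], "c": ["a"]}
uniform = pagerank(graph)                           # plain PageRank
personalized = pagerank(graph, teleport={"a": 1.0})  # teleport only to node "a"
```

On this symmetric cycle the plain scores are equal for all three nodes, while teleporting only to "a" ranks "a" above "b" and "b" above "c". Query-sensitive variants typically combine several such topic-specific teleport vectors, which is beyond this sketch.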
Course Syllabus (Slides)
# | Date | Lecture | Reading | Homework
1 | 15-Jan | Course overview and introduction | - | HW0 (link) (submission is not required; receiving feedback if submitted)
2 | 17-Jan | High-dimensional vectors and scalable indexing | IR: Ch 1.1, 1.2, 6.1, 6.2, 6.3 | HW1 (link) (Due 1/23 11:59PM)
3 | 22-Jan | Link Analysis 1: HITS and PageRank | IR: Ch 21 | -
4 | 24-Jan | Link Analysis 2: Personalized and Topic-sensitive PageRank | IR: Ch 8.1 – 8.4; Haveliwala, WWW 2002 | -
5 | 29-Jan | Link Analysis 3: Evaluation of Ranked Lists; Eigensystems (1st part of the 08SVD slides) | IR: Ch 8 | -
- | 31-Jan | (Class canceled due to cold weather) | - | -
6 | 2/5 | Quiz 1 (Lec 2 – 5); Collaborative Filtering (CF1): Memory-based and Item-based | - | -
7 | 2/7 | CF2: Modeling Latent Factors | - | -
8 | 2/12 | Matrix Factorization 1: SVD (Singular Value Decomp.) | IR: Ch 18 | HW3 (link) (Due 2/25 11:59PM)
9 | 2/14 | Matrix Factorization 2: Nonnegative, Probabilistic | - | -
10 | 2/19 | CF3: Social-media Analysis | - | -
11 | 2/21 | Quiz 2 (Lec 6 – 10); Classification 1. Support Vector Machine (SVM) | ML textbook by Bishop: Ch 7.1 (pp 325-345) | -
12 | 2/26 | Stochastic Gradient Descent (SGD) | - | -
13 | 2/28 | Classification 2. Logistic Regression (LR) | IR: Ch 15 | -
14 | 3/5 | Classification 3. Extended Concepts; Evaluation Metrics | - | -
15 | 3/7 | Classification 4. Extreme Classification with Structured Learning | - | -
- | 3/11-15 | Spring Break (No Classes) | - | -
16 | 3/19 | Quiz 3 (Lec 11 – 15); Graph-based Learning 1 | - | -
17 | 3/21 | Graph-based Learning 2 (invited lecture by Yuexin Wu) | - | -
18 | 3/26 | Significance Testing 1: sign tests, t-tests | - | -
19 | 3/28 | Significance Testing 2: signed rank, rank sum | - | HW6 (link, resources & template) (Due 4/30 11:59PM)
20 | 4/2 | Significance Testing 3: permutation tests, etc. | - | -
21 | 4/4 | Quiz 4 (Lec 16 – 20); Deep Learning 1: Word Embedding | - | -
22 | 4/9 | Deep Learning 2: Recurrent Neural Network | - | -
- | 4/11 | No Class | - | -
23 | 4/16 | Deep Learning 3: Convolutional Neural Networks | - | HW7 (link, data, report template) (Due 4/30 11:59PM)
24 | 4/18 | Quiz 5 (Lec 21 – 23); Deep Learning 4: Contextualized Text Representations (invited lecture by Zihang Dai) | - | -
- | 4/23 | CPP Oral (Two Teams) | - | -
- | 4/25 | CPP Oral (Two Teams) | - | -
- | 4/30 | CPP Oral (Two Teams) | - | -
- | 5/2 | CPP Oral (Two Teams) | - | -
Take care of yourself. Do your best to maintain a healthy lifestyle this
semester by eating well, exercising, avoiding drugs and alcohol, getting enough
sleep and taking some time to relax. This will help you achieve your goals and
cope with stress.

All of us benefit from support during times of struggle. You are not alone.
There are many helpful resources available on campus, and an important part of
the college experience is learning how to ask for help. Asking for support
sooner rather than later is often helpful.