11-741/11-641/11-441 S19 Syllabus

Machine Learning and Text Mining

INSTRUCTOR: Prof. Yiming Yang, yiming [at] cs.cmu.edu

TIME AND LOCATION: TR, 12:00 - 1:20pm, HH B131

DESCRIPTION

This is a full-semester lecture-oriented course (12 units) for the PhD-level, MS-level and undergraduate students who meet the prerequisites. It offers a blend of core theory, algorithms, evaluation methodologies and applications of scalable data analytic techniques. Specifically, the covered topics include

·        Link Analysis

·        Collaborative Filtering

·        Social-media Analysis

·        Web-scale Text Classification

·        Learning to Rank for Information Retrieval

·        Deep Learning for Text Analysis

·        Matrix factorization (with SVD, non-negative and probabilistic matrix completion)

·        Stochastic gradient descent

·        Statistical significance tests

PREREQUISITES

·        CS courses on data structures, algorithms and programming (e.g. 15-213), linear algebra (e.g. 21-241 or 21-341) and intro probability (e.g. 21-325)

·        Intro Machine Learning (e.g., 10-701 or 10-601) and Algorithm Design and Analysis (e.g., 15-451) are not required but helpful

TEXTBOOKS (SOME CHAPTERS): Introduction to Information Retrieval (IR), Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze, Cambridge University Press. 2008. The textbook can be purchased at the CMU bookstore.

ONLINE READING AND LECTURE SLIDES: Selected additional readings are available online, restricted to the .cmu.edu domain. CMU people can get access from outside .cmu.edu (e.g., from home) using CMU's WebVPN Service. Access to the slides has the same requirement.

TEACHING ASSISTANTS AND OFFICE HOURS

·        Guokun Lai, guokun [at] cs.cmu.edu
Office Hour: Friday 13:30-14:30 at GHC 5417

·        Yuexin Wu, yuexinw [at] cs.cmu.edu
Office Hour: Thursday 16:00-17:00 at GHC 5417

EXAMS: We will not have midterm or final exams. Instead, we will have quizzes between lectures, and the Capstone Project Proposal (CPP) presentations in the later part of the course.

COURSE POLICIES: Late Homework, Laptops, Cheating (form to sign)

GRADING POLICIES: Although the lectures are the same for all students, the required homework and CPP work differ by course number, because 11-741 is a 12-unit PhD-level course, 11-641 is a 12-unit Master's-level course, and 11-441 is a 9-unit undergraduate course. Graduate students can choose either 11-741 or 11-641; undergraduate students should take 11-441. Exceptions are possible if approved by the instructor. The table below lists the credit (as a percentage of the total grade) for quizzes, the CPP and the HW assignments, respectively. Quizzes are required for all students and amount to 15% of the total credit for each student. CPP oral presentations are required only for 11-741/641 students, but peer reviews are required for all students. For HW assignments, 11-741 students are required to do all of them; 11-641 students may choose a subset whose percentages sum to 70%; 11-441 students may likewise choose a subset whose percentages sum to 75%. If the chosen subset has a total percentage exceeding the required total (70% for 11-641 and 75% for 11-441), we will discard the score(s) of the student's worst-performing HW assignment(s). For example, if the k best-scored HW assignments of an 11-641 student have a percentage total of 65%, then the remaining 5% will come from the (k+1)-th HW assignment in the list sorted by the student's actual HW scores.

| Course work | 11-741 (PhD Level) | 11-641 (MS Level) | 11-441 (UG Level) |
|---|---|---|---|
| Quizzes (Mandatory) | 15% | 15% | 15% |
| CPP (Mandatory) | 15% Team Presentation + Peer Review | 15% Team Presentation + Peer Review | 10% Peer Review Only |
| HW0 | 0% | 0% | 0% |
| HW1 | 7% | 8.5% | 10% |
| HW2 | 10% | 12% | 14% |
| HW3 | 13% | 15% | 18% |
| HW4 | 7% | 8% | 10% |
| HW5 | 13% | 15% | 18% |
| HW6 | 7% | 8% | 10% |
| HW7 | 13% | 15% | 18% |
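
The subset rule in the grading policy can be read as a simple "fill up to the cap" procedure. The sketch below is one possible interpretation, not the official grade calculator: it assumes each HW score is expressed as a fraction of points earned (0.0 to 1.0), that assignments are ranked by that fraction, and that the last counted assignment contributes only the weight still needed to reach the cap. The function name `homework_credit` and the example scores are hypothetical; the weights are the 11-641 column of the table above.

```python
from typing import Dict

# Homework weights (in percent) for 11-641, copied from the grading table.
HW_WEIGHTS_641: Dict[str, float] = {
    "HW1": 8.5, "HW2": 12, "HW3": 15, "HW4": 8,
    "HW5": 15, "HW6": 8, "HW7": 15,
}


def homework_credit(weights: Dict[str, float],
                    scores: Dict[str, float],
                    cap: float) -> float:
    """Return the homework credit in percent of the course grade.

    weights: HW name -> weight in percent
    scores:  HW name -> fraction of points earned (0.0 to 1.0),
             listing only the assignments the student submitted
    cap:     required homework total (70 for 11-641, 75 for 11-441)
    """
    # Assumption: "sorted list" means best-scored assignments first.
    ranked = sorted(scores, key=lambda hw: scores[hw], reverse=True)
    remaining = cap
    credit = 0.0
    for hw in ranked:
        if remaining <= 0:
            break  # cap reached; lower-scored assignments are discarded
        counted = min(weights[hw], remaining)  # partial weight at the cap
        credit += counted * scores[hw]
        remaining -= counted
    return credit


# Hypothetical scores for a student who submitted every assignment.
example_scores = {
    "HW1": 0.9, "HW2": 1.0, "HW3": 0.8, "HW4": 0.5,
    "HW5": 1.0, "HW6": 0.7, "HW7": 0.9,
}
print(round(homework_credit(HW_WEIGHTS_641, example_scores, cap=70), 2))
```

With these made-up scores, the five best assignments cover 65.5% of the weight, so only 4.5% of the sixth-ranked assignment (HW6) is counted and HW4 is dropped entirely, giving a homework credit of about 63.3 out of 70.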

1) QUIZZES: We will have 5 to 6 quizzes over the course, each of which takes about 10 to 20 minutes and focuses on the content of the most recent few lectures. The quizzes should be relatively easy; students who actively participate in lectures are expected to do well on them.

2) CAPSTONE PROJECT PROPOSAL (CPP)

Graduate students (in 11-641 or 11-741) will be organized into teams. Each team will be assigned one of the topics listed below (together with the assigned papers for that topic) and will give an oral presentation in class during the semester. The team and topic assignments will be made by the TAs/instructor, based on the topic preferences submitted by individual students (each person may submit 3 topics in order of preference; see the CPP guidelines). The oral presentations and the written reports will be peer-reviewed.

·        Topic 1. Semi-supervised Learning: Kingma et al., NIPS 2014; Salimans et al., NIPS 2016; Miyato et al., ICLR 2017

·        Topic 2. Link Prediction in Citation Networks: Chang & Blei, AISTATS 2009; Backstrom & Leskovec, WSDM 2011; Liu & Yang, ICML 2015

·        Topic 3. Collaborative Filtering: Karatozoglou, RecSys 2010; Zhou et al., SIGIR 2011; Zheng et al., ICML 2016

·        Topic 4. Knowledgebase Completion: Liu et al., ICML 2017; Trouillon et al., ICML 2016; Nickel et al., AAAI 2016; Yang et al., ICLR 2015; Bordes et al., NIPS 2013

·        Topic 5. Graph Embedding: Perozzi et al., KDD 2014; Tang et al., WWW 2015; Cao et al., CIKM 2015; Grover & Leskovec, KDD 2016

·        Topic 6. Question Answering and Reading Comprehension: Rajpurkar et al., EMNLP 2015; Chen et al., ACL 2016; Jia et al., EMNLP 2017

·        Topic 7. Seq2Seq Models: Bahdanau et al., ICLR 2015; Kim et al., ICLR 2017; Vaswani et al., NIPS 2017

·        Topic 8. Deep Learning for Text Classification: Kim, EMNLP 2014; Zhang et al., NIPS 2015; Johnson & Zhang, ICML 2016; Yang et al., NAACL 2016

·        Topic 9. Deep Learning for Sentiment Detection: Maria et al., SemEval 2016; Wang et al., COLING 2016; Santos & Gatti, COLING 2014; Wang et al., EMNLP 2016

3) HOMEWORK ASSIGNMENTS: Hands-on experience with clustering, recommender systems, link analysis, classification, learning-to-rank, significance testing, etc.

·        HW0: A problem solving set for self-assessment by students as well as for checking the related background. The answers are for reference only, not for grading.

·        HW1: A problem solving set using a modified version of HW0, to improve the related background in matrix algebra, calculus and probabilities.

·        HW2: A programming assignment for link analysis with PageRank, Personalized PageRank and Query Sensitive PageRank, and evaluating the retrieval results of these methods on the CiteEval dataset.

·        HW3: A programming assignment for collaborative filtering, focusing on implementing memory-based and model-based methods to predict movie ratings, and evaluating them on a subset of the Netflix Prize dataset.

·        HW4: A (lighter) programming assignment for text classification with Support Vector Machines (SVMs) and a stochastic gradient descent (SGD) training algorithm on processed data.

·        HW5: A programming assignment for text classification (rating prediction) on a large dataset of Yelp reviews. The implementation of a multi-class logistic regression (soft-max) method is required, while existing software (LIBLINEAR) can be used for SVM.

·        HW6: A problem-solving set for hands-on exercise with statistical significance tests, including the sign test, t-test, proportion test, signed-rank test and rank-sum test.

·        HW7: A programming assignment for text classification on the same Yelp review dataset with deep learning, including word embedding, convolutional neural net (CNN) and recurrent neural net (RNN) components. Existing software like TensorFlow or Keras can be used.

Course Syllabus (Slides)

| # | Date | Lecture | Reading | Homework |
|---|---|---|---|---|
| 1 | 1/15 | Course overview and introduction | | HW0 (link) (submission not required; feedback given if submitted) |
| 2 | 1/17 | High-dimensional vectors and scalable indexing | IR: Ch 1.1, 1.2, 6.1, 6.2, 6.3 | HW1 (link) (Due 1/23 11:59PM) |
| 3 | 1/22 | Link Analysis 1: HITS and PageRank | IR: Ch 21 | |
| 4 | 1/24 | Link Analysis 2: Personalized and Topic-sensitive PageRank | IR: Ch 8.1 – 8.4; Haveliwala, WWW 2002 | HW2 (link, resources) (Due 2/6 11:59PM) |
| 5 | 1/29 | Link Analysis 3: Evaluation of Ranked Lists; Eigensystems (1st part of the 08-SVD slides) | IR: Ch 8 | |
| | 1/31 | (Class canceled due to cold weather) | | |
| 6 | 2/5 | Quiz 1 (Lec 2 – 5); Collaborative Filtering (CF1): Memory-based and Item-based | Su & Khoshgoftaar, AAI 2009 | |
| 7 | 2/7 | CF2: Modeling Latent Factors | Si & Jin, ICML 2003 | |
| 8 | 2/12 | Matrix Factorization 1: SVD (Singular Value Decomp.) | IR: Ch 18 | HW3 (link) (Due 2/25 11:59PM) |
| 9 | 2/14 | Matrix Factorization 2: Non-negative, Probabilistic | Lee & Seung, NIPS 2001; Salakhutdinov & Mnih, NIPS 2007 | |
| 10 | 2/19 | CF3: Social-media Analysis | Chakrabarti et al., ICML 2014 | |
| 11 | 2/21 | Quiz 2 (Lec 6 – 10); Classification 1. Support Vector Machine (SVM) | ML textbook by Bishop: Ch 7.1 (pp 325-345) | |
| 12 | 2/26 | Stochastic Gradient Descent (SGD) | Shalev et al., ICML 2007 | HW4 (link, resources) (Due 3/6 11:59PM) |
| 13 | 2/28 | Classification 2. Logistic Regression (LR) | IR: Ch 15 | |
| 14 | 3/5 | Classification 3. Extended Concepts; Evaluation Metrics | Wikipedia on eval. metrics | |
| 15 | 3/7 | Classification 4. Extreme Classification with Structured Learning | Gopal & Yang, KDD 2013 | HW5 (link, write-up, resources) (Due 3/27 11:59PM) |
| | 3/11-15 | Spring Break (No Classes) | | |
| 16 | 3/19 | Quiz 3 (Lec 11 – 15); Graph-based Learning 1 | | |
| 17 | 3/21 | Graph-based Learning 2 (invited lecture by Yuexin Wu) | | |
| 18 | 3/26 | Significance Testing 1: sign tests, t-tests | Yang & Liu, SIGIR 1999 | |
| 19 | 3/28 | Significance Testing 2: signed rank, rank sum | | HW6 (link, resources & template) (Due 4/30 11:59PM) |
| 20 | 4/2 | Significance Testing 3: permutation tests, etc. | | |
| 21 | 4/4 | Quiz 4 (Lec 16 – 20); Deep Learning 1: Word Embedding | Sanjeev Arora (link1, link2); Xin Rong 2016 | |
| 22 | 4/9 | Deep Learning 2: Recurrent Neural Network | Denny Britz, 2015, RNN; Christopher Olah, 2015 | |
| | 4/11 | No Class | | |
| 23 | 4/16 | Deep Learning 3: Convolutional Neural Networks | Denny Britz, 2015, CNN; Adit Deshpande, 2016 | HW7 (link, data, report template) (Due 4/30 11:59PM) |
| 24 | 4/18 | Quiz 5 (Lec 21 – 23); Deep Learning 4: Contextualized Text Representations (invited lecture by Zihang Dai) | ELMO 2018; BERT 2018 | |
| | 4/23 | CPP Oral (Two Teams) | | |
| | 4/25 | CPP Oral (Two Teams) | | |
| | 4/30 | CPP Oral (Two Teams) | | |
| | 5/2 | CPP Oral (Two Teams) | | |

Take care of yourself. Do your best to maintain a healthy lifestyle this semester by eating well, exercising, avoiding drugs and alcohol, getting enough sleep and taking some time to relax. This will help you achieve your goals and cope with stress.

All of us benefit from support during times of struggle. You are not alone. There are many helpful resources available on campus and an important part of the college experience is learning how to ask for help. Asking for support sooner rather than later is often helpful.