11-741: Information Retrieval

Description:

This course studies the theory, design, and implementation of text-based information systems. The Information Retrieval core components of the course include statistical characteristics of text, representation of information needs and documents, several important retrieval models (Boolean, vector space, probabilistic, inference net, language modeling), clustering algorithms, automatic text categorization, and experimental evaluation. The software architecture components include design and implementation of high-capacity text retrieval and text filtering systems. A variety of current research topics are also covered, including cross-lingual retrieval, document summarization, machine learning, topic detection and tracking, and multi-media retrieval.

Prerequisites:

  • Programming and data structures at the level of 15-211 or higher.
  • Algorithms comparable to the undergraduate CS algorithms course (15-451) or higher.
  • Basic linear algebra (21-241 or 21-341).
  • Basic statistics (36-202) or higher.

Time & Location:

TR 12:00-1:20pm, Wean Hall 4623

Instructors:

Jamie Callan and Yiming Yang

Instructor Office Hours:

By appointment

Teaching Assistant(s):

Abhi Lad and Grace Hui Yang

TA Office Hours:

By appointment, and
Tuesday, 2:00-3:00, NSH 4506 (Grace),
Wednesday, 2:30-3:30, NSH 3612C (Abhi)

Textbook:

The textbook is Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Cambridge University Press, 2008 (November 17, 2007 draft). You may view the textbook online or print your own copy. The instructors will also arrange for a limited number of bound copies to be available for purchase at the CMU bookstore.

Other Readings:

Selected papers or book chapters will be assigned reading for some lectures. All will be available online and/or on reserve in the Engineering and Science Library, 4th floor, Wean Hall. Some of the books used in the course are listed below.

  • Hastie: The Elements of Statistical Learning. T. Hastie, R. Tibshirani, and J. Friedman. (2001) Springer. New York.
  • MG: Managing Gigabytes. I.H. Witten, A. Moffat, and T.C. Bell. 2nd edition. (1999), Morgan Kaufmann.
  • SNLP: Foundations of Statistical Natural Language Processing. C. Manning and H. Schütze. (1999), MIT Press.

Course Notes:

Usually available online, occasionally distributed in lectures. Online access is restricted to the .cmu.edu domain. CMU people can get access from outside .cmu.edu (e.g., from home) using VPN or CMU's WebVPN Service.

Homework:

One brief reading summary per week (1/2 to 1 page) and 5 problem sets or programming assignments. This is subject to change (but it probably won't). See the submission guidelines.

Grading:

Grades will be based on 5 problem sets / programming assignments (10% each, 50% total), weekly summaries of readings (10% total), a midterm exam (20%), and a final exam (20%).
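As a quick check that the weights above add up to 100%, here is a minimal sketch of how a weighted course score could be computed. The weights come from the grading policy; the function name, the 0-100 scale, and the averaging of the five assignments are assumptions made for illustration, and nothing here describes how the instructors actually assign letter grades.

```python
# Weights are taken from the grading policy; everything else is illustrative.
WEIGHTS = {
    "homework": 0.50,   # 5 assignments at 10% each
    "summaries": 0.10,  # weekly reading summaries
    "midterm": 0.20,
    "final": 0.20,
}


def course_score(homework_scores, summaries, midterm, final):
    """Combine component scores (each assumed to be on a 0-100 scale)."""
    if len(homework_scores) != 5:
        raise ValueError("expected exactly 5 homework scores")
    # Equal 10% weights mean the homework share is 50% of the average.
    homework_avg = sum(homework_scores) / len(homework_scores)
    return (
        WEIGHTS["homework"] * homework_avg
        + WEIGHTS["summaries"] * summaries
        + WEIGHTS["midterm"] * midterm
        + WEIGHTS["final"] * final
    )


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    print(course_score([80, 90, 70, 100, 60], summaries=100, midterm=50, final=75))
```

Because each assignment carries the same 10% weight, averaging the five scores and applying a single 50% weight is equivalent to weighting them individually.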

Course Policies:

Attendance, Late homework, Cheating, Recording & videotaping

Sitting In:

Approval from the instructors is required.

Syllabus:

The anticipated syllabus is below. It is subject to change.
 

Lecture | Day | Important Events | Topic | Readings
1. | 1/15 |  | Course overview (pdf) |
2. | 1/17 |  | Introduction to ad-hoc search: Boolean retrieval (pdf) | Ch 1
3. | 1/22 |  | Text representation (pdf) | Ch 2.0-2.2
4. | 1/24 |  | Text representation (pdf), index construction (pdf) | Ch 4
5. | 1/29 |  | Index construction (pdf) | Ch 2.3-2.4, 3.2, 5.1, 5.3
6. | 1/31 |  | Index construction (pdf); web indexing (pdf) |
7. | 2/5 |  | Information needs and queries (pdf). Class is moved to NSH 1305, this day only |
8. | 2/7 | HW1 out | Evaluation (pdf) | Ch 8
9. | 2/12 |  | Retrieval models: Vector space (pdf), Relevance feedback (pdf) | Ch 9
10. | 2/14 | HW1 due, HW2 out | Query expansion (pdf), Retrieval models: Probabilistic model (pdf) | Ch 11
11. | 2/19 |  | Retrieval models: Statistical language models (pdf) | Ch 12, Zhai & Lafferty
12. | 2/21 |  | Retrieval models: Structured documents, inference network (pdf) | Ch 10; Metzler
13. | 2/26 |  | Dimensionality reduction Lectures (pdf) | Ch 18
14. | 2/28 |  | Retrieval models: Hypertext (pdf) | Kleinberg, ACM-SIAM'98; Ng, et al., IJCAI'01
15. | 3/4 | HW2 due | Search log analysis (pdf) | Agichtein
 | 3/6 |  | Midterm Exam | 2008 midterm, 2007 midterm, 2006 midterm
 | 3/11 |  | Spring Break! |
 | 3/13 |  | Spring Break! |
16. | 3/18 |  | Query classification, federated search (pdf) | Callan (R9)
17. | 3/20 | HW3 out | Collaborative filtering (pdf) | Shardanand, CHI'95; Si & Jin, ICML'03 (R9)
18. | 3/25 |  | Learning empirical associations (pdf) | Yang ICML'97; Forman, JMLR'03 (R10)
19. | 3/27 |  | Document clustering I (pdf) | Ch 16 (R11)
20. | 4/1 | HW3 due, HW4 out, R11 due | Document clustering II (pdf) | Ch 17 (R11)
21. | 4/3 |  | Information extraction: Hidden Markov Models (pdf) | Seymore, AAAI'99 workshop (R12)
22. | 4/8 | R12 due | Introduction to text categorization (pdf) | Ch 13 (R10)
23. | 4/10 |  | Naive Bayes methods (pdf) | McCallum & Nigam, AAAI Workshop, 1998 (R12)
24. | 4/15 | HW4 due, HW5 out, R13 due | Significance tests (pdf) | Yang & Liu SIGIR'99 (R13)
 | 4/17 |  | Mid Semester Break |
25. | 4/22 | R14 due | Linear regression (pdf) | Ch 14 (R14)
26. | 4/24 |  | Nearest neighbor (pdf) | Goldberger, NIPS '04 (R14)
27. | 4/29 | R15 due | Support Vector Machines (pdf) | Ch 15 (R15)
 | 4/30 | HW5 due |  |
28. | 5/1 |  | Large-scale text categorization (pdf) | Yang et al. SIGIR'03; Liu et al. SIGKDD'05 (R15)
 | 5/12 |  | Final Exam, 1:00-4:00pm, PH125C | 2005 final; 2006 final


Updated on March 25, 2008.

Jamie Callan, Yiming Yang