Web Search and Mining

Fall 2012


Announcements

Nov 20: Solution for Homework 3 posted.
Nov 02: Homework 3 posted. Due on Nov 09.
Nov 01: Solution for Homework 2 posted.
Oct 24: Homework 2 posted. Due on Oct 31.
Oct 16: Project posted.
Oct 16: Solution for Homework 1 posted.
Sep 27: Homework 1 posted. Due on Oct 10.
Sep 11: Course website launched.



Instructor

Wu-Jun Li (Rm 3-537, SEIEE Building; 34206661)
Office Hours: Thur 10:00am - 11:00am


Teaching Assistant

Zhi-Qin Yu


Lecture Time and Venue

Wed 10:00 - 10:45 & 10:55 - 11:40
Fri 12:55 - 13:40 & 14:00 - 14:45
Rm 308, Rui-Qiu Chen Building (陈瑞球楼308)



Textbook

[IIR]: Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

The English reprint edition (英文影印版) can be bought through China-Pub. You can also download it from the book website.


Reference Books

[SE]: Bruce Croft, Donald Metzler, and Trevor Strohman. Search Engines: Information Retrieval in Practice. Addison-Wesley, 2009.
(The English reprint edition can be bought through China-Pub.)

[WDM]: Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents and Usage Data. Springer, 2006.

[DM]: Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, Second Edition, 2006.
(The English reprint edition can be bought through China-Pub.)

[ESL]: Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, Second Edition, 2009.

[PRML]: Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.


Course Topics and Schedule (tentative)

(I thank Christopher D. Manning for allowing me to use and modify his slides. The slides for Lec 1 are mainly adapted from those provided by Bruce Croft.)





Sep 12

Introduction: Web search overview, Web crawling and indexes

IIR Ch. 19 - 20
SE Ch. 1 - 3
Web Crawling (from Bing Liu)

Sep 14

Boolean retrieval

IIR Ch. 1

Sep 19

The term vocabulary and postings lists

IIR Ch. 2

Sep 21

Dictionaries and tolerant retrieval

IIR Ch. 3

Sep 26

Index construction and compression

IIR Ch. 4 - 5

Sep 28

Scoring, term weighting, and the vector space model

IIR Ch. 6

Oct 10

Computing scores in a complete search system

IIR Ch. 7

Oct 12

Evaluation and relevance feedback

IIR Ch. 8 - 9

Oct 17

Probabilistic information retrieval

IIR Ch. 11

Oct 19

Language models

IIR Ch. 12

Oct 21

Form groups and select a paper (a topic), then send the group and paper information to the TA. Deadline: 23:59.

Oct 24

Matrix factorization and latent semantic indexing

IIR Ch. 18

Oct 26

Link analysis: PageRank and HITS

IIR Ch. 21

Oct 31

Supervised learning: classification

IIR Ch. 13 - 15

Nov 02

Unsupervised learning: clustering

IIR Ch. 16 - 17




Prerequisites

Data structures, design and analysis of algorithms, linear algebra, probability theory


Grading Scheme

1. In-class quizzes (30%)

2. Homework (30%)

Homework 1
Homework 2
Homework 3

3. Project + presentation (40%)



Late Assignments

Assignments turned in late will be penalized 20% per late day.


Academic Honor Code

Honesty and integrity are central to academic work. All your submitted assignments must be entirely your own (or your own group's). Any student found cheating or plagiarizing will receive a final score of zero for this course.


