Hypertext retrieval and mining
(CS-610, Spring 2003)

Staff

Instructor: Soumen Chakrabarti
TAs: ~~TBD~~ none

Time and place

Lectures: F12/CSE Tu 3:30--5, Th 6--7:30
There will also be a newsgroup, iitb.courses.cse610, which you should actively read.

Syllabus

The first part of the course will draw from this book. Once we cover sufficient ground we will read several papers written in the more recent past. A collection of these papers will be posted on an IIT internal site.

Resources

NEW! Instructors can access a password-protected area; send an email to get a password. This area will contain class notes, lecture slides, and past exams.
WEKA
, assisted by Roger Menezes.
Half-baked notes on .

Evaluation (tentative plan)

5	Homeworks
5	Presenting papers
10	Short project
10	Midterm exam
15	Quiz
15	Long project
40	Final exam

Homeworks

Lecture calendar

I hope this will help you while revising the material. Extra classes are shown in color. ~~Canceled classes~~ are struck out. Thanks to Vijay Krishnan for chronicling some of these.

2003-01-02

Course outline
Introduction to machine learning for hypertext and applications

2003-01-07

Networking basics
HTTP and HTML, hyperlinks
Introduction to crawling
DNS lookup and prefetching
Spider traps and how to avoid them

2003-01-09

Hash functions, MD5
Duplicate page elimination
Jaccard coefficient, approximate similarity, shingles
Web monitoring, detecting changes in the web
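
For revision, the shingle-based similarity from this lecture can be sketched roughly as follows. This is a minimal sketch rather than the code used in class: the function names `shingles` and `jaccard`, the window size `w=3`, and the two sample sentences are made up for illustration, and it compares exact shingle sets instead of the sampled (approximate) version.

```python
def shingles(text, w=3):
    """Return the set of w-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < w:
        # Short documents yield a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def jaccard(a, b):
    """Jaccard coefficient: size of intersection over size of union."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


# Two near-duplicate sentences differing in one word (hypothetical data).
doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"
similarity = jaccard(shingles(doc1), shingles(doc2))
print(similarity)
```

One changed word disturbs three of the seven shingles in each sentence, which is why the similarity drops well below 1 even though the texts are almost identical.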

2003-01-14

Ad hoc search, inverted index, Boolean queries, stopwords
Recall and precision, average and interpolated precision
Documents and queries as vectors in TFIDF space
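
As a revision aid, ranking documents against a query in TFIDF space might look like the sketch below. It assumes one common weighting (raw term frequency times log(N/df)) and cosine similarity; other variants exist, and the names `idf_table`, `vectorize`, `cosine` and the toy collection are hypothetical.

```python
import math
from collections import Counter


def idf_table(docs):
    """IDF per term: log(N / number of documents containing the term)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def vectorize(tokens, idf):
    """Sparse TFIDF vector; terms unseen in the collection are dropped."""
    tf = Counter(tokens)
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def cosine(u, v):
    """Cosine between two sparse vectors (0.0 if either vector is zero)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy collection of already-tokenized documents (made-up data).
docs = [
    ["web", "crawler", "spider"],
    ["web", "search", "index"],
    ["cluster", "document", "vector"],
]
idf = idf_table(docs)
query = vectorize(["web", "index"], idf)
scores = [cosine(query, vectorize(doc, idf)) for doc in docs]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The rare term "index" carries more weight than the more frequent "web", so the document containing both query terms is ranked first.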

2003-01-15

Rare and frequent terms, review of TFIDF
Personalization, relevance feedback, 2-shot ranking
Indices for Boolean and positional search
Encoding and compressing indices: binary, unary, Golomb codes
Ping pong protocol for index refreshing
Issues in incremental index maintenance
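
The unary and Golomb codes from this lecture can be sketched as follows. This is an illustrative sketch under stated assumptions: the quotient is written as ones terminated by a zero (the opposite convention is also used), the remainder uses truncated binary, and the function names and the parameter `m = 4` are chosen only for the example.

```python
def unary(q):
    """Unary code: q ones followed by a terminating zero."""
    return "1" * q + "0"


def truncated_binary(r, m):
    """Encode remainder r (0 <= r < m) in either b-1 or b bits."""
    if m == 1:
        return ""  # the remainder is always 0, so nothing to write
    b = (m - 1).bit_length()      # ceil(log2(m)) for m >= 2
    cutoff = (1 << b) - m         # this many remainders get the short code
    if r < cutoff:
        return format(r, "0{}b".format(b - 1))
    return format(r + cutoff, "0{}b".format(b))


def golomb_encode(n, m):
    """Golomb code of a non-negative integer n with parameter m >= 1."""
    return unary(n // m) + truncated_binary(n % m, m)


def golomb_decode(bits, m):
    """Decode a single Golomb codeword produced by golomb_encode."""
    q = bits.index("0")           # number of leading ones
    pos = q + 1
    if m == 1:
        return q
    b = (m - 1).bit_length()
    cutoff = (1 << b) - m
    r = int(bits[pos:pos + b - 1], 2) if b > 1 else 0
    if r >= cutoff:               # long codeword: reread using b bits
        r = int(bits[pos:pos + b], 2) - cutoff
    return q * m + r


# Gaps between sorted document IDs in a posting list (hypothetical IDs).
doc_ids = [3, 7, 8, 15]
gaps = [b - a for a, b in zip([0] + doc_ids, doc_ids)]
codes = [golomb_encode(g, 4) for g in gaps]
print(gaps, codes)
```

Small gaps get short codewords, which is why the index stores gaps between document IDs rather than the IDs themselves.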

2003-01-21 (Aru)

XML/graph search
"Information Unit" paper, Steiner trees
Introduction to BANKS

2003-01-23 (Aru)

More about BANKS and keyword search in graphs

2003-01-28 (Aru/Rushi)

Ranking using biased random walks
General tools for searching RDBMS, XML, HTML, text

2003-01-31

Similarity joins via matrix multiplication
Fast approximate matrix multiplication (Cohen and Lewis)
Introduction to clustering

2003-02-06

Hierarchical agglomerative clustering, dendrograms.
Top-down clustering algorithms: k-means, Kohonen maps.
Overview of the Cohen and Lewis paper.
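
For revision, the basic k-means loop can be sketched as below. This is a minimal sketch, assuming squared Euclidean distance, a fixed number of iterations, and caller-supplied initial centroids; the helper names and the six sample points are made up for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(cluster):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        centroids = [mean(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids, clusters


# Two well-separated groups of 2-D points (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
```

The result depends on the initial centroids; with a poor initialization the loop can settle in a worse local optimum.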

2003-02-11

Kohonen maps.
Fastmap.
Cluster-preserving projections.

2003-02-13

Generative models for documents.
Multivariate binary and multinomial models.
Vector space model.

2003-02-25

Midterm solutions.
Generative models, continued.
Mixture models.

2003-03-04

Expectation maximization (EM) algorithm.
Proof of convergence of EM.

2003-03-06

Generative multi-topic document models.
Multiple cause mixture models.
Aspect model.

2003-03-11

Aspect model, continued.
Parameter estimation for the aspect model using EM.
Folding-in, modeling noise.

2003-03-12

Latent semantic indexing.
Intro to supervised learning (classification).

2003-03-13

Evaluation of classifiers: recall, precision, micro- and macro-average, F1, breakeven.
Nearest neighbor classifier using TFIDF representation.
Improved performance via pre-clustering.
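
The difference between micro- and macro-averaging can be seen in a small sketch like the one below. The per-class counts are invented for illustration, the function name `prf` is hypothetical, and undefined ratios are set to 0.0 by assumption.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# (tp, fp, fn) per class; a large class and a small one (made-up numbers).
counts = {"sports": (8, 2, 2), "politics": (1, 1, 3)}

# Macro-average: compute F1 per class, then average the scores.
macro_f1 = sum(prf(*c)[2] for c in counts.values()) / len(counts)

# Micro-average: pool the counts over classes, then compute F1 once.
pooled = [sum(c[i] for c in counts.values()) for i in range(3)]
micro_f1 = prf(*pooled)[2]

print(macro_f1, micro_f1)
```

Macro-averaging weighs every class equally, so the poorly classified small class pulls it down; micro-averaging is dominated by the large class.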

2003-03-20

Generative vs. discriminative classification.
Binary vs. multinomial naive Bayes classification.
Using small-degree Bayesian networks.
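
A multinomial naive Bayes classifier can be sketched as follows for revision. This is a minimal sketch under assumptions not stated in the lecture list: add-one (Laplace) smoothing, log-probabilities to avoid underflow, and test words outside the training vocabulary ignored. The function names and the four training documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs is a list of (tokens, label) pairs; returns the model counts."""
    class_docs = Counter(label for _, label in docs)
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab


def classify(tokens, model):
    """Return the label maximizing log prior plus summed log term likelihoods."""
    class_docs, term_counts, vocab = model
    n_docs = sum(class_docs.values())
    scores = {}
    for label, doc_count in class_docs.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(doc_count / n_docs)
        for t in tokens:
            if t in vocab:  # words never seen in training are skipped
                # Add-one smoothing keeps unseen (term, class) pairs nonzero.
                score += math.log((term_counts[label][t] + 1)
                                  / (total_terms + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


training = [
    ("goal match team".split(), "sports"),
    ("match score win".split(), "sports"),
    ("vote election party".split(), "politics"),
    ("party vote win".split(), "politics"),
]
model = train(training)
print(classify("team win match".split(), model))
print(classify("election vote".split(), model))
```

In the multinomial model a word contributes once per occurrence, whereas the binary model would only record whether each vocabulary word is present or absent.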

2003-03-25

Feature design.
Feature selection methods: accumulate/drop.
Two measures of the predictive power of features.

Copyright statement

The copyrights to documents and software referred to herein remain with the respective copyright holders. The copyright to this specific course organization, course handouts, and exams is owned by IIT Bombay.
