Hypertext retrieval and
(CS-610, Spring 2003)
Time and place
- Lectures: F12/CSE Tu 3:30--5, Th 6--7:30
- There will also be a newsgroup
iitb.courses.cse610 which you should actively read.
The first part of the course will draw from
Once we cover sufficient ground we will read several
papers written in the more recent past. A collection
of these papers will be posted on an IIT internal site.
Instructors can access a
protected area by sending an email to
get a password. This area will contain class notes,
lecture slides, and past exams.
- , assisted by Roger Menezes.
- Half-baked notes on .
Evaluation (tentative plan)
I hope this will help you while revising the material.
are shown in color.
Canceled classes are struck out.
Thanks to Vijay Krishnan for chronicling some of these.
- Course outline
- Introduction to machine learning for hypertext and applications
- Networking basics
- HTTP and HTML, hyperlinks
- Introduction to crawling
- DNS lookup and prefetching
- Spider traps and how to avoid
- Hash functions, MD5
- Duplicate page elimination
- Jaccard coefficient, approximate similarity, shingles
- Web monitoring, detecting changes in the web
- Adhoc search, inverted index, boolean queries, stopwords
- Recall and precision, average and interpolated precision
- Documents and queries as vectors in TFIDF space
- Rare and frequent terms, review of TFIDF
- Personalization, relevance feedback, 2-shot ranking
- Indices for boolean and positional search
- Encoding and compressing indices: binary, unary, Golomb codes
- Ping pong protocol for index refreshing
- Issues in incremental index maintenance
- 2003-01-21 (Aru)
- XML/graph search
- "Information Unit" paper, steiner trees
- Introduction to BANKS
- 2003-01-23 (Aru)
- More about BANKS and keyword search in graphs
- 2003-01-28 (Aru/Rushi)
- Ranking using biased random walks
- General tools for searching RDBMS, XML, HTML, text
- Similarity joins via matrix multiplication
- Fast approximate matrix multiplication (Cohen and Lewis)
- Introduction to clustering
- Hierarchical agglomerative clustering, dendrograms.
- Top-down clustering algorithms: k-means, Kohonen maps.
- Overview of the Cohen and Lewis paper.
- Kohonen maps.
- Cluster-preserving projections.
- Generative models for documents.
- Multivariate binary and multinomial models.
- Vector space model.
- Midterm solutions.
- Generative models, continued.
- Mixture models.
- Expectation maximization (EM) algorithm.
- Proof of convergence of EM.
- Generative multi-topic document models.
- Multiple cause mixture models.
- Aspect model.
- Aspect model, continued.
- Parameter estimation for the aspect model using EM.
- Folding-in, modeling noise.
- Latent semantic indexing.
- Intro to supervised learning (classification).
- Evaluation of classifiers:
recall, precision, micro- and macro-average, F1, breakeven.
- Nearest neighbor classifier using TFIDF representation.
- Improved performance via pre-clustering.
- Generative vs. discriminative classification.
- Binary vs. multinomial naive Bayes classification.
- Using small-degree Bayesian networks.
- Feature design.
- Feature selection methods: accumulate/drop.
- Two measures for measuring predicting power of features.
The copyrights to documents and software referred to herein remain
with the respective copyright holders. The copyright to this specific
course organization, course handouts, and exams is owned by IIT