Spring 2005

- Instructor
- Soumen Chakrabarti
- TA
- Vijay Krishnan
- Place
~~S9 CSE~~SIT 3rd floor classroom- Time
- Slot 8, but 8:45--10:15pm Tue, Wed
- Newsgroup
- iitb.courses.cs610

- Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval. Pearson Education.
- Chakrabarti, S. (2002). Mining the Web: Discovering knowledge from hypertext data. Morgan-Kaufman.
- Grossman, D. A. and Frieder, O. (1998). Information Retrieval: Algorithms and Heuristics. Kluwer.
- Witten, I. H., Moffat, A., and Bell, T. C. (1999). Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan-Kaufman.
- Slides from a Stanford course by Manning, Raghavan and Schutze.
- Material from ADFOCS 2004.

Keep watching the spreadsheet until grades are finalized. Sometimes this spreadsheet will be in an inconsistent state; don't panic (easily). There will be one midterm (20 marks), one quiz (20 marks), a final exam (40 marks), and a class project (20 marks). Regular students are expected to participate in all these evaluations. Each audit student has to either submit a project report or write the final exam.

Here are some rough guidelines for project evaluation (total, 20 marks):

- Clear understanding of the
**goals**of the project. (3 marks) - Adequate evidence of background
**reading**on the subject. (3 marks) - In case of an experimental project, completing the implementation of
a non-trivial and working
**system**. In case of a theoretical project, explaining to me the important**concepts**in the papers read. (5 marks) - Telling me something
**novel**that I don't know, in terms of either experimental or theoretical insight. (3 marks) - Quality of
**writeups**: slides and report with results and/or experience. (4 marks) - Regular contact and discussion about the project,
and
**timely**submission of results. (2 marks)

Check out some proposed projects. Ideally you should do them in teams of two (max three), and I will try to find enough projects to go around.

I hope this will help you while revising the material.
Extra classes are shown in color.
~~Canceled classes~~ are struck out. Lectures are 1.5h unless
mentioned otherwise. Some material to read are inlined in the
schedule, but there are others you can find on the Web.

- 2005-01-04 Tue 1h
- Course overview.

- 2005-01-07 Fri 1h
- IR basics
- Inverted indexes
- Dictionary and posting

- 2005-01-11 Tue
- Tokenization
- Stemming
- Positional indices and proximity queries
- Index compression and updates

- 2005-01-1? ???
- Boolean query processing
- Vector space model

- 2005-01-17 Mon
- Ranking in the vector space model
- Top-k ranking
- Min-hash and Jaccard

- 2005-01-18 Tue
- Shingling, sketches, fingerprints
- Find similar but not too similar
- Site-level mirror detection

~~2005-01-25 Tue~~- Canceled owing to LiveWire
- 2005-01-26 Tue
- Discussion of projects
- Supervised learning for text applications
- Clases, features, training, testing

- 2005-02-02 Wed
- Learning a classifier as optimizing loss
- Parametric models, differentiable loss functions
- Square loss and linear least square fit
- Overfitting, ridge penalty and lasso
- Binomial deviance loss and logistic regression
- Some partial reading material can be found here

- 2005-02-03 Thu
- Maximizing log likelihood
- Conditional and joint distributions
- Logistic regression
- Multivariate binary naive Bayes
- Parameter smoothing, Laplace correction

- 2005-02-08
- Guest lecture on spectral methods by Professor Ranade.

- 2005-02-09
- Guest lecture on spectral methods by Professor Ranade.

- 2005-02-17
- Linear discriminant
- Perceptron and margin maximization
- Locally weighted regression and kernels
- Generative model for unsupervised or semisupervised learning

- 2005-02-18
- The EM algorithm
- Mincut view of semisupervised learning
- Mincost coloring and relaxation labeling views of semisupervised learning
- A parametric semisupervised learning approach

- 2005-02-23
- 2005-03-01
- Limitations of basic multinomial mixture model.
- Inherent multi-topic nature of documents.
- Crude marginal distributions for naive models and how to improve them.

- 2005-03-02
- Complete the Teevan and Karger paper.
- Plate diagrams for multi-topic documents.
- Aspect model.
- Pitfalls of the aspect model (guest lecture by Vijay Krishnan).

- 2005-03-08
- Non-negative matrix factorization.
- Square loss and KL divergence as approximation objectives.
- Application to text clustering.
- Aspect model as non-negative factorization.
- Similarity of the aspect factorization to LSI.

- 2005-03-09
- Probabilistic constraints on the theme-document contribution matrix to prevent overfitting.
- Canny's Gamma-Possion (GaP) model.
- Basic definition of Blei and Jordan's latent Dirichlet model.

- 2005-03-15
- The Web as a text corpus plus many other features.
- Hyperlinks and other markups, sites, IP addresses, etc.
- Social network analysis and notions of prestige.
- HITS and Pagerank algorithms.

- 2005-03-16
- Recap of HITS vs. Pagerank, the connection between HITS and SVD.
- Definition and significance of the eigengap wrt convergence.
- Sensitivity of HITS to random erasures of nodes and edges (Ng, Zheng and Jordan).
- Sensitivity of HITS to common Web authoring artifacts.
- Some lecture notes.

- 2005-03-22
- Recap of HITS sensitivity/stability.
- Pagerank stability (Ng, Zheng, Jordan).
- Stochastic HITS and SALSA (Lempel and Moran)

- 2005-03-23
- Topic-sensitive pagerank (Haveliwala)
- Per-word pagerank (Richardson and Domingos)
- ObjectRank (Balmin et al.)

- 2005-03-29
- Fagin's rank-merging TA algorithm
- Top-k using TA and bounds
- Probabilistic improvement of bounds (Weikum et al.)
- Application to ObjectRank

- 2005-03-30
- Measuring decay on the Web using absorbing random walks
- Discounting pagerank for staleness of outlinks
- The preferential attachment model of Barabasi and Albert

- 2005-04-05
- The winners don't take all paper
- Quiz-1

- 2005-04-06
- Other properties of social networks
- Generating synthetic social networks---RMAT
- Sampling the Web graph

- 2005-04-12
- Some overview comments about record extraction, record field extraction, and entity resolution
- Course roundup

- 2005-04-13
- Review of quadratic optimization for Support Vector Machines
- Platt's sequential minimum optimization (SMO) algorithm
- Keerthy and DeCoste's fast primal SVM algorithm with square penalty on slack

- 2005-04-29