CS610 Hypertext retrieval and mining
- Soumen Chakrabarti
- Vijay Krishnan
S9 CSE SIT 3rd floor classroom
- Slot 8, but 8:45--10:15pm Tue, Wed
Reference books and other course material
- Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information
Retrieval. Pearson Education.
- Chakrabarti, S. (2002). Mining the Web: Discovering knowledge from
hypertext data. Morgan-Kaufman.
- Grossman, D. A. and Frieder, O. (1998). Information Retrieval:
Algorithms and Heuristics. Kluwer.
- Witten, I. H., Moffat, A., and Bell, T. C. (1999). Managing
Gigabytes: Compressing and Indexing Documents and
- Slides from a Stanford course
by Manning, Raghavan and Schutze.
- Material from ADFOCS 2004.
Keep watching the spreadsheet until grades are
finalized. Sometimes this spreadsheet will be in an inconsistent
state; don't panic (easily). There will be one midterm (20 marks),
one quiz (20 marks), a final exam (40 marks), and a class project (20
marks). Regular students are expected to participate in all these
evaluations. Each audit student has to either submit a project report
or write the final exam.
Here are some rough guidelines for project evaluation
(total, 20 marks):
- Clear understanding of the goals of the project.
- Adequate evidence of background reading
on the subject. (3 marks)
- In case of an experimental project, completing the implementation of
a non-trivial and working system. In case of a theoretical project,
explaining to me the important concepts
in the papers read. (5 marks)
- Telling me something novel that I don't know, in terms of
either experimental or theoretical insight. (3 marks)
- Quality of writeups: slides and report with results and/or
experience. (4 marks)
- Regular contact and discussion about the project,
and timely submission of results. (2 marks)
Check out some proposed projects.
Ideally you should do them in teams of two (max three), and I will try to
find enough projects to go around.
I hope this will help you while revising the material.
are shown in color.
Canceled classes are struck out. Lectures are 1.5h unless
mentioned otherwise. Some material to read are inlined in the
schedule, but there are others you can find on the Web.
- 2005-01-04 Tue 1h
- 2005-01-07 Fri 1h
- IR basics
- Inverted indexes
- Dictionary and posting
- 2005-01-11 Tue
- Positional indices and proximity queries
- Index compression and updates
- 2005-01-1? ???
- Boolean query processing
- Vector space model
- 2005-01-17 Mon
- Ranking in the vector space model
- Top-k ranking
- Min-hash and Jaccard
- 2005-01-18 Tue
- Canceled owing to LiveWire
- 2005-01-26 Tue
- Discussion of projects
- Supervised learning for text applications
- Clases, features, training, testing
- 2005-02-02 Wed
- Learning a classifier as optimizing loss
- Parametric models, differentiable loss functions
- Square loss and linear least square fit
- Overfitting, ridge penalty and lasso
- Binomial deviance loss and logistic regression
- Some partial reading material can be found here
- 2005-02-03 Thu
- Maximizing log likelihood
- Conditional and joint distributions
- Logistic regression
- Multivariate binary naive Bayes
- Parameter smoothing, Laplace correction
- Linear discriminant
- Perceptron and margin maximization
- Locally weighted regression and kernels
- Generative model for unsupervised or semisupervised learning
- The EM algorithm
view of semisupervised learning
- Mincost coloring and relaxation labeling views of semisupervised
parametric semisupervised learning approach
- Limitations of basic multinomial mixture model.
- Inherent multi-topic nature of documents.
- Crude marginal distributions for naive models
and how to improve them.
- Complete the
and Karger paper.
- Plate diagrams for multi-topic documents.
- Aspect model.
- Pitfalls of the aspect model (guest lecture by Vijay Krishnan).
- Probabilistic constraints on the theme-document contribution
matrix to prevent overfitting.
- Canny's Gamma-Possion (GaP) model.
- Basic definition of Blei and Jordan's latent Dirichlet model.
- The Web as a text corpus plus many other features.
- Hyperlinks and other markups, sites, IP addresses, etc.
- Social network analysis and notions of prestige.
- HITS and Pagerank algorithms.
- Recap of HITS vs. Pagerank, the connection between HITS and SVD.
- Definition and significance of the eigengap wrt convergence.
- Sensitivity of HITS to random erasures of nodes and edges
(Ng, Zheng and Jordan).
- Sensitivity of HITS to common Web authoring artifacts.
- Recap of HITS sensitivity/stability.
- Pagerank stability (Ng, Zheng, Jordan).
- Stochastic HITS and SALSA (Lempel and Moran)
- Topic-sensitive pagerank (Haveliwala)
- Per-word pagerank (Richardson and Domingos)
- ObjectRank (Balmin et al.)
- Fagin's rank-merging TA algorithm
- Top-k using TA and bounds
- Probabilistic improvement of bounds (Weikum et al.)
- Application to ObjectRank
- Other properties of social networks
- Generating synthetic social networks---RMAT
- Sampling the Web graph
- Some overview comments about record extraction, record field
extraction, and entity resolution
- Course roundup