Soumen Chakrabarti
Here I will post comments and additional readings organized by
chapters in the book, or propose new sections and chapters.
Chapter 1, Introduction
General additional reading:
- Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information
Retrieval. Pearson Education.
- Witten, I. H., Moffat, A., and Bell, T. C. (1999). Managing
Gigabytes: Compressing and Indexing Documents and
Images. Morgan Kaufmann.
- Manning, C. and Schütze, H. (1999). Foundations of
Statistical Natural Language Processing. MIT Press.
- Grossman, D. A. and Frieder, O. (1998). Information Retrieval:
Algorithms and Heuristics. Kluwer.
Chapter 2, Crawling and monitoring the Web
Additional open-source crawlers: Archive.org's crawler, UbiCrawler.
Also see this list of crawlers written in Java.
The first edition has no discussion of maintaining
crawls and keeping them fresh. These papers should be discussed:
- Brewington and Cybenko
- Squillante et al.
- Tomlin et al.
- Cho, Olston, Pandey, Ntoulas
Chapter 3, Indexing and search
- Blelloch papers on index compression via document and term ID
assignment
- Text search support in XML: ELIXIR, XIRQL, XRank, TeXQuery.
- Text search support in graphs: ObjectRank, BANKS.
(Problem: PageRank needs to be introduced before this treatment.)
- Text search support in relational data: BANKS, DISCOVER, DBXplorer.
- Fast update scenarios in hybrid text and relational data, e.g.,
e-commerce
- Some comments on Lucene
- Savoy paper
Chapter 4, Similarity and clustering
- Corpus models: mixture, aspect, latent Dirichlet, GaP
- Simpler treatment of EM, more discussion on pitfalls
- NMF with square loss and divergence loss.
- Non-negative matrix factorization for document clustering.
Also see this paper.
- Canny's GaP model.
- The overfitting and identifiability problems of the aspect model
and of Dirichlet models need to be demonstrated
- The canopies technique
- A notable new paper to add is the Dhillon et al. paper on
entropy-based co-clustering.
- S. Z. Selim, M. A. Ismail, K-Means-Type Algorithms: A
Generalized Convergence Theorem and Characterization of
Local Optimality, IEEE Trans. Pattern Analysis and
Machine Intelligence, Vol. PAMI-6, No. 1, 1984.
(proof of local convergence of k-means).
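The local-convergence property mentioned in the last item can be illustrated with a minimal sketch of the standard k-means iteration (Lloyd's algorithm), assuming squared Euclidean distance and points given as tuples. The function names, the seed, and the iteration count are illustrative choices, not taken from the cited paper. The sketch records the objective after every iteration; it should never increase, which is the behaviour the convergence argument rests on.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; returns centers, labels, and the objective per iteration."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    labels = []
    history = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
        # Objective: total squared distance of points to their assigned centers.
        history.append(sum(sq_dist(p, centers[l])
                           for p, l in zip(points, labels)))
    return centers, labels, history

if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centers, labels, history = kmeans(data, k=2)
    print(labels)
    print(history)
```

Neither step can raise the objective, so the sequence is monotone; it only guarantees a local optimum, which depends on the initial centers.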
Chapter 5, Supervised learning from feature vectors
- More basic material on regression
- Classification as parametric models estimated by optimizing loss.
- Allocate more space for logistic regression, maxent, and SVMs.
Also include maxent with box/inequality constraints.
- Several papers coauthored by Yiming Yang make comprehensive
comparative assessments of different text classifiers:
Comparing the regularization term on SVM, LR, etc., and
Modified/regularized logistic regression
- Move the semi-supervised EM (CMU paper) to this chapter. Add plenty
of sample experimental results for EM, and show that the joint
probability often has little control over the loss. Get into CEM a bit
to show what's non-trivial.
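The view of classification as a parametric model estimated by optimizing a loss, with a regularization term, can be sketched with L2-regularized logistic regression trained by batch gradient descent. This is a minimal illustration under assumed settings: the function names, the step size, the penalty weight, and the toy data are hypothetical and do not come from any of the papers listed above.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logreg(X, y, l2=0.1, lr=0.1, epochs=200):
    """Minimize mean log loss + (l2 / 2) * ||w||^2 by batch gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [l2 * wj for wj in w]  # gradient of the penalty term
        grad_b = 0.0                    # the bias is not regularized
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return 1 if the estimated P(y = 1 | x) is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

if __name__ == "__main__":
    X = [[-2, -1], [-1, -2], [-1, -1], [1, 1], [2, 1], [1, 2]]
    y = [0, 0, 0, 1, 1, 1]
    w, b = train_logreg(X, y)
    print(w, b, [predict(w, b, x) for x in X])
```

Swapping the log loss for a hinge loss while keeping the same penalty gives a linear SVM objective, which is one way to frame the comparison of regularization terms across classifiers.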
Chapter 6, Semi-supervised learning
Learning graphical models
- Move Bayesian networks to this chapter. Point out why the query
is easy in case of regular document classification. Discuss priors
and parameter uncertainty at some length, Singh and Greiner. Also
discuss binary, ternary etc. vs. multinomial and length issues.
- Min-cut and metric labeling formulations for semi-supervised
learning.
- Markov random fields, relaxation labeling, and HyperClass
- Conditional graphical models and applications in IE and NLP.
Assorted material on conditional models from Stanford.
- PRM, MRN, etc.
Chapter 7, The Web as an evolving Social network
analysis
The revamped chapter will deal mostly with phenomenological
measurements on the Web graph and proposed models, rather than
procedures to do things.
- More space for the "winners don't take all" paper
- Pandurangan, Raghavan and Upfal paper on PageRank distribution
under preferential attachment.
- Copying models
- Googlearchy models
- Spamming models?
- Blogs
Chapter 8, Resource discovery
Algorithms for analyzing the Web graph
Algorithmic work such as PageRank, HITS, CLEVER, and focused crawling
will all move to this chapter.
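As a reminder of the kind of procedure this chapter would collect, here is a minimal sketch of PageRank computed by power iteration on a small adjacency-list graph. The damping factor of 0.85, the fixed iteration count, and the uniform redistribution of rank from pages without out-links are common conventions assumed here, not choices stated in the book.

```python
def pagerank(links, damping=0.85, iters=50):
    """Power iteration for PageRank.

    links maps each page to the list of pages it links to.
    Pages with no out-links spread their rank uniformly over all pages.
    """
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # Every page receives the teleportation share first.
        new = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:
                share = damping * rank[u] / n
                for v in nodes:
                    new[v] += share
        rank = new
    return rank

if __name__ == "__main__":
    toy_graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
    for page, score in sorted(pagerank(toy_graph).items()):
        print(page, round(score, 4))
```

In the toy graph, page "d" has no in-links and so keeps only the teleportation share, while "a" collects rank from both "c" and "d".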
Chapter 9, The future of Web mining
Language processing/ankle-deep semantics
- Chunking, POS tagging, IE, and shallow NLP
- NLP resources: FrameNet, WordNet, Penn Treebank, SensEval2, OpenNLP.
- WSD and graphical models, pseudowords for WSD
- Language modeling and applications
- Question answering
- Bootstrapping ontologies, PMI
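For the PMI item, a short sketch of pointwise mutual information computed from co-occurrence counts may help fix the definition: PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ). The function name and the counts in the example are hypothetical; in a bootstrapping setting the counts would presumably come from corpus or search statistics, which this sketch does not model.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information in bits, from raw counts.

    count_xy: number of contexts containing both x and y
    count_x, count_y: number of contexts containing x, respectively y
    total: total number of contexts
    """
    if count_xy == 0:
        return float("-inf")  # the pair never co-occurs
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

if __name__ == "__main__":
    # Hypothetical counts: the pair co-occurs five times more often than
    # independence would predict, so the PMI is positive.
    print(pmi(count_xy=50, count_x=100, count_y=100, total=1000))
```

A value near zero means the two terms co-occur about as often as independence predicts; large positive values suggest an association worth keeping when growing a term list.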