Mining the Web: Additional readings

Here I will post comments and additional readings, organized by chapter of the book, and propose new sections and chapters.

Chapter 1, Introduction

General additional reading:

Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval. Pearson Education.
Witten, I. H., Moffat, A., and Bell, T. C. (1999). Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann.
Manning, C. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.
Grossman, D. A. and Frieder, O. (1998). Information Retrieval: Algorithms and Heuristics. Kluwer.

Additional open-source crawlers: Archive.org's crawler, UbiCrawler. Also see this list of crawlers written in Java.

The first edition has no discussion of maintaining crawls and keeping them fresh. These papers should be discussed:

Blelloch papers on index compression via document and term ID assignment
Text search support in XML: ELIXIR, XIRQL, XRank, TeXQuery.
Text search support in graphs: ObjectRank, BANKS. (Problem: need to introduce PageRank before this treatment.)
Text search support in relational data: BANKS, DISCOVER, DBXplorer.
Fast update scenarios in hybrid text and relational data, e.g., e-commerce
Some comments on Lucene
Savoy paper

Corpus models: mixture, aspect, latent Dirichlet, GaP
Simpler treatment of EM, more discussion on pitfalls
NMF with square loss and divergence loss.
Non-negative matrix factorization for document clustering. Also see this paper.
Canny's GaP model.
The overfitting and identifiability problems of the aspect model as well as Dirichlet models need to be demonstrated.
The canopies technique
A notable new paper to add is the Dhillon et al. paper on entropy-based co-clustering.
Selim, S. Z. and Ismail, M. A. (1984). K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality. IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. PAMI-6, No. 1. (Proof of local convergence of k-means.)
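
To make the local-convergence point for k-means concrete, here is a minimal sketch, not taken from any of the papers above. It runs plain Lloyd-style k-means and records the sum of squared distances after every iteration; under the assumptions stated in the comments, that sequence never increases. The names `sq_dist` and `kmeans`, the seeding scheme, and the iteration count are all illustrative choices of mine.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd-style k-means (illustrative sketch).

    Returns (centers, labels, history), where history[t] is the sum of
    squared distances from each point to its assigned center after
    iteration t.
    """
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct input points.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = []
    history = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center, which
        # cannot increase the objective for the current centers.
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        # Update step: each non-empty cluster's center moves to the mean
        # of its members, which minimizes squared error for fixed labels.
        # An empty cluster keeps its old center.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
        history.append(
            sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
        )
    return centers, labels, history
```

Both steps can only lower or preserve the objective, which is the intuition behind the convergence result; the fixed point reached is in general only a local optimum and depends on the seeding.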

More basic material on regression
Classification as parametric models estimated by optimizing loss.
Allocate more space for logistic regression, maxent and SVMs. Also include maxent with box/inequality constraints.
Several papers coauthored by Yiming Yang make comprehensive comparative assessments of different text classifiers: Comparing the regularization term on SVM, LR, etc., and Modified/regularized logistic regression
Move the semi-supervised EM (CMU paper) to this chapter. Add plenty of sample experimental results for EM, and show that the joint probability often has little control over the loss. Get into CEM a bit to show what is non-trivial.
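
As an illustration of "classification as a parametric model estimated by optimizing a loss", here is a small sketch of L2-regularized logistic regression fit by batch gradient descent. It is my own toy example, not the method of any paper listed above; the function names, the choice to leave the bias unregularized, and the default `l2`, `lr`, and `epochs` values are assumptions made for the example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    """P(y = 1 | x) under the logistic model with weights w and bias b."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def train_logreg(X, y, l2=0.1, lr=0.5, epochs=500):
    """Minimize mean log loss + (l2 / 2) * ||w||^2 by gradient descent.

    X is a list of feature lists and y a list of 0/1 labels.
    The bias term b is not regularized (an assumption of this sketch).
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [l2 * wj for wj in w]  # gradient of the penalty term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = (predict_proba(w, b, xi) - yi) / n  # d(log loss)/d(score)
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b
```

Swapping the log loss for the hinge loss in the same template gives a linear SVM objective, which is one way to present these classifiers side by side.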

Move Bayesian networks to this chapter. Point out why the query is easy in the case of regular document classification. Discuss priors and parameter uncertainty at some length; see Singh and Greiner. Also discuss binary, ternary, etc. vs. multinomial models, and length issues.
Min-cut and metric labeling formulations for semi-supervised learning.
Markov random fields, relaxation labeling, and HyperClass
Conditional graphical models and applications in IE and NLP. Assorted material on conditional models: from Stanford
PRM, MRN, etc.

The revamped chapter will deal mostly with phenomenological measurements on the Web graph and proposed models, rather than procedures to do things.

More space for the "Winners don't take all" paper
Pandurangan, Raghavan and Upfal paper on PageRank distribution under preferential attachment.
Copying models
Googlearchy models
Spamming models?
Blogs

Algorithmic work like PageRank, HITS, CLEVER, focused crawling, etc. will all move to this chapter.

Storing and compressing the Web graph
Stability and spammability of HITS and PageRank; the Ng-Zheng-Jordan paper.
Domingos and Richardson paper on the network value of customers.
Topic-sensitive PageRank, including combining link and content information in Web search.
How to handle dangling nodes (IBM paper in WWW 2004).
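
For the dangling-node item, here is a minimal sketch of one common convention: the rank held by pages with no out-links is spread uniformly over all pages on every power iteration. I am not claiming this is the approach of the WWW 2004 paper; it is only a baseline to compare against. The function name `pagerank`, the dictionary-based graph encoding, and the default `damping`, `tol`, and `max_iter` values are assumptions of this example.

```python
def pagerank(out_links, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration for PageRank with uniform dangling-node handling.

    out_links maps each node to a list of the nodes it links to.
    Nodes that appear only as link targets, or that map to an empty
    list, are treated as dangling.
    """
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        # Total rank currently sitting on pages with no out-links.
        dangling_mass = sum(rank[u] for u in nodes if not out_links.get(u))
        # Every page receives the teleport share plus an equal share of
        # the dangling mass, so total rank stays equal to 1.
        base = (1.0 - damping) / n + damping * dangling_mass / n
        new_rank = {u: base for u in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
        delta = sum(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank
```

For example, `pagerank({"a": ["b", "c"], "b": ["c"], "c": []})` treats `c` as dangling and returns scores that sum to 1. Dropping the dangling term instead would let rank leak out of the graph on every iteration.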