Mining the Web: Additional readings

Soumen Chakrabarti

Here I will post comments and additional readings, organized by chapter of the book, and propose new sections and chapters.

Chapter 1, Introduction

General additional reading:

Chapter 2, Crawling and monitoring the Web

Additional open-source crawlers: Archive.org's crawler, UbiCrawler. Also see this list of crawlers written in Java.

The first edition has no discussion of maintaining crawls and keeping them fresh. These papers should be discussed:

Chapter 3, Indexing and search

Chapter 4, Similarity and clustering

Chapter 5, Supervised learning from feature vectors

Chapter 6, Learning graphical models (formerly Semi-supervised learning)

Chapter 7, The Web as an evolving social network (formerly Social network analysis)

The revamped chapter will deal mostly with phenomenological measurements on the Web graph and proposed models of it, rather than with algorithmic procedures.

Chapter 8, Algorithms for analyzing the Web graph (formerly Resource discovery)

Algorithmic work such as PageRank, HITS, CLEVER, and focused crawling will all move to this chapter.

Chapter 9, Language processing/ankle-deep semantics (formerly The future of Web mining)