Using graphs in unstructured and semistructured data mining
(Annotated reading list for ADFOCS 2004)

Soumen Chakrabarti

Under construction; I will add links as we go along. Each citation has a public URL which may end in a dead link over time, so I have tried to add a link to a private copy as well; obviously if only this page is copied over somewhere, the private links will break.

Measuring real-life graphs

The 3xFaloutsos paper was an early study, starting off many power-law papers.

The bow-tie paper characterized large scale structures in the Web graph.

Estimates of the diameter of the Web graph.

Proposing and analyzing generative models

If you are (very) mathematically inclined here are two superhubs to check: a course page by Alan Frieze, and another one by David Aldous.

Barabasi and others wrote the initial preferential attachment/rich-gets-richer/power law degree distribution with power=3; this is kind of inflexible. Meanwhile Web data has power closer to 2.

Kumar and others proposed and analyzed the model of copying links from a refernce node, and offered the scope for smooth tuning of the power.

Pennock and others proposed a linear splice between preferential attachment and uniform attachment, also giving a way to tune the power to observed data.

Compressing large graphs

Adler and Mitzenmacher give a simple algorithm for compressing graphs using the reference idea, and many negative (hardness) results for improving upon it.

Nevertheless, a group in Italy has written a great Java package called WebGraph to compress Web graphs down to only about 2.58 bits per link.

Software for generating and visualizing graphs

Sampling from the Web graph

Henzinger and others discuss a practical way of sampling from the Web graph without using backlinks, although this means there can be some initial bias.

Bar-Yossef and others use a backlink and self-loop trick to turn the Web graph into an undirected, regular graph, so that we can sample directly from a walk on the transformed graph.

We applied a Bar-Yossef-type algorithm to study broad topic communities on the Web.

Epidemics spreading on graphs

Kephart and White analyze some special cases of virus-spreading in Erdos-Renyi graphs.

Wang and others give a general survivability analyses for arbitrary graphs.

Pagerank and enhancements

The unpublished "Maxwell's equation" paper from Page, Brin and others.

Tomlin suggested that Pagerank is only solution to a larger class of constrained flow problems.

A number of followup papers/projects biased the jump and/or walk steps to treat pages differently.

Richardson and Domingos proposed the "intelligent surfer" (as against the "random surfer").

Haveliwala avoids the overhead of a per-word pagerank computation by proposing a coarsening to a topic-specific pagerank.

Bar-Yossef et al. proposed an absorbing random walk to score a node for freshness/staleness of pages.

Conductance/resistance-based graph search

There's also been a proliferation of papers which apply Pagerank-style ideas for doing keyword searches in graphs derived from relational or XML data sources.

ObjectRank uses ideas very similar to Richardson and Domingos, but on (semi-)structured data graphs rather than the Web graph.

Faloutsos, McCurley


HITS vs. Pagerank, stability

Randomized variants of HITS

Markov labeling and applications