# Using graphs in unstructured and semistructured data mining (Annotated reading list for ADFOCS 2004)

## Soumen Chakrabarti

Under construction; I will add links as we go along. Each citation has a public URL which may end in a dead link over time, so I have tried to add a link to a private copy as well; obviously if only this page is copied over somewhere, the private links will break.

## Measuring real-life graphs

The 3xFaloutsos paper was an early study, starting off many power-law papers.

The bow-tie paper characterized large scale structures in the Web graph.

Estimates of the diameter of the Web graph.

## Proposing and analyzing generative models

If you are (very) mathematically inclined here are two superhubs to check: a course page by Alan Frieze, and another one by David Aldous.

Barabasi and others wrote the initial preferential attachment/rich-gets-richer/power law degree distribution with power=3; this is kind of inflexible. Meanwhile Web data has power closer to 2.

Kumar and others proposed and analyzed the model of copying links from a refernce node, and offered the scope for smooth tuning of the power.

Pennock and others proposed a linear splice between preferential attachment and uniform attachment, also giving a way to tune the power to observed data.

## Compressing large graphs

Adler and Mitzenmacher give a simple algorithm for compressing graphs using the reference idea, and many negative (hardness) results for improving upon it.

Nevertheless, a group in Italy has written a great Java package called WebGraph to compress Web graphs down to only about 2.58 bits per link.

## Sampling from the Web graph

Henzinger and others discuss a practical way of sampling from the Web graph without using backlinks, although this means there can be some initial bias.

Bar-Yossef and others use a backlink and self-loop trick to turn the Web graph into an undirected, regular graph, so that we can sample directly from a walk on the transformed graph.

We applied a Bar-Yossef-type algorithm to study broad topic communities on the Web.

Kephart and White analyze some special cases of virus-spreading in Erdos-Renyi graphs.

Wang and others give a general survivability analyses for arbitrary graphs.

## Pagerank and enhancements

The unpublished "Maxwell's equation" paper from Page, Brin and others.

Tomlin suggested that Pagerank is only solution to a larger class of constrained flow problems.

A number of followup papers/projects biased the jump and/or walk steps to treat pages differently.

Richardson and Domingos proposed the "intelligent surfer" (as against the "random surfer").

Haveliwala avoids the overhead of a per-word pagerank computation by proposing a coarsening to a topic-specific pagerank.

Bar-Yossef et al. proposed an absorbing random walk to score a node for freshness/staleness of pages.

## Conductance/resistance-based graph search

There's also been a proliferation of papers which apply Pagerank-style ideas for doing keyword searches in graphs derived from relational or XML data sources.

ObjectRank uses ideas very similar to Richardson and Domingos, but on (semi-)structured data graphs rather than the Web graph.

Faloutsos, McCurley