Software and Utilities

Links to useful systems:

The following local projects are ancient, but may be available on request under something like GPL or LGPL.

NetRank and HubRank, companion packages for learning to rank networked entities and for indexing and fast execution of graph conductance queries. See the NetRank and HubRank project page for details.
IR4QA, a system for type-aware text indexing and search that lets you ask queries of the form "find an instance of mammal near the tokens sleep and feet" or "find an instance of distance near tokens Paris and Rome". See our WWW 2006 paper for details.
iVia Nalanda, a simple, configurable, highly scalable crawler written in ANSI C++. Unlike w3c-libwww, we have taken care to overlap DNS lookups with HTTP transfers and to handle multiple DNS servers. We typically run the crawler with three caching DNS servers and 50 to 100 concurrent HTTP fetches on a dedicated 1 Mbps line; we believe we can easily saturate a 5 Mbps line with this code. You write an application by extending the base crawler class and overriding its virtual methods. See our WWW 2002 paper. (Joint work with Kunal Punera and the iVia team.) Update: you can download the code from the iVia Web site.
HyParSuite, a Hypertext Parsing Suite. This is a small, efficient, reliable hypertext cleaner, parser and DOM-tree generator, written in ANSI C++. A unique feature of HyParSuite is that you configure the cleaning behavior by setting up a simple table of rules. In our recent research we have found HyParSuite to be more usable overall than libxml or Tidy. See our paper in the Data Engineering Bulletin, page 34. (Joint work with Ravindra Jaju.)
DEPOT, an acronym for Distributed Ensemble of Pages that is Outage Tolerant. DEPOT is public-domain software that gives the abstraction of a distributed RAID on dirt-cheap PCs for HTML pages that crawlers need to store compressed for future processing. Written by Jeetendra Mirchandani under my guidance.
SIMPL (Simple Iterated Multiple Projection on Lines): a new classifier suitable for text and possibly other sparse, high-dimensional data. SIMPL uses only sequential scans over the training data, takes time comparable to naive-Bayes classifiers, but approaches the accuracy of linear Support Vector Machines for many text classification benchmarks. See our VLDB 2002 paper on SIMPL, our binary download, and sample data. A source download will follow. (Joint work with Shourya Roy and Mahesh Soundalgekar.)
Bits and pieces of the Memex system; see our demo at VLDB 2000.
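
The iVia Nalanda entry above says that an application is written by extending the base crawler class and supplying virtual methods. The sketch below illustrates that general pattern only. Every name in it (BaseCrawler, shouldFollow, onPageFetched, SameSiteCrawler, demoCrawl) is hypothetical and is not taken from the actual iVia Nalanda code. Fetching is replaced by an in-memory map so the sketch runs offline, the overlapped DNS and HTTP handling is not modeled, and C++11 is used for brevity although the real crawler is described as ANSI C++.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical base class; NOT the real iVia Nalanda API.
// It owns the frontier and calls virtual hooks that applications override.
class BaseCrawler {
public:
    virtual ~BaseCrawler() = default;

    void addSeed(const std::string& url) { enqueue(url); }

    // Drains the frontier. The "web" argument stands in for real HTTP
    // fetches: it maps each URL to the out-links found on that page.
    std::size_t run(const std::map<std::string, std::vector<std::string>>& web) {
        std::size_t visited = 0;
        while (!frontier_.empty()) {
            const std::string url = frontier_.front();
            frontier_.pop_front();
            const auto it = web.find(url);
            if (it == web.end()) continue;  // simulated fetch failure
            ++visited;
            onPageFetched(url, it->second);
            for (const auto& link : it->second) {
                if (shouldFollow(link)) enqueue(link);
            }
        }
        return visited;
    }

protected:
    // Hook: decide whether a discovered link joins the frontier.
    virtual bool shouldFollow(const std::string& /*url*/) { return true; }
    // Hook: called once for each successfully "fetched" page.
    virtual void onPageFetched(const std::string& /*url*/,
                               const std::vector<std::string>& /*outlinks*/) {}

private:
    // Adds a URL to the frontier unless it has been seen before.
    void enqueue(const std::string& url) {
        if (seen_.insert(url).second) frontier_.push_back(url);
    }
    std::deque<std::string> frontier_;
    std::set<std::string> seen_;
};

// Example application: stay under one URL prefix and record pages visited.
class SameSiteCrawler : public BaseCrawler {
public:
    explicit SameSiteCrawler(std::string prefix) : prefix_(std::move(prefix)) {}
    const std::vector<std::string>& pages() const { return pages_; }

protected:
    bool shouldFollow(const std::string& url) override {
        return url.compare(0, prefix_.size(), prefix_) == 0;
    }
    void onPageFetched(const std::string& url,
                       const std::vector<std::string>&) override {
        pages_.push_back(url);
    }

private:
    std::string prefix_;
    std::vector<std::string> pages_;
};

// Runs the example crawler over a tiny three-page toy web.
std::vector<std::string> demoCrawl() {
    const std::map<std::string, std::vector<std::string>> web = {
        {"http://a.example/", {"http://a.example/x", "http://b.example/"}},
        {"http://a.example/x", {"http://a.example/"}},
        {"http://b.example/", {}},
    };
    SameSiteCrawler crawler("http://a.example/");
    crawler.addSeed("http://a.example/");
    crawler.run(web);
    return crawler.pages();
}
```

Calling demoCrawl() visits the two pages under http://a.example/ and skips the off-site link. Keeping the frontier logic in the base class and the policy in overridable hooks means an application only states what to follow and what to do with each page, which is the division of labor the crawler description suggests.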