Software and Utilities
Watch this space for our upcoming release of various useful
software modules for Web and data mining research. In particular,
the following are available on request under something like
- NetRank and
HubRank, companion packages for
learning to rank networked entities, and for indexing for, and fast
execution of, graph conductance queries.
See the NetRank and HubRank project page for details.
- IR4QA, a system for type-aware
text indexing and search that lets you ask queries of the form
"find an instance of mammal near the tokens sleep
and feet" or "find an instance of distance
near tokens Paris and Rome". See our
paper for details.
- iVia Nalanda,
a simple, configurable, highly scalable crawler written in ANSI
C++. Unlike w3c-libwww, we have taken care to overlap DNS lookups
with HTTP transfers and handle multiple DNS servers. We typically run
the crawler with three caching DNS servers and 50 to 100 concurrent HTTP
fetches on a dedicated 1 Mbps line; we believe we can easily saturate a
5 Mbps line with this code. You write an application by extending the
base crawler class and plugging in various virtual methods.
See our WWW 2002
paper. (Joint work with Kunal Punera and the iVia team.)
Update: you can download the code from
the iVia Web site.
- HyParSuite, a small, efficient, reliable hypertext
cleaner, parser and DOM-tree generator, written in ANSI C++. A unique
feature of HyParSuite is that you can configure the cleaning rules by
setting up a simple table of rules. In our recent research we have
found HyParSuite to be more usable overall than
libxml or Tidy. See our paper
in the Data Engineering Bulletin, page 34.
(Joint work with Ravindra Jaju.)
- DEPOT, an acronym for
Distributed Ensemble of Pages that is Outage Tolerant.
DEPOT is public-domain software that gives the abstraction of
a distributed RAID on dirt-cheap PCs for HTML pages that crawlers
need to store compressed for future processing.
Written by Jeetendra Mirchandani under my guidance.
- SIMPL (Simple Iterated Multiple
Projection on Lines): a new classifier suitable for text
and possibly other sparse, high-dimensional data. SIMPL uses only
sequential scans over the training data, takes time comparable to
naive-Bayes classifiers, but approaches the accuracy of linear
Support Vector Machines for many text classification benchmarks.
See our VLDB 2002 paper on SIMPL
and our binary download.
A source download will follow.
(Joint work with Shourya Roy and Mahesh Soundalgekar.)
- Bits and pieces of the Memex system;
see our demo
at VLDB 2000.
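
The crawler entry above says an application is written by extending the base crawler class and plugging in virtual methods. The following is only a rough sketch of that pattern, not the actual iVia Nalanda API: the names BaseCrawler, Page, fetch, shouldVisit, onPage, SiteCrawler and crawlToyWeb are all hypothetical, the "Web" is an in-memory map, and fetches run sequentially rather than overlapping DNS lookups with HTTP transfers. It uses C++11 for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical page record; not a real Nalanda type.
struct Page {
    std::string url;
    std::vector<std::string> outlinks;
};

// Hypothetical base class: owns the frontier and exposes virtual hooks.
class BaseCrawler {
public:
    virtual ~BaseCrawler() {}
    void seed(const std::string& url) { enqueue(url); }

    // Drains the frontier, fetching at most maxPages pages.
    // This sketch fetches one page at a time; it does not model concurrency.
    std::size_t run(std::size_t maxPages) {
        std::size_t fetched = 0;
        while (!frontier_.empty() && fetched < maxPages) {
            std::string url = frontier_.front();
            frontier_.pop_front();
            Page page;
            if (!fetch(url, page)) continue;   // skip failed fetches
            ++fetched;
            onPage(page);                      // application-specific processing
            for (const std::string& link : page.outlinks)
                if (shouldVisit(link)) enqueue(link);
        }
        return fetched;
    }

protected:
    // Hooks an application overrides (assumed names).
    virtual bool fetch(const std::string& url, Page& out) = 0;
    virtual bool shouldVisit(const std::string&) { return true; }
    virtual void onPage(const Page&) {}

private:
    // Adds a URL to the frontier only the first time it is seen.
    void enqueue(const std::string& url) {
        if (seen_.insert(url).second) frontier_.push_back(url);
    }
    std::deque<std::string> frontier_;
    std::set<std::string> seen_;
};

typedef std::map<std::string, std::vector<std::string> > ToyWeb;

// Example application: stays within one URL prefix and records visited pages.
class SiteCrawler : public BaseCrawler {
public:
    SiteCrawler(const ToyWeb& web, const std::string& prefix)
        : web_(web), prefix_(prefix) {}
    std::vector<std::string> visited;

protected:
    bool fetch(const std::string& url, Page& out) override {
        ToyWeb::const_iterator it = web_.find(url);
        if (it == web_.end()) return false;
        out.url = url;
        out.outlinks = it->second;
        return true;
    }
    bool shouldVisit(const std::string& url) override {
        return url.compare(0, prefix_.size(), prefix_) == 0;
    }
    void onPage(const Page& page) override { visited.push_back(page.url); }

private:
    ToyWeb web_;
    std::string prefix_;
};

// Crawls a three-page toy Web and returns the URLs visited, in order.
std::vector<std::string> crawlToyWeb() {
    ToyWeb web;
    web["http://a.example/"] = {"http://a.example/x", "http://b.example/"};
    web["http://a.example/x"] = {"http://a.example/"};
    web["http://b.example/"] = {};
    SiteCrawler crawler(web, "http://a.example/");
    crawler.seed("http://a.example/");
    std::size_t fetched = crawler.run(100);
    assert(fetched == crawler.visited.size());
    return crawler.visited;
}
```

In this sketch the base class keeps control of the frontier and duplicate detection, so an application only decides how to fetch, which links to follow, and what to do with each page.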