Focused crawling: The quest for topic-specific portals

TWO FORCES are shaping the future of the web away from generic portals to specific portholes --- search and directory sites targeted to web-mature audiences interested in deep, high-quality information on specific subjects, sites that trawl for the needles and leave the haystacks alone.

One force is the exploding volume of web publication.  The major web crawlers harness dozens of powerful processors and hundreds of gigabytes of storage using superbly crafted software, and yet cover only 30-40% of the web.  Scaling up the operation may be feasible, but it would be of little use: already, the sheer diversity of content at a generic search site snares all but the most carefully crafted queries in irrelevant results.

The second force is the growing mass of savvy users who use the web for serious research.  Although keyword search and directory browsing are still essential services, they are no longer sufficient for these users.  They need tools for ad hoc search, profile-based background search, find-similar search, and classification and clustering over dynamic views of narrow sections of the web.

There is growing consensus that small is beautiful: as the web grows, a large number of portals, each specializing deeply in a topical area, will be preferable to universal, one-size-fits-all search portals.  Recent articles bear testimony to this imminent trend.

Jim Hake, founder of the annual Global Information Infrastructure Awards to Web sites, said the emergence of portholes will be one of the major Internet trends of 1999. 

Can focused portals be constructed automatically?

In this project I am investigating how to build a focused portal automatically, starting from a handful of example pages on a specific topic, while spending minimal crawling time and storage on irrelevant and/or low-quality regions of the web.  I envisage that crawling will become a lightweight activity at the level of individuals and interest groups.  There will be hundreds of thousands of focused crawlers covering diverse areas of web information, tailored to the needs of specific communities.
The prototype focused crawler that I have built is guided by two text mining systems.  One is an automatic hypertext topic classifier that stamps each document with a relevance score, reflecting how interesting the page is given the topic of the crawl.  The other continually updates a resource rating: an estimate of the benefit of crawling out from the page's out-links.
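To make the division of labor concrete, here is a minimal sketch of such a crawl loop, assuming hypothetical fetch(), relevance_of(), and resource_rating_of() hooks in place of the real fetcher, classifier, and rating components; it illustrates the idea of relevance-guided frontier expansion rather than the actual Focus implementation.

    import heapq

    def focused_crawl(seed_urls, fetch, relevance_of, resource_rating_of,
                      relevance_threshold=0.5, max_pages=3000):
        """Crawl outward from seed_urls, expanding links from highly rated pages first."""
        # Frontier is a max-priority queue (priorities negated for heapq).
        frontier = [(-1.0, url) for url in seed_urls]
        heapq.heapify(frontier)
        seen, harvested = set(seed_urls), []

        while frontier and len(harvested) < max_pages:
            _, url = heapq.heappop(frontier)
            page = fetch(url)                    # expected to return (text, out_links) or None
            if page is None:
                continue
            text, out_links = page

            score = relevance_of(text)           # topic classifier's relevance score in [0, 1]
            if score >= relevance_threshold:
                harvested.append(url)

            # Estimated benefit of expanding this page's out-links,
            # updated continually as the crawl proceeds.
            benefit = resource_rating_of(url, score, out_links)
            for link in out_links:
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-benefit, link))

        return harvested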
The main challenge is to ensure a high harvest rate: the fraction of page fetches that are relevant to the user's interest.  Without the hypertext classifier, an ordinary crawler has no hope of staying on track.  We started an ordinary crawler from a set of 50 bicycling pages, and within the first few hundred page fetches it was completely lost in web terrain having nothing to do with bicycling.  The rapidly decaying plot of relevance against time shows that harvesting relevant content on the web is non-trivial.
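The harvest rate itself is easy to monitor during a crawl.  The small helper below is a sketch rather than part of the Focus system: it tracks the fraction of relevant fetches over a sliding window (the window size is an arbitrary choice), so that decay or stability over time becomes visible.

    from collections import deque

    def harvest_rate(relevant_flags, window=500):
        """Yield (fetch_number, fraction of the last `window` fetches that were relevant)."""
        recent = deque(maxlen=window)
        for i, is_relevant in enumerate(relevant_flags, start=1):
            recent.append(1 if is_relevant else 0)
            yield i, sum(recent) / len(recent)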
It is crucial that the harvest rate of the focused crawler be high; otherwise it would be easier to crawl the whole web and bucket the results into topics as a post-processing step.  To our delight, our prototype, starting from the same 50 URLs, kept up a healthy harvest rate, collecting relevant pages almost half the time.  Within one hour on a small desktop PC, it trawled over 3000 pages relevant to bicycling, without depending on a general crawl and index of the web or on an existing search service.
The resource rating system runs concurrently with the crawler to identify a few strategic nodes in the web graph from which a large number of relevant sites can be reached within a few links.  The reader is encouraged to check out the top choices for bicycling: 
http://www.truesport.com/Bike/links.htm
http://reality.sgi.com/billh_hampton/jrvs/links.html
http://www.acs.ucalgary.ca/~bentley/mark_links.html
http://www.cascade.org/links.html
http://www.bikeride.com/links/road_racing.asp
http://www.htcomp.net/gladu/'drome/
http://www.tbra.org/links.shtml
Keyword searches could very easily miss these great pages.  The starting set of URLs consisted of the pages Alta Vista returned in response to the query "bicycling", together with all pages one link away from those responses.  In spite of this distance-one expansion, most of the top 100 strategic nodes turned out to be quite far from the starting set, up to 12 links away.  An ordinary crawler would encounter millions of pages within this radius, most of them having nothing to do with bicycling!  The graph also shows that the start set was not very favorable, and that the focused crawler had to do non-trivial work to distill the sites above.
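As a rough illustration of what the resource rating is after, the sketch below scores each crawled page by how many distinct relevant pages it links to and reports its shortest link distance from the start set.  The crude hub count and breadth-first distance here are stand-ins of my own, not the actual distiller, which is considerably more sophisticated (see the WWW8 paper below).

    from collections import deque

    def strategic_nodes(link_graph, relevant, seeds, top_k=100):
        """link_graph: {url: [out_links]}; relevant: set of relevant URLs; seeds: start set."""
        # Shortest link distance from any seed page, by breadth-first search.
        dist = {s: 0 for s in seeds}
        queue = deque(seeds)
        while queue:
            u = queue.popleft()
            for v in link_graph.get(u, []):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)

        # Crude hub score: number of distinct relevant out-neighbours.
        scored = []
        for u, out_links in link_graph.items():
            score = sum(1 for v in set(out_links) if v in relevant)
            scored.append((u, score, dist.get(u, float("inf"))))
        scored.sort(key=lambda t: t[1], reverse=True)
        return scored[:top_k]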
Similar results were obtained across diverse topics, such as AIDS/HIV, mutual funds, gardening, and pharmaceuticals.

Papers and sites related to the Focus project

  • How can the surfing, searching, and bookmarking actions of a community of users naturally feed input to a focused crawling system?  This is the goal of the Memex project.
  • The first paper describing the system and our experiences with it appeared at the WWW8 conference in 1999.
  • Another paper, describing the details of implementing the focused crawler using off-the-shelf databases and storage managers, appeared at VLDB 1999.