This article is the concluding piece of the series on
Web-information management. The first two articles in the series covered the
technologies that powered the first-generation search engines and how
second-generation search engines exploit social-network analysis to mine
relevant information effectively. In this article we discuss
focused crawling, which promises to contribute to our information-foraging
endeavors. We also look at another technology, Memex, which lets you use
your past surfing experience to search for relevant information on the Web.
How focused crawling works
Focused crawling concentrates on the quality of information
and the ease of navigation rather than on the sheer quantity of content on the
Web. A focused crawler seeks, acquires, indexes, and maintains pages on a
specific set of topics that represent a relatively narrow segment of the Web.
A distributed team of focused crawlers, each specializing in one or a few
topics, can thus manage the entire content of the Web.
Rather than collecting and indexing all accessible Web
documents in order to answer every possible ad hoc query, a focused crawler
analyzes its crawl boundary to find the links that are likely to be most
relevant to the crawl, and avoids irrelevant regions of the Web. Focused
crawlers selectively seek out pages relevant to a predefined set of
topics; these pages form a personalized web within the World Wide Web.
Topics are specified to the focus system's console using exemplary
documents and pages rather than keywords.
This way of functioning yields significant savings in
hardware and network resources, yet achieves respectable coverage at a rapid
rate, simply because there is relatively little to do. Each focused crawler is
far more nimble in detecting changes to pages within its focus than a crawler
that covers the entire Web.
The crawler is built upon two hypertext mining programs: a
classifier that evaluates the relevance of a hypertext document with respect to
the focus topics, and a distiller that identifies hypertext nodes that are good
access points to many relevant pages within a few links.
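The core idea of prioritizing the crawl frontier by classifier score can be sketched as follows. This is a minimal illustration, assuming a toy in-memory "web" and a simple keyword-overlap classifier; all page names, topic words, and the relevance threshold are hypothetical, not the actual system described in the article.

```python
import heapq

# Hypothetical in-memory web: page -> (text, outgoing links).
WEB = {
    "seed": ("mountain biking trails guide", ["a", "b"]),
    "a":    ("mountain biking gear reviews", ["c", "d"]),
    "b":    ("celebrity gossip news", ["d"]),
    "c":    ("biking trails map", []),
    "d":    ("stock market report", []),
}

# Toy topic definition; a real classifier would be trained on exemplary pages.
TOPIC = {"mountain", "biking", "trails"}

def classify(text):
    """Toy classifier: fraction of topic words appearing in the page text."""
    return len(set(text.split()) & TOPIC) / len(TOPIC)

def focused_crawl(seed, threshold=0.3):
    """Visit pages in order of estimated relevance, pruning off-topic ones."""
    frontier = [(-1.0, seed)]   # max-heap via negated scores
    visited, relevant = set(), []
    while frontier:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        text, links = WEB[url]
        if classify(text) >= threshold:
            relevant.append(url)
            # Only expand the links of pages judged on-topic.
            for link in links:
                if link not in visited:
                    link_text, _ = WEB[link]
                    heapq.heappush(frontier, (-classify(link_text), link))
    return relevant

print(focused_crawl("seed"))  # → ['seed', 'a', 'c']
```

Note how the crawler reaches the relevant page `c` two links from the seed while never expanding the off-topic pages `b` and `d`, mirroring the prune-while-exploring behavior described above.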
What focused crawlers can do
Here is what we found when we used focused crawling for many
varied topics at different levels of specificity.
- Focused crawling acquires relevant pages steadily, while standard crawling
(such as that used by first-generation search engines) quickly loses its way,
even when both start from the same root set.
- Focused crawling is robust against large perturbations in the starting set
of URLs; it discovers largely overlapping sets of resources in spite of these
perturbations.
- It can discover valuable resources that are dozens of links away from the
start set, while at the same time carefully pruning the millions of pages that
may lie within the same radius. The result is a very effective way to build
high-quality collections of Web documents on specific topics using modest
desktop hardware.
- Focused crawlers impose sufficient topical structure on the Web that, beyond
naïve topical search, powerful semi-structured query, analysis, and discovery
become possible.
- Retrieving isolated pages rather than comprehensive sites is a common
problem with Web search. With focused crawlers, you can order sites by the
density of relevant pages found there; for example, you can find the top five
sites specializing in mountain biking.
- A focused crawler also detects cases of competition. For instance, it takes
into account that the homepage of an auto manufacturer such as Honda is
unlikely to link to the homepage of a competitor such as Toyota.
- Focused crawlers also identify regions of the Web that grow or change
dramatically, as opposed to those that are relatively stable.
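The site-ordering idea from the list above can be sketched as a toy ranking by relevant-page density. All site names and page counts here are illustrative, invented for the example:

```python
# Hypothetical per-site crawl statistics gathered by a focused crawler.
crawl_results = {
    "trailbike.example": {"pages_crawled": 40,  "relevant": 36},
    "sportsmix.example": {"pages_crawled": 200, "relevant": 50},
    "bikeforum.example": {"pages_crawled": 80,  "relevant": 60},
}

def rank_sites_by_density(results, top_n=5):
    """Order sites by the fraction of crawled pages judged relevant."""
    density = {
        site: stats["relevant"] / stats["pages_crawled"]
        for site, stats in results.items()
    }
    return sorted(density, key=density.get, reverse=True)[:top_n]

print(rank_sites_by_density(crawl_results))
```

Ranking by density rather than raw counts is what surfaces the small site that specializes in a topic ahead of the large site that merely mentions it.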
The ability of focused crawlers to focus on a topical
sub-graph of the Web and to browse communities within that sub-graph will lead
to significantly improved Web resource discovery. By contrast, the
one-size-fits-all philosophy of other search engines, such as AltaVista and
Inktomi, means that they try to cater to every possible query that might be
made on the Web. Although such services are invaluable for their broad coverage, the
resulting diversity of content is often of little relevance or quality.