This article is the concluding piece of the series on
Web-information management. The first two articles in the series covered the
technologies that powered the first-generation search engines and how
second-generation search engines exploit social-network analysis to mine
relevant information effectively. In this article we discuss
focused crawling, which promises to contribute to our information-foraging
endeavors. We also look at another technology, Memex, which lets you use
your past surfing experience to search for relevant information on the Web.
How focused crawling works
Focused crawling concentrates on the quality of information
and the ease of navigation rather than on the sheer quantity of content on the
Web. A focused crawler seeks, acquires, indexes, and maintains pages on a
specific set of topics that represent a relatively narrow segment of the Web.
A distributed team of focused crawlers, each specializing in one or a few
topics, can thus manage the entire content of the Web.
Rather than collecting and indexing all accessible Web
documents in order to answer every possible ad hoc query, a focused crawler
analyzes its crawl boundary to find the links that are likely to be most
relevant to the crawl, and avoids irrelevant regions of the Web. Focused
crawlers selectively seek out pages relevant to a predefined set of
topics; these pages form a personalized web within the World Wide Web.
Topics are specified to the focus system's console using exemplary
documents and pages rather than keywords.
This way of functioning yields significant savings in
hardware and network resources, yet achieves respectable coverage at a rapid
rate, simply because there is relatively little to do. Each focused crawler is
far more nimble in detecting changes to pages within its focus than a crawler
that covers the entire Web.
The crawler is built upon two hypertext mining programs: a
classifier that evaluates the relevance of a hypertext document with respect to
the focus topics, and a distiller that identifies hypertext nodes that are good
access points to many relevant pages within a few links.
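The core idea of prioritizing the crawl frontier by classifier score can be sketched as follows. This is a minimal illustration, assuming a toy in-memory "web" and a simple keyword-overlap classifier; all page names, topic words, and the relevance threshold are hypothetical, not the actual system described in the article.

```python
import heapq

# Hypothetical in-memory web: page -> (text, outgoing links).
WEB = {
    "seed": ("mountain biking trails guide", ["a", "b"]),
    "a":    ("mountain biking gear reviews", ["c", "d"]),
    "b":    ("celebrity gossip news", ["d"]),
    "c":    ("biking trails map", []),
    "d":    ("stock market report", []),
}

# Toy topic definition; a real classifier would be trained on exemplary pages.
TOPIC = {"mountain", "biking", "trails"}

def classify(text):
    """Toy classifier: fraction of topic words appearing in the page text."""
    return len(set(text.split()) & TOPIC) / len(TOPIC)

def focused_crawl(seed, threshold=0.3):
    """Visit pages in order of estimated relevance, pruning off-topic ones."""
    frontier = [(-1.0, seed)]   # max-heap via negated scores
    visited, relevant = set(), []
    while frontier:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        text, links = WEB[url]
        if classify(text) >= threshold:
            relevant.append(url)
            # Only expand the links of pages judged on-topic.
            for link in links:
                if link not in visited:
                    link_text, _ = WEB[link]
                    heapq.heappush(frontier, (-classify(link_text), link))
    return relevant

print(focused_crawl("seed"))  # → ['seed', 'a', 'c']
```

Note how the crawler reaches the relevant page `c` two links from the seed while never expanding the off-topic pages `b` and `d`, mirroring the prune-while-exploring behavior described above.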
What focused crawlers can do
Here is what we found when we used focused crawling for many
varied topics at different levels of specificity.
- Focused crawling acquires relevant pages steadily, while standard crawling
(such as that used by first-generation search engines) quickly loses its way,
even when both start from the same root set.
- Focused crawling is robust against large perturbations in the starting set
of URLs; it discovers largely overlapping sets of resources in spite of these
perturbations.
- It can discover valuable resources that are dozens of links away from the
start set, while at the same time carefully pruning the millions of pages that
may lie within the same radius. The result is a very effective way to build
high-quality collections of Web documents on specific topics using modest
desktop hardware.
- Focused crawlers impose sufficient topical structure on the Web that, beyond
naïve topical search, powerful semi-structured query, analysis, and discovery
become possible.
- Retrieving isolated pages rather than comprehensive sites is a common
problem with Web search. With focused crawlers, you can order sites by the
density of relevant pages found there; for example, you can find the top five
sites specializing in mountain biking.
- A focused crawler also detects cases of competition. For instance, it takes
into account that the homepage of an auto manufacturer such as Honda is
unlikely to link to the homepage of a competitor such as Toyota.
- Focused crawlers also identify regions of the Web that grow or change
dramatically, as opposed to those that are relatively stable.
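The site-ordering idea from the list above can be sketched as a toy ranking by relevant-page density. All site names and page counts here are illustrative, invented for the example:

```python
# Hypothetical per-site crawl statistics gathered by a focused crawler.
crawl_results = {
    "trailbike.example": {"pages_crawled": 40,  "relevant": 36},
    "sportsmix.example": {"pages_crawled": 200, "relevant": 50},
    "bikeforum.example": {"pages_crawled": 80,  "relevant": 60},
}

def rank_sites_by_density(results, top_n=5):
    """Order sites by the fraction of crawled pages judged relevant."""
    density = {
        site: stats["relevant"] / stats["pages_crawled"]
        for site, stats in results.items()
    }
    return sorted(density, key=density.get, reverse=True)[:top_n]

print(rank_sites_by_density(crawl_results))
```

Ranking by density rather than raw counts is what surfaces the small site that specializes in a topic ahead of the large site that merely mentions it.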
The ability of focused crawlers to focus on a topical
sub-graph of the Web and to browse communities within that sub-graph will lead
to significantly improved Web resource discovery. By contrast, the
one-size-fits-all philosophy of other search engines, such as AltaVista and
Inktomi, means that they try to cater to every possible query that might be
made on the Web. Although such services are invaluable for their broad coverage, the
resulting diversity of content is often of little relevance or quality.