\documentclass[pdftex]{article}
\usepackage{alltt,hyperref,color}
\textwidth 6.5in \textheight 9in
\oddsidemargin 0in \evensidemargin 0in
\topmargin 0in \headsep 0pt \headheight 0pt
\parskip .8ex plus .3ex
\renewcommand{\topfraction}{.9}
\renewcommand{\dbltopfraction}{.9}
\renewcommand{\bottomfraction}{.9}
\renewcommand{\textfraction}{.1}

\def\parent{\mathop{\smash{\rm parent}}\nolimits}
\def\good{\mathop{\smash{\rm good}}\nolimits}
%\def\exturl#1{\htmladdnormallink{#1}{#1}}
\def\exturl#1{\url{#1}}
\def\altavista{Alta\,Vista}
\def\yahoo{Yahoo!}
\def\apache{Apache}

\def\figpath{figure}
\long\def\todo#1{\textcolor{red}{#1}}

\begin{document}

\title{Automatic Web Resource Discovery}
\author{Soumen Chakrabarti\thanks{Email: \texttt{soumen@cs.berkeley.edu}}}
\date{IBM Almaden Research Center}
\maketitle



\section{Introduction}

Classical information retrieval (IR) is concerned with indexing a
collection of documents and answering queries by returning a ranked
list of relevant documents~\cite{FrakesB92,WordNet593,SaltonM83}.
With the growth of the web, the problems of ambiguity, context
sensitivity, synonymy and polysemy that are inherent in natural
languages, together with the \emph{abundance} of web pages related to
prominent topics, have exacerbated the difficulty of narrowing
down the user's information need.

%Stopword elimination and word morphing or stemming have been deviced
%to facilitate more robust matches~\cite{WordNet,Frakes}.  The user may
%also provide relevance feedback into the system after and initial
%query by marking the initial responses as good or bad; this can be
%used by the system to compute a more informed query
%response~\cite{RAT}. Although widely used for reference searches in libraries

%The emphasis on relevance, coverage and quality
%has never been as critical as on the web today.

Most search sites
%Following \yahoo, most portals such as \altavista, Excite, HotBot and others
have added directory-based topic browsing.  The web is organized as
a tree of topics, similar to the Dewey decimal
system, the Library of Congress catalog, or the
Patent and Trademarks Office subject codes.  Tree nodes are
maintained by ontologists, ranging from
paid college grads (\yahoo) to specialist volunteers (The Mining Co.).
Directory sites are very popular.  Because of human judgement,
pages placed in a topic directory tend to be not only relevant, but
also exemplary and influential.  However, human judgement is
slow, subjective and noisy.  It is labor-intensive to keep
these directories comprehensive.  The \yahoo{} strategy is
adequate for common users and shallow topics.  However, the
information needs of web-savvy users who are using the web for serious
research for extended periods are very hard to address.  The Mining
Co. strategy may be biased because of sparsity of experts; at
any rate it is biased away from the most accomplished and busiest
people.

%Some organizations such as \yahoo{} directly employ
%ontologists, whereas others like the Mining Co. and the Mozilla
%Directory Project solicits specialist volunteers to develop topics
%autonomously.  In either case, this is a labor-intensive process.



\section{Topic distillation}

The web is an example of a \emph{social network}.
Social networks have been extensively
researched~\cite{WassermanF93}.  Social networks are formed between
academics by co-authoring, advising, serving on committees; by movie
personnel by directing and acting; by musicians, football stars,
friends and relatives; infection transmitted by contact;
papers citing other papers, and web pages
hyperlinked to other web pages.
Social network theory is in part concerned with finding
influential nodes in the graph, representing important papers,
core of epidemics, etc.


\subsection{Social network analysis in IR}

IR literature includes insightful
studies of citation, co-citation, and influence of academic
publication~\cite{Larson96}.  Starting in 1996, a series of applications
of social network analysis were made to the web graph, with the
purpose of identifying the most authoritative sites and pages about a
user query.


\paragraph*{Google:}
If one wanders on the web for infinite time,
following a random link out of each page, then
%Imagine one had infinite time to browse the web.  Starting at some
%arbitrary page, one of the out-links is chosen at random to visit the
%next page.  Over an infinite number of link clicks,
different pages will be visited at different rates; popular pages with
many in-links will tend to be visited more often.  PageRank and
Google, invented by Brin and Page~\cite{BrinP98} crawl the
web and simulate such a random walk on the web graph in order to
estimate the visitation rate, which is used as a score of popularity.
Given a keyword query, matching documents are
ordered by this  score.


\paragraph*{HITS:}
Hyperlink induced topic search (HITS) \cite{Kleinberg98} is slightly
different: it does not crawl or pre-process the web, but depends on a
search engine.  A query to HITS is forwarded to \altavista, which
retrieves a subgraph of the web whose nodes (pages) match the query.
Pages citing or cited by these pages are also included.  This expanded
graph is analyzed for popular nodes using a procedure similar to
Google, the difference being that not one, but two scores emerge: the
measure of a page being an authority, and the measure of a page being
a \emph{hub} (a compilation of links to authorities).


\paragraph*{ARC and CLEVER:}
HITS's graph expansion sometimes leads to topic \emph{contamination}
or \emph{drift}.  E.g., the community of \emph{movie awards} pages on
the web is closely knit with highly cited (and to some extent
relevant) home pages of movie \emph{companies}.  Although \emph{movie
awards} is a finer topic than \emph{movies},
the top movie companies emerge as the victors upon running HITS.
This is partly because in HITS (and Google)
all edges in the graph have the same importance.
Contamination can be reduced by recognizing that hyperlinks that contain
\emph{award} or \emph{awards} near the anchor text are more relevant for
this query than other edges.  Such heuristic modification of edge
weights significantly improve the quality of query results.  In user
studies, the results compared favorably with lists compiled by humans,
such as \yahoo{} and Infoseek~\cite{Clever1}.


\paragraph*{Bharat and Henzinger:}
Another way to integrate textual content to avoid contamination of the
graph is to model each page according to the ``vector space''
model~\cite{SaltonM83}, and then prune the graph expansion at nodes
whose corresponding vectors are outliers with respect to the set of
vectors corresponding to documents directly retrieved from the search
engine~\cite{BharatH98}.  Impressive improvements in precision have
been observed.


\subsection{Discussion}

In principle, all the topic distillation systems discussed above can
handle ad-hoc queries, because of the dependence on a search engine in
the first step.  In practice, these systems produce good answers if
the query response induces a graph that is dense enough to derive
robust popularity scores.  Because it uses a precomputed page rank
independent of the query, Google is the fastest system.  Graph
construction at query time is involved in all the other methods.
Bharat and Henzinger describe several heuristics that cut down the
query time substantially.




\section{Portholes and focused crawling}

All the distillation systems depend on large, comprehensive web crawls
and indices.  The best crawlers running on heavy-duty servers can
cover 35--40\% of the web, currently over 340 million
pages~\cite{BharatB98}.  Googol has crawled about 60 million pages to
date.  There is a sobering maturity that size is not
everything~\cite{Gillmor98,Mirapaul98,LidskyS99}; seasoned users need deep,
specific \emph{portholes}, not shallow, generic \emph{portals}.  There
is a great need for topic-specific crawlers that
build focused libraries of web resources.  We will first discuss
why this cannot be done using topic distillation and then discuss a
new approach called \emph{focused crawling}.




\subsection{Exploiting topic distillation}

Topic distillation systems work well
for well-connected communities concerning broad topics.
%Many of the pages listed in the major web directories can be verified
%to already belong to large and dense subgraphs of the web, comprising
%many good hubs and authorities.
It is thus tempting to use a topic distillation system to generate a
web taxonomy in the following trivial way: with each node in
the taxonomy, associate a keyword query that describes the topic,
and run a distillation program.  While the simplicity is
appealing, it is not easy to make this succeed.

First, constructing the query involves trial and error, and a fair
amount of thought.  E.g., we needed the query \texttt{+"power suppl*"
"switch* mode" smps -multiprocessor* "uninterrupt* power suppl*" ups
-parcel} to do a good job for
\texttt{/Business\&Economy/Companies/Electronics/PowerSupplies}.
(\texttt{SMPS} stands for Switch Mode Power Supply, but also matches
pages containing \texttt{SMPs}, or Symmetric Multi-Processors!
Similarly with UPS and parcel.)  In a study with 966 nodes from
\yahoo, in order to match (in the opinion of blind testers) the
quality of \yahoo, queries had to be tuned by hand until the average
query had 7.03 terms and 4.34 operators (\texttt{"+-*}), in sharp
contrast to the average \altavista{} query having 2.35 words and only
0.41 operators.  These queries are not a one-time effort, because
inclusion of additional topic vocabulary, which may not be known a
priori, improved the results.  E.g., good results were obtained for
``European Airlines'' using the query \texttt{+lufthansa +iberia +klm}
(the fourth response from \altavista{} was itself a hub).

The second issue arises from the aforesaid susceptibility to
contamination from popular but irrelevant nodes.  The contamination
problem can be addressed in a few ad-hoc ways: stop-sites, term
weighting and edge weighting.  Stop-sites are nodes forcibly removed
from the graph before the iterations.  Query terms can be assigned
weights by humans to be used for better ranking.  The weights of links
incident with example pages, or lexically close to links to example
pages, can be increased artificially.  These fixes are ad-hoc; there
are no principles for setting edge weights guided by a precise model
of hyperlinkage.



\subsection{Learning from example}

The most successful way to combat contamination has been the use of
examples.  In 86\% of the test cases, specifying an example page
improved the results.

An example page offers more than just a forced node in the web graph
(as it was used above).  It has textual content, and is linked to
neighbors with more textual content.  In fact, extensive literature in
relevance feedback, automatic feature selection and classification
suggests that given examples, it is not even necessary to provide a
keyword query---the learning process will implicitly recognize the
important terms and compute the decision boundaries that can be used
to determine whether a given web page is relevant to a topic.

These decision boundaries can be derived implicitly and without any
human effort, given the example documents.  Most types of classifiers,
such as nearest neighbor (NN)~\cite{Swami},
bayesian~\cite{TAPER98,KollerS97}, support vector~\cite{SVM98}, are
capable of more reliable discrimination than keyword search, even if
boolean constructs were used.  (Boolean search induces hard and
brittle rules that tend to overfit, whereas well-designed learning
algorithms protect against overfitting.)

In the hypertext domain, it is
extremely important to build models and classifiers that take
link-based features into account.  The topic of a page influences its
text and the topics of pages in its neighborhood.  The latter
influence induces circularity, but this can be resolved by an
\emph{iterative relaxation} algorithm such as
HyperClass~\cite{HyperClass}.  Classification error reduced from 36\%
to 21\% in an experiment with US Patents.  With \yahoo, a more
dramatic reduction from 69\% to 20\% was observed.



\subsection{Guiding a focused crawler}

Provided a crawler is started off from connected examples of  topics,
it can be guided and scheduled by HyperClass.
The goal of a focused crawler is to
selectively seek out pages that are relevant to a pre-defined set of
\emph{topics}.  The topics are specified not using keywords, but using
exemplary documents that are analyzed by HyperClass.
The focused crawler analyzes its crawl boundary to find
the links that are likely to be most relevant for the crawl.  It
avoids irrelevant regions of the web.  This leads to significant
savings in hardware and network resources, and helps keep the crawl
more up-to-date.

Extensive focused-crawling experiments have been performed using
several topics at different levels of
specificity~\cite{Focus2}.  The system acquires relevant pages
steadily while standard crawling quickly loses its way, even though
they are started from the same root set.  Focused crawling is robust
against large perturbations in the starting set of URLs.  It discovers
largely overlapping sets of resources in spite of these perturbations.
In contrast with topic distillation systems, it is also capable of
exploring out and discovering valuable resources that are dozens of
links away from the start set, while carefully pruning the millions of
pages that may lie within this same radius.  These results suggest
that focused crawling is very effective for building high-quality
collections of web documents on specific topics, using modest desktop
hardware.


\begingroup \footnotesize
\bibliographystyle{abbrv}
\bibliography{acmcs}
\endgroup


\end{document}


problem spec and desirable features in the web domain
brief overview of main techniques-- k-nn, cosine, tfidf, bayes,
bayes net, neural net, support vector
popular wisdom and performance
hypertext and new challenges


relevance can be judged for each ad-hoc query, as in classical ir
role of topic directories and taxonomies
disambiguate queries
precompute frequent query answers
add intentional attributes to search
collect compact user profiles
targeted marketing
community formation



The initial euphoria about ``knowledge management'' is now all but
history, and there is evidence of realization that search technology
has been relatively stagnant in the last couple of years.  There is
also a sobering maturity that size isn't everything.

\begin{quotation}
Their biggest strength---massive size and reach---can also be a
drawback.  Even the best of the wide-ranging portals, such as Yahoo,
can't catalog every relevant site for every topic.
%Unless you're an
%expert at searching, even the best search engines, such as AltaVista,
%can overwhelm you with too many results that too often are irrelevant
%or out of date.
The most interesting trend is the growing sense of natural limits, a
recognition that covering a single galaxy can be more practical---and
useful---than trying to cover the entire universe.
%Sometimes, less really is more.
\begin{flushright}
Dan Gillmor \\
Technology Columnist, San Jose Mercury News
\end{flushright}

The emergence of portholes will be one of the major Internet trends of
1999.  As people become more savvy users of the Net, they want things
which are better focused on meeting their specific needs.  We're going
to see a whole lot more of this, and it's going to potentially erode
the user base of some of the big portals.

\begin{flushright}
Jim Hake\\
Founder, Global Information Infrastructure Awards
\end{flushright}

What's on the horizon?
... a continued proliferation of sites focused on individual needs.
\begin{flushright}
David Lidsky And Nancy Sirapyan \\
ZDNet
\end{flushright}

\end{quotation}

