Web Information Retrieval Projects
Here is commentary on some of my current projects, with pointers to papers.
Relevance
How can we ask about the "speed of a jaguar" and not run into fine
automobiles and football teams?
Popular keyword search engines are only a beginning to harnessing the information
in large hyperlinked text repositories. If we could embed large sections
of the web in a structured directory, such as Yahoo!,
searches could be constructed using not only keywords but also the topic
paths induced by the directory. Another benefit of such automatic
classification is that people can be characterized very compactly by how
often they visit pages embedded in various nodes of the directory, and
this "profile" can then be used for collaborative search and recommendation.
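As a rough illustration of such a profile, the sketch below counts how often a user visits pages filed under each directory node and compares two users by cosine similarity. The topic paths, the function names `profile` and `similarity`, and the choice of cosine similarity are assumptions made for this example, not a description of any system mentioned here.

```python
import math
from collections import Counter


def profile(visited_topics):
    """Turn a list of directory nodes (one per page visit) into visit frequencies."""
    counts = Counter(visited_topics)
    total = sum(counts.values())
    return {topic: count / total for topic, count in counts.items()}


def similarity(p, q):
    """Cosine similarity between two profiles (dicts mapping topic -> frequency)."""
    dot = sum(weight * q.get(topic, 0.0) for topic, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


# Hypothetical visit logs: each entry is the directory node of one visited page.
alice = profile(["Science/Biology", "Science/Biology", "Recreation/Autos"])
bob = profile(["Science/Biology"])
carol = profile(["Recreation/Autos"])

print(similarity(alice, bob), similarity(alice, carol))
```

Users with similar profiles could then be grouped so that pages one of them found useful are recommended to the others.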
Classifying web documents turns out to be much more difficult than classifying
the documents in standard Information Retrieval benchmarks.
First, to learn a domain as broad as the web, very many examples are needed,
and existing classification engines cannot handle gigabyte-sized corpora.
Second, text alone is often deceptive, and the topic of a web page is often
better assessed based on the link neighborhood of the page. I have built a
fast, scalable hypertext classification engine called HyperClass. It uses
efficient out-of-core data structures to deal with large corpora and a new
algorithm for topical analysis of citations to achieve high speed and
accuracy.
Popularity
Internet directories are popular not only because they are easier to search
and navigate, but also because they hand-pick sites and pages of high quality.
The field of bibliometry is concerned with the analysis of citation
graphs, typically in academic publications. Jon
Kleinberg designed a system called HITS for hyperlink citation analysis
on the web. HITS assigns two scores of merit to each web page related
to a topic: a hub score and an authority score. A good
hub is a useful resource to start browsing on a topic. A good authority
is a well-cited, popular page on the topic.
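The mutual reinforcement between hubs and authorities can be sketched as a simple iteration: a page's authority score is the sum of the hub scores of the pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The code below is a minimal sketch of that idea on a toy graph; the function names, the fixed iteration count, and the example pages are assumptions for illustration.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length so they do not grow without bound."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of pages pointing at this page.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub: sum of the (updated) authority scores of the pages it points to.
        new_hub = {p: sum(new_auth[d] for d in links.get(p, [])) for p in pages}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Toy graph: two link lists pointing at two candidate authorities.
toy_links = {"list1": ["siteA", "siteB"], "list2": ["siteA"]}
hub, auth = hits(toy_links)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

On this toy graph the page cited by both lists ends up with the higher authority score, and the list that points to both sites ends up as the better hub.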
Web authorship is less regulated and more diverse than academic publications.
Consequently, the simple model of web pages as nodes and hyperlinks as
edges can be significantly improved upon. For example, this page can be
segmented into Information Retrieval and Parallel Computing sections;
assigning the whole page a common score of merit would mislead the rating
algorithm. I extended the HITS
model so that query-dependent keywords near outlinks influence the
notion of authority conferred from one page to another. The resulting
automatic resource compilation system, called Clever, outperformed Yahoo!
as judged by two user groups. This work has received some press recently.
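One way to picture the keyword extension is to give each hyperlink a weight that grows with the number of query terms found in the text near it, and then run the hub and authority iteration over weighted edges. The sketch below assumes a weight of one plus the number of matching words; that formula, the whitespace tokenization, and all names in the code are illustrative assumptions and may differ from what Clever actually does.

```python
import math


def unit(scores):
    """Scale scores to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def link_weight(text_near_link, query_terms):
    """Assumed weighting: 1 plus the number of query-term occurrences near the link."""
    terms = {t.lower() for t in query_terms}
    return 1 + sum(1 for word in text_near_link.lower().split() if word in terms)


def weighted_hits(edges, query_terms, iterations=50):
    """edges is a list of (source, target, text_near_link) triples."""
    weighted = [(s, d, link_weight(ctx, query_terms)) for s, d, ctx in edges]
    pages = {p for s, d, _ in weighted for p in (s, d)}

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Links surrounded by query keywords confer more authority.
        new_auth = {p: 0.0 for p in pages}
        for s, d, w in weighted:
            new_auth[d] += w * hub[s]
        new_hub = {p: 0.0 for p in pages}
        for s, d, w in weighted:
            new_hub[s] += w * new_auth[d]
        auth = unit(new_auth)
        hub = unit(new_hub)
    return hub, auth


# Hypothetical page "p" with two outlinks in differently worded contexts.
edges = [
    ("p", "x", "jaguar speed in the wild"),
    ("p", "y", "my favorite car dealer"),
]
hub, auth = weighted_hits(edges, ["jaguar", "speed"])
print(auth["x"] > auth["y"])
```

Because the weight is attached to each link rather than to the whole page, a page covering several unrelated topics passes on authority mainly through the links whose surrounding text matches the query.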