Web Information Retrieval Projects

Here are commentaries on some of my current projects, with pointers to papers.

Relevance

How can we ask about the "speed of a jaguar" and not run into fine automobiles and football teams?  Popular keyword search engines are only a first step toward harnessing the information in large hyperlinked text repositories.  If we could embed large sections of the web in a structured directory, such as Yahoo!, searches could be constructed using not only keywords but also the topic paths induced by the directory.  Another benefit of such automatic classification is that people can be characterized very compactly by how often they visit pages embedded in various nodes of the directory, and this "profile" can then be used for collaborative search and recommendation.
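As a rough illustration of the profile idea (the data structures and function names below are my own, not part of any deployed system), a user can be represented as a vector of visit counts over directory nodes, and two such profiles can be compared for collaborative recommendation:

    # Hypothetical sketch: a user profile as visit counts over directory
    # topic paths, compared by cosine similarity for recommendation.
    from collections import Counter

    def build_profile(visited_topic_paths):
        """Credit every node on the topic path of each visited page."""
        profile = Counter()
        for path in visited_topic_paths:        # e.g. "Science/Biology/Zoology"
            parts = path.split("/")
            for i in range(1, len(parts) + 1):
                profile["/".join(parts[:i])] += 1
        return profile

    def similarity(p, q):
        """Cosine similarity between two profiles."""
        dot = sum(p[k] * q[k] for k in p if k in q)
        norm = (sum(v * v for v in p.values()) * sum(v * v for v in q.values())) ** 0.5
        return dot / norm if norm else 0.0

    alice = build_profile(["Science/Biology/Zoology", "Science/Biology/Ecology"])
    bob = build_profile(["Science/Biology/Zoology", "Recreation/Autos"])
    print(similarity(alice, bob))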

Classifying web documents turns out to be much harder than classifying documents in standard Information Retrieval benchmarks.  First, to learn a domain as broad as the web, a very large number of examples is needed, and existing classification engines cannot handle gigabyte-sized corpora.  Second, text alone is often deceptive, and the topic of a web page is often better assessed from the link neighborhood of the page.  I have built a fast, scalable hypertext classification engine called HyperClass.  It uses efficient out-of-core data structures to deal with large corpora and a new algorithm for topical analysis of citations to achieve high speed and accuracy.
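The following toy sketch conveys the flavor of link-aware classification; it is not the HyperClass implementation, and all names in it are mine.  Starting from text-only class probabilities, each page's distribution is repeatedly blended with those of its link neighbors:

    # Hypothetical sketch: refine per-page class probabilities using the
    # class distributions of link neighbors (relaxation-style iteration).
    def refine_with_links(text_probs, links, rounds=5, alpha=0.7):
        """text_probs: {page: {label: prob}}; links: {page: [neighbors]}."""
        probs = {p: dict(d) for p, d in text_probs.items()}
        for _ in range(rounds):
            updated = {}
            for page, dist in text_probs.items():
                neigh = [n for n in links.get(page, []) if n in probs]
                blended = {}
                for label, p_text in dist.items():
                    p_neigh = (sum(probs[n].get(label, 0.0) for n in neigh) / len(neigh)
                               if neigh else p_text)
                    blended[label] = alpha * p_text + (1 - alpha) * p_neigh
                z = sum(blended.values()) or 1.0
                updated[page] = {k: v / z for k, v in blended.items()}
            probs = updated
        return probs

    pages = {"p1": {"IR": 0.6, "Autos": 0.4}, "p2": {"IR": 0.9, "Autos": 0.1}}
    print(refine_with_links(pages, {"p1": ["p2"], "p2": ["p1"]}))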

Popularity

Internet directories are popular not only because they are easier to search and navigate, but also because they hand-pick sites and pages of high quality.  The field of bibliometry is concerned with the analysis of citation graphs, typically of academic publications.  Jon Kleinberg designed a system called HITS for hyperlink citation analysis on the web.  HITS assigns two scores of merit to each web page related to a topic: a hub score and an authority score.  A good hub is a useful resource from which to start browsing on a topic.  A good authority is a well-cited, popular page on the topic.
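For concreteness, here is a compact sketch of the basic HITS iteration (the link graph at the end is made up for illustration): authorities accumulate the scores of the hubs that cite them, hubs accumulate the scores of the authorities they cite, and both are normalized each round until they settle.

    # Sketch of the HITS mutual-reinforcement iteration on a tiny toy graph.
    def hits(outlinks, iterations=50):
        """outlinks: {page: [pages it links to]}; returns (hub, authority)."""
        pages = set(outlinks) | {q for qs in outlinks.values() for q in qs}
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iterations):
            # a good authority is cited by good hubs
            auth = {p: 0.0 for p in pages}
            for p, qs in outlinks.items():
                for q in qs:
                    auth[q] += hub[p]
            # a good hub cites good authorities
            hub = {p: sum(auth[q] for q in outlinks.get(p, ())) for p in pages}
            # normalize so the scores stay bounded
            for scores in (hub, auth):
                norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
                for p in scores:
                    scores[p] /= norm
        return hub, auth

    toy_graph = {"a": ["c", "d"], "b": ["c"], "c": ["d"], "d": []}
    hub, auth = hits(toy_graph)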

Web authorship is less regulated and more diverse than academic publishing.  Consequently, the simple model of web pages as nodes and hyperlinks as edges can be improved upon significantly.  A single page, such as this one, can span several topics, say Information Retrieval and Parallel Computing; assigning it one common score of merit would mislead the rating algorithm.  I extended the HITS model so that query-dependent keywords near outlinks influence the amount of authority conferred from one page to another.  The resulting automatic resource compilation system, called Clever, outperformed Yahoo! as judged by two user groups.  This work has received some press recently.
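The sketch below shows one way such query-dependent weighting could look; it is my own illustration of the idea rather than the actual Clever code.  Each link's contribution to authority is scaled by how well the text near the outlink matches the query:

    # Hypothetical sketch: HITS-style updates with link weights derived
    # from query-term matches in the anchor text near each outlink.
    def edge_weight(anchor_text, query_terms):
        words = set(anchor_text.lower().split())
        return 1.0 + sum(1.0 for t in query_terms if t in words)

    def weighted_hits(links, query_terms, iterations=50):
        """links: {page: [(target, anchor_text), ...]}."""
        pages = set(links) | {t for edges in links.values() for t, _ in edges}
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iterations):
            auth = {p: 0.0 for p in pages}
            for p, edges in links.items():
                for target, anchor in edges:
                    auth[target] += edge_weight(anchor, query_terms) * hub[p]
            hub = {p: sum(edge_weight(a, query_terms) * auth[t]
                          for t, a in links.get(p, ())) for p in pages}
            for scores in (hub, auth):
                norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
                for q in scores:
                    scores[q] /= norm
        return hub, auth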