In the second part we will discuss new issues that emerge as we move from text-oriented classical IR to distributed hypermedia. The measure of quality of a search engine must be adapted for hypermedia, since there are now two access paths into documents: searching and following links. Accordingly, ranking algorithms have to be enhanced to address these new quality measures. Similar issues arise in supervised and unsupervised learning. Citations also convey endorsement; several researchers have recently exploited this observation to distill the most popular pages on a broad topic on the web.
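The idea of links as endorsement can be sketched with a minimal HITS-style iteration in the spirit of [Kleinberg98]: good hubs point to good authorities, and good authorities are pointed to by good hubs. The tiny graph below is purely illustrative, not from the tutorial.

```python
# Toy web graph: page -> pages it links to (illustrative data only).
links = {
    "a": ["c", "d"],
    "b": ["c", "d"],
    "c": ["d"],
    "d": [],
}
pages = list(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(20):  # power iteration; converges quickly on small graphs
    # Authority score: sum of hub scores of pages linking in.
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # Hub score: sum of authority scores of pages linked to.
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # Normalize so the scores do not blow up.
    na = sum(v * v for v in auth.values()) ** 0.5
    nh = sum(v * v for v in hub.values()) ** 0.5
    auth = {p: v / na for p, v in auth.items()}
    hub = {p: v / nh for p, v in hub.items()}

print(max(auth, key=auth.get))  # page "d" attracts the most endorsement
```

Here page "d", which is cited by three pages including the well-cited "c", emerges as the strongest authority.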
In the last part we will discuss how all of this ties in with databases. Three issues will be discussed. First, we will touch upon the representation problem: how to store and access documents and metadata. Second, we will give examples of how various text-analysis routines can be implemented in a relational language. Third, we will discuss the efficiency and convenience of using an RDBMS for such purposes as compared with hand-crafted solutions. We will end the tutorial with a few case studies.
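As one hedged illustration of a text-analysis routine in a relational language, the sketch below computes TF-IDF scores over a posting table using SQL (via SQLite). The schema, table name `posting`, and sample data are assumptions made for illustration, not taken from the tutorial.

```python
import sqlite3
import math

# Assumed relational representation: posting(docid, term, tf) holds
# per-document term frequencies.  Data below is a toy example.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE posting (docid INTEGER, term TEXT, tf INTEGER)")
con.executemany("INSERT INTO posting VALUES (?, ?, ?)", [
    (1, "database", 3), (1, "query", 1),
    (2, "database", 1), (2, "hypertext", 2),
    (3, "hypertext", 1), (3, "link", 2),
])
# Register a natural-log function so the score is expressible in SQL.
con.create_function("LN", 1, math.log)

# Document frequency and a TF-IDF score, computed entirely in SQL.
rows = con.execute("""
    SELECT p.docid, p.term,
           p.tf * LN((SELECT COUNT(DISTINCT docid) FROM posting) * 1.0
                     / df.df) AS tfidf
    FROM posting p
    JOIN (SELECT term, COUNT(DISTINCT docid) AS df
          FROM posting GROUP BY term) df
      ON p.term = df.term
    ORDER BY p.docid, tfidf DESC
""").fetchall()
for r in rows:
    print(r)
```

The point of the sketch is that document frequency is just a GROUP BY aggregate, so the scoring step inherits the optimizer and storage machinery of the RDBMS for free.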
Evaluation of relevance ranking systems is well-studied in IR. One starts from a corpus with queries for which the "correct" set of relevant documents has been chosen by hand. A given relevance ranking can be evaluated by considering the first k responses and measuring two fractions: the fraction of documents within the first k that are indeed relevant as per human judgement (called precision), and the fraction of actually relevant documents included in the first k responses (called recall). Recall and precision are inversely related; the point where recall equals precision is called the break-even point. These and other related measures are used to rate search engines on controlled corpora such as TREC and MEDLINE.
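These two measures can be sketched in a few lines; the ranked list and relevance judgements below are made up for illustration.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall over the first k documents of a ranking."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    precision = hits / k
    recall = hits / len(relevant)
    return precision, recall

# Toy ranking of ten documents; human judges marked four as relevant.
ranked = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = {"d1", "d3", "d5", "d7"}
for k in (2, 4, 8):
    p, r = precision_recall_at_k(ranked, relevant, k)
    print(k, p, r)
```

On this toy data, precision equals recall (both 0.5) at k = 4, so that is the break-even point; pushing k to 8 raises recall to 1.0 while precision stays at 0.5, showing the usual trade-off.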
Latent semantic indexing (LSI) reduces such problems by embedding terms and documents in spaces derived by a linear transformation of the original vector space described above [DDLFH90]. The goal of the new representation is to pull two terms close to each other if they co-occur frequently in similar documents, and to pull two documents close together if they share many similar terms. This definition is circular, but it has a fixed point that can be derived by computing the singular value decomposition (SVD) of the term-document matrix. Given a corpus, we perform the SVD and store documents in the new coordinate system. Given a query, we map it to the same coordinate system and perform a near-neighbor search. Folklore suggests that a few hundred dimensions typically suffice.
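The LSI pipeline can be sketched on a toy term-document matrix; the four terms, four documents, and the choice of k = 2 latent dimensions are illustrative assumptions, far below the "few hundred" used in practice.

```python
import numpy as np

# Toy term-document matrix A (rows = terms, columns = documents);
# entries are term counts.  The vocabulary is made up for illustration.
A = np.array([
    [2, 1, 0, 0],   # "database"
    [1, 2, 0, 0],   # "query"
    [0, 0, 2, 2],   # "graph"
    [0, 0, 1, 1],   # "link"
], dtype=float)

# Truncated SVD: A ~= U_k @ diag(s_k) @ Vt_k.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Documents in the latent space: columns of diag(s_k) @ Vt_k.
docs = (np.diag(s_k) @ Vt_k).T                # shape (n_docs, k)

# Fold a query vector into the same space: q_hat = q @ U_k @ diag(1/s_k).
q = np.array([1, 1, 0, 0], dtype=float)       # query: "database query"
q_hat = q @ U_k @ np.diag(1.0 / s_k)

# Near-neighbor search: rank documents by cosine similarity.
def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

sims = [cos(q_hat, d) for d in docs]
print(sims)  # the first two documents score highest
```

Because the query shares latent structure only with the first two documents, they come out far ahead of the other two under cosine similarity.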
[ApteDW94] C. Apte, F. Damerau, and S. M. Weiss. Automated learning of decision rules for text categorization. SIGIR 1994. Also IBM Research Report RC 19481.
[Asilomar98] P. Bernstein, M. Brodie, S. Ceri, D. DeWitt, M. Franklin, H. Garcia-Molina, J. Gray, J. Held, J. Hellerstein, H. V. Jagadish, M. Lesk, D. Maier, J. Naughton, H. Pirahesh, M. Stonebraker, and J. Ullman. The Asilomar report on database research. http://epoch.CS.Berkeley.EDU:8000/postgres/papers/Asilomar_Final.htm
[Brint98] WWW virtual library on knowledge management, edited by Y. Malhotra. www.brint.com/km/
[Brown95] E. Brown. Execution Performance Issues in Full-Text Information Retrieval. Ph.D. thesis, University of Massachusetts at Amherst, Tech. Rep. 95-81, Oct. 1995.
[CDI98] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. SIGMOD 1998.
[CDGKRR97] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic resource compilation by analyzing hyperlink structure and associated text. WWW7, 1998.
[DDLFH90] S. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391--407, 1990. http://www.cs.utk.edu/~lsi/
[Dumais98] S. Dumais. Using SVMs for text categorization. IEEE Intelligent Systems, 13(4), July-August 1998. http://www.research.microsoft.com/~sdumais/ieee-tc98.doc
[Excite96] Excite search engine at www.excite.com
[FrakesB92] W. B. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 1992.
[Googol98] Googol search engine at backrub.stanford.edu
[Kleinberg98] J. Kleinberg. Authoritative sources in a hyperlinked environment. SODA 1998.
[Larson96] R. Larson. Bibliometrics of the World Wide Web: An exploratory analysis of the intellectual structure of cyberspace. Proceedings of the 1996 Annual ASIS Meeting. http://sherlock.berkeley.edu/asis96/asis96.html
[PirolliPR96] P. Pirolli, J. Pitkow, and R. Rao. Silk from a sow's ear: Extracting usable structures from the Web. Proceedings of the ACM SIGCHI Conference on Human Factors in Computing, 1996. http://www.acm.org:82/sigs/sigchi/chi96/proceedings/papers/Pirolli_2/pp2.html
[SaltonM83] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[Savoy97] J. Savoy. Ranking schemes in hybrid Boolean systems: A new approach. Journal of the American Society for Information Science, 48(3):235--253, 1997.
[SM95] U. Shardanand and P. Maes. Social information filtering: Algorithms for automating word of mouth. CHI 1995.
[Small73] H. Small. Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24:265--269, Jul-Aug 1973.
[WebSQL] http://www.cs.toronto.edu/~websql/
[WeissVSNSG96] R. Weiss, B. Velez, M. Sheldon, C. Namprempre, P. Szilagyi, and D. K. Gifford. HyPursuit: A hierarchical network search engine that exploits content-link hypertext clustering. Proceedings of the 7th ACM Conference on Hypertext, 1996.
[WordNet93] WordNet: A lexical database for English. http://www.cogsci.princeton.edu/~wn/