Database and Data Mining Issues in
Hypertext Information Management

Soumen Chakrabarti

IBM Almaden Research Center
San Jose, CA 95120
18 October 1998

Draft: Please do not distribute

1 Introduction

Semi-structured information management and analysis are assuming unprecedented importance as an integral part of database management technology.  Currently measured in billions of dollars, worldwide internet commerce is expected to reach a trillion dollars by the year 2002.  However, with a few notable exceptions, the database research community has maintained a cautious distance from architecting systems for the acquisition, management and analysis of semi-structured and hypertextual data.  The goal of this tutorial is to expose database researchers to existing work in text and hypertext information retrieval (IR) systems and to discuss emerging issues in the overlapping areas of information retrieval, data mining, and databases.  The tutorial is targeted at database researchers interested in data analysis and data mining.  No background in information retrieval is assumed.

1.1 Motivation

This is an excellent time for the database community to get interested in hypertext and semi-structured data management, for many reasons.  While the information systems community is very mature, the explosion in the volume of unstructured information raises performance and scalability issues that few other communities can appreciate [Asilomar98].  It is also an opportune time to architect standards.  Given the great success of the database community with structured data storage and analysis, networked storage, access, and querying may benefit greatly from increased interest among database researchers.

1.2 Outline of the tutorial

We will start with a brief review of classical IR systems, which include text indexing for keyword and phrase search, relevance ranking, and relevance feedback.  Next we will review collaborative filtering and latent semantic indexing, two techniques that exploit a measure of affinity between entities such as documents and people.  This will naturally lead to the study of more standard mining primitives, such as supervised learning (also called classification) and unsupervised learning (also called clustering).

In the second part we will discuss new issues that emerge as we move from text-oriented classical IR to distributed hypermedia. The measure of quality of a search engine has to be modified with hypermedia, since there are now two access paths into documents: searching and following links.  Accordingly, algorithms have to be enhanced to better address the new quality measures.  Similar issues arise in supervised and unsupervised learning.  Citations also convey endorsement; this has been used recently by many researchers to distill out the most popular pages on a broad topic on the web.

In the last part we will discuss how all this ties in with databases.  Three issues will be discussed.  First, we will touch upon the representation problem: how to store and access documents and metadata.  Second, we will give examples of how various text analysis routines can be implemented in a relational language.  Third, we will discuss the efficiency and convenience of using an RDBMS for such purposes as compared with hand-crafted solutions.  We will end the tutorial with a few case studies.

2 Basic text search and filtering

When vendors shout "knowledge management," they in large part mean a program that scans through a set of documents (perhaps acquired via crawling) and constructs an index from each token or term to the set of documents that contain that term.  This index can be used to answer keyword queries.

2.1 Keyword indices

The workhorse of text search is the so-called "inverted index" data structure.  This takes a collection of documents and builds a mapping from each term in the collection to the set of documents containing the term, typically represented as a RID-list or a bitmap.  In its basic form, it enables boolean queries like socks AND network AND NOT shoes.  Most keyword indexing software goes far beyond this.  A popular extension is phrase search, such as "inverted index", and NEAR queries, such as warehouse NEAR mining.  A very important aspect of an industry-grade indexing engine is the ability to maintain the index while documents are added, modified and deleted.  We will discuss the design of disk data structures that enable indexing at the rate of gigabytes per hour and support the above functionalities [Brown95].
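
To make the data structure concrete, here is a minimal in-memory sketch of an inverted index answering the boolean query above.  The toy corpus and whitespace tokenizer are hypothetical, and a production index would of course use disk-resident, incrementally maintainable structures:

from collections import defaultdict

docs = {
    1: "socks configure the network interface",
    2: "buy wool socks and running shoes",
    3: "network latency and socket programming",
}

index = defaultdict(set)          # term -> set of document ids (a RID-list)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# boolean query: socks AND network AND NOT shoes
result = (index["socks"] & index["network"]) - index["shoes"]
print(sorted(result))             # prints [1]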

2.2 Relevance ranking

Boolean queries, even after stemming and stopword elimination, run the risk of insufficient coverage or relevance.  This is especially true of relatively long "natural language" queries.  The frequencies of the query terms in a document are valuable clues that can be used for ranking the query results.  As a simple example, terms that occur in almost all documents in the collection should be paid little attention, and more selective terms should be favored.  The time-honored IR technique to achieve this is a term weighting scheme called TFIDF, for "term frequency (times) inverse document frequency."  Under such a weighting scheme, any document or query can be represented in the so-called "vector space model."  In the vector space model there is one dimension for each term in the corpus, and documents and queries are points in this space.  Answering a query corresponds to finding near neighbors in a high-dimensional space [SaltonM83].
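
The sketch below illustrates one common variant of TFIDF weighting together with cosine similarity in the vector space model.  The toy corpus, the particular weighting formula (tf times log(N/df)), and the lack of stemming and stopword removal are simplifications, not features of any specific engine:

import math
from collections import Counter

docs = [
    "data mining in large databases",
    "mining gold in south africa",
    "database indexing and query processing",
]
N = len(docs)
tokenized = [d.lower().split() for d in docs]

# idf(t) = log(N / df(t)), where df(t) is the number of documents containing t
df = Counter(t for toks in tokenized for t in set(toks))
idf = {t: math.log(N / df[t]) for t in df}

def tfidf(tokens):
    # map a token list to a sparse vector: weight(t) = tf(t) * idf(t)
    tf = Counter(tokens)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

query = tfidf("mining large databases".split())
for i, toks in enumerate(tokenized):
    print(i, round(cosine(query, tfidf(toks)), 3))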

Evaluation of relevance ranking systems is well-studied in IR.  One starts from a corpus with queries for which the "correct" set of relevant documents has been chosen by hand.  A given relevance ranking can be evaluated by considering the first k responses and measuring two fractions: the fraction of documents within the first k that are indeed relevant as per human judgement (called precision) and the fraction of actually relevant documents included in the first k responses (called recall).  Recall and precision are inversely related, and the point where recall equals precision is called the break-even point.  These and other related measures are used to rate search engines on controlled corpora such as TREC and MEDLINE.
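
As a concrete illustration, the following sketch computes precision and recall at rank k for a single query; the ranked response and the hand-judged relevant set are hypothetical, in the spirit of a TREC-style evaluation:

def precision_recall_at_k(ranked_ids, relevant, k):
    top_k = ranked_ids[:k]
    hits = sum(1 for d in top_k if d in relevant)
    return hits / k, hits / len(relevant)      # (precision, recall)

ranked = [7, 3, 12, 5, 9, 2]       # the engine's ranked response
relevant = {3, 5, 11, 12}          # documents judged relevant by hand
for k in (1, 3, 6):
    p, r = precision_recall_at_k(ranked, relevant, k)
    print(k, round(p, 2), round(r, 2))

As k grows, recall rises while precision typically falls; the break-even point summarizes this trade-off.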

2.3 Relevance feedback

In spite of the best efforts made by a search engine, the search keywords entered by the user may not accurately and adequately reflect the user's information need, and the need may change based on the initial response of the system.  In a relevance feedback system, the user enters judgements about the relevance of the initial responses.  This input modifies the ranking function on the fly, and the response is modified accordingly.  This iterative process continues as long as it refines and improves the response.
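
One classical way (not named above, but standard in IR) to fold such judgements into the ranking function is Rocchio's formula, which moves the query vector toward documents marked relevant and away from those marked irrelevant.  The sketch below operates on sparse term-weight vectors; the weights alpha, beta, gamma and the toy vectors are hypothetical:

def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    # all vectors are sparse dicts: term -> weight
    updated = {t: alpha * w for t, w in query.items()}
    for vecs, sign, weight in ((relevant, +1, beta), (irrelevant, -1, gamma)):
        for vec in vecs:
            for t, w in vec.items():
                updated[t] = updated.get(t, 0.0) + sign * weight * w / len(vecs)
    # negative weights are usually clipped to zero
    return {t: w for t, w in updated.items() if w > 0}

q = {"jaguar": 1.0}
liked = [{"jaguar": 0.8, "car": 0.6}]
disliked = [{"jaguar": 0.7, "cat": 0.9}]
print(rocchio(q, liked, disliked))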

2.4 Latent semantic indexing

Natural languages are ambiguous and informal.  Two major problems in keyword querying are synonymy and polysemy.  One may ask about a concept that can be expressed by two or more different terms and phrases ("Where can I get my automobile repaired?" vs. "What is a good place to get my car fixed?").  On the other hand, a token may have multiple meanings and connotations (is "jaguar" the name of a car, a football team, or an animal?).  The former leads to a recall problem; the latter leads to a precision problem.  One approach to handling the former problem is to use a thesaurus of related terms [WordNet93].  This has led to mixed results, failing when the query is accidentally made too broad.

Latent semantic indexing (LSI) reduces such problems by embedding terms and documents in spaces derived by a linear transformation of the original vector space described above [DDLFH90].  The goal of the new representation is to pull two terms close to each other if they co-occur frequently in similar documents, and to pull two documents close together if they share many similar terms.  This definition is circular, but there is a fixed point that can be derived by computing the singular value decomposition (SVD) of the term-document matrix.  Given a corpus, we perform the SVD and store documents in the new coordinate system.  Given a query, we map it to the same coordinate system and perform a near-neighbor search.  Folklore suggests that a few hundred dimensions typically suffice.
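
The sketch below runs LSI on a tiny, hypothetical term-document matrix, keeping k = 2 latent dimensions, and folds a one-word query into the same coordinate system; real corpora are sparse and, as noted above, typically use a few hundred dimensions:

import numpy as np

# rows = terms, columns = documents (raw counts, for simplicity)
terms = ["car", "automobile", "repair", "football"]
A = np.array([
    [2, 0, 1, 0],   # car
    [0, 2, 1, 0],   # automobile
    [1, 1, 2, 0],   # repair
    [0, 0, 0, 3],   # football
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# documents in the k-dimensional latent space (rows of V_k)
doc_vecs = Vtk.T

# fold the query into the same space: q_hat = q U_k diag(1/s_k)
q = np.array([1, 0, 0, 0], dtype=float)      # query consisting of "car" alone
q_hat = q @ Uk @ np.diag(1.0 / sk)

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print([round(cos(q_hat, d), 3) for d in doc_vecs])

In this tiny example the first three documents collapse onto a single latent direction, so the query on "car" also matches the document that mentions only "automobile", while the football document remains orthogonal.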

2.5 Collaborative filtering

The affinity between terms and documents may be replaced by affinity between other kinds of entities.  In collaborative filtering and recommendation systems, the goal is to discover the similarity between users based on the commonality of documents read by them, and the similarity of documents based on the commonality of users reading them [SM95, GNOT92].  This can be used to form user communities with various benefits, e.g., new material can be disseminated to the users most likely to be interested.  As before, the relationship is circular, but in the linear framework it can again be resolved as in LSI.  Practical systems use a host of other techniques as well, including combinations of content-based and collaborative filtering.
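
Beyond the LSI-style linear formulation, many practical recommenders use simple neighborhood methods.  The sketch below is one such hypothetical variant: users are deemed similar if they have read many of the same documents, and a user's nearest neighbors supply the recommendations:

reads = {
    "alice": {"d1", "d2", "d3"},
    "bob":   {"d2", "d3", "d4"},
    "carol": {"d5", "d6"},
}

def jaccard(a, b):
    # overlap between two reading histories
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(user, k=1):
    neighbors = sorted(((jaccard(reads[user], docs), u)
                        for u, docs in reads.items() if u != user),
                       reverse=True)
    picks = set()
    for _, u in neighbors[:k]:
        picks |= reads[u] - reads[user]   # what neighbors read that the user has not
    return picks

print(recommend("alice"))     # prints {'d4'}, contributed by her neighbor bob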

2.6 Supervised learning (classification)

Most web search engines also provide a tree-structured taxonomy of topics.  Populating such a taxonomy automatically has been a long-sought goal in IR.  Automatic text classifiers may be rule-based or statistical.  Some care is needed in the selection of features and the indexing of statistics because of the large number of dimensions (assuming each term is a potential dimension).  The most accurate text classifiers known to date use a relatively new technique called support vector machines [Dumais98].  Rule-based classifiers are next [ApteDW94].  However, very simple classifiers also do quite well, for example, the naive Bayes classifier.  Typically, there is a steep trade-off between accuracy and training speed.  Applications of text classification include routing news articles, helpdesk problem reports, etc.
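
To make the simplest of these concrete, here is a multinomial naive Bayes sketch over a tiny, hypothetical labelled corpus; a realistic classifier would add feature selection and tune smoothing for the very high dimensionality mentioned above:

import math
from collections import Counter, defaultdict

train = [
    ("sports", "the team won the football match"),
    ("sports", "championship game tonight"),
    ("autos",  "new car engine and transmission"),
    ("autos",  "car repair and maintenance tips"),
]

class_docs = Counter()                 # documents per class
term_counts = defaultdict(Counter)     # class -> term -> count
vocab = set()
for label, text in train:
    class_docs[label] += 1
    for t in text.lower().split():
        term_counts[label][t] += 1
        vocab.add(t)

def classify(text):
    tokens = text.lower().split()
    best = None
    for label in class_docs:
        # log P(label) + sum of log P(token | label), with Laplace smoothing
        total = sum(term_counts[label].values())
        score = math.log(class_docs[label] / sum(class_docs.values()))
        for t in tokens:
            score += math.log((term_counts[label][t] + 1) / (total + len(vocab)))
        if best is None or score > best[0]:
            best = (score, label)
    return best[1]

print(classify("football game highlights"))   # expected: sports
print(classify("engine repair shop"))         # expected: autos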

2.7 Unsupervised learning (clustering)

Clustering seeks to find structure, say, in the form of a taxonomy, starting from an amorphous jumble of documents.  Some notion of similarity between documents is a prerequisite.  Clustering algorithms try to assign similar documents to the same cluster and dissimilar ones to different clusters.  Clustering can provide a foundation for taxonomy construction and for finding outlier documents.  Various techniques have been used for clustering: k-means, expectation maximization, neural networks, etc.  One problem with unsupervised text clustering is that the notion of similarity is quite volatile; there is no unique correct answer to a text clustering problem, because depending on the application, terms may assume diverse measures of importance.  (The word "can" is usually treated as a stopword, but is it so in a document collection about waste management?)
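
A bare-bones k-means sketch over two-dimensional "document" vectors follows; the data, the Euclidean distance, and the random seeding are hypothetical simplifications, since real text clustering works on sparse high-dimensional vectors, typically with cosine-style similarity and more careful initialization:

import random

def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest centroid (squared Euclidean)
            i = min(range(k), key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        for i, members in enumerate(clusters):
            if members:
                # recompute each centroid as the mean of its members
                centroids[i] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return clusters

# two obvious groups in a toy two-dimensional "document space"
docs = [(0.1, 0.9), (0.2, 0.8), (0.15, 0.85), (0.9, 0.1), (0.8, 0.2)]
for cluster in kmeans(docs, 2):
    print(cluster)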

3 Hypertext issues

The IR community reached broad consensus on workloads and notions of recall and precision---until the growth of hypermedia.  This has led to a critical re-inspection of the premises underlying many standard IR tasks.

3.1 Search

For non-hyperlinked documents, a keyword index is the only "access method" into the collection, and relatively precise notions of recall and precision can be developed.  With hyperlinked documents, there are two access paths: the keyword index (searching) and hyperlinks (browsing).  Both the computation and the scoring of query responses have to take this into account [Savoy97].

3.2 Supervised learning

Links provide valuable information about the subject of a page that even the text on the page may fail to convey.  For example, the terms near anchors in HTML pages bear important information about the contents of the target URL.  Generalizing text-only learning algorithms to exploit this results in significant boosts in accuracy [CravenSN98, CDI98].
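
As one hypothetical illustration of this idea (not the specific method of the cited papers), a page's own tokens can be augmented with the anchor text of links pointing to it before the combined bag of words is handed to a text classifier:

pages = {
    "p1": "welcome to my homepage",
    "p2": "schedule of graduate courses in databases",
}
# (source, target, anchor text) triples extracted from HTML links
links = [
    ("p2", "p1", "faculty member home page"),
    ("p3", "p1", "database systems professor"),
]

def features(page_id):
    tokens = pages.get(page_id, "").lower().split()
    for src, dst, anchor in links:
        if dst == page_id:
            # prefix anchor tokens so a learner can weight them separately
            tokens += ["anchor:" + t for t in anchor.lower().split()]
    return tokens

print(features("p1"))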

3.3 Unsupervised learning

Clustering a collection of homogeneous, self-contained news articles is quite different from discovering structure within a heterogeneous collection of web pages.  Here again, as in supervised learning, hyperlinks significantly help in identifying aggregate structure on the web.  Various models of document affinity, based on combinations of terms and links, have been proposed [PirolliPR96, WeissVSNSG96].

3.4 Endorsement through citation

In academic journals or a populist hypermedium such as the web, citations reflect not only a connection in terms of relevance of content, but also the conferring of quality or authority [Small73, Larson96].  This idea has been developed into several systems for scoring web pages by popularity [Googol98, Kleinberg98].
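
The sketch below shows a compact hub/authority iteration in the spirit of [Kleinberg98]: a page is a good authority if good hubs point to it, and a good hub if it points to good authorities.  The link graph is hypothetical, and a real system would first assemble a query-specific subgraph from a search engine's response:

def hits(links, iters=20):
    nodes = {n for edge in links for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # authority score: sum of hub scores of pages linking in
        auth = {n: sum(hub[s] for s, t in links if t == n) for n in nodes}
        # hub score: sum of authority scores of pages linked to
        hub = {n: sum(auth[t] for s, t in links if s == n) for n in nodes}
        for vec in (auth, hub):          # normalize to keep scores bounded
            norm = sum(v * v for v in vec.values()) ** 0.5 or 1.0
            for n in vec:
                vec[n] /= norm
    return hub, auth

links = [("a", "c"), ("b", "c"), ("b", "d"), ("c", "d")]
hub, auth = hits(links)
print(sorted(auth.items(), key=lambda kv: -kv[1]))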

4 Database issues

We will study how relational databases can be used to implement many of the functions of the text repositories we have reviewed so far.  The aspects to be discussed include the representation of documents and metadata, the expression of text analysis routines such as indexing and ranking in a relational language, and the efficiency and convenience of an RDBMS-based implementation compared with hand-crafted solutions; a small illustration follows.
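
As a flavor of the second aspect, the sketch below expresses TFIDF-style ranking over an inverted index stored as a relation doc_term(doc_id, term, tf).  It uses Python's built-in sqlite3 purely for illustration; the schema, the data, and the exact scoring formula are hypothetical:

import math
import sqlite3

con = sqlite3.connect(":memory:")
con.create_function("ln", 1, math.log)    # expose a logarithm to SQL

con.execute("CREATE TABLE doc_term (doc_id INTEGER, term TEXT, tf INTEGER)")
con.executemany("INSERT INTO doc_term VALUES (?, ?, ?)", [
    (1, "mining", 3), (1, "data", 2),
    (2, "mining", 1), (2, "gold", 4),
    (3, "data", 1),   (3, "warehouse", 2),
])
con.execute("CREATE TABLE qterm (term TEXT)")
con.executemany("INSERT INTO qterm VALUES (?)", [("mining",), ("data",)])

# score(doc) = sum over matching query terms of tf * ln(N / df)
rows = con.execute("""
    WITH n  AS (SELECT COUNT(DISTINCT doc_id) AS n FROM doc_term),
         df AS (SELECT term, COUNT(DISTINCT doc_id) AS df
                FROM doc_term GROUP BY term)
    SELECT d.doc_id, SUM(d.tf * ln(1.0 * n.n / df.df)) AS score
    FROM doc_term d
    JOIN qterm q ON q.term = d.term
    JOIN df      ON df.term = d.term
    CROSS JOIN n
    GROUP BY d.doc_id
    ORDER BY score DESC
""").fetchall()
print(rows)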

5 Case studies

We will study a few commercial text storage, indexing, and analysis systems, discussing, where known, their architecture and functionality.

6 References

[ApteDW94] C. Apte, F. Damerau and S. M. Weiss. 
Automated learning of decision rules for text categorization. 
SIGIR 1994.  IBM Research Report RC 19481.
[Asilomar98] P. Bernstein, M. Brodie, S. Ceri, D. DeWitt, M. Franklin, H. Garcia-Molina, J. Gray, J. Held, J. Hellerstein, H. V. Jagadish, M. Lesk, D. Maier, J. Naughton, H. Pirahesh, M. Stonebraker, and J. Ullman. 
The Asilomar report on database research. 
http://epoch.CS.Berkeley.EDU:8000/postgres/papers/Asilomar_Final.htm
[Brint98] WWW virtual library on knowledge management, edited by Y. Malhotra, at www.brint.com/km/.
[Brown95] E. Brown. 
Execution Performance Issues in Full-Text Information Retrieval. 
PhD Thesis, University of Massachusetts at Amherst, Tech. Rep. 95-81, Oct. 1995.
[CDI98] S. Chakrabarti, B. Dom, and P. Indyk. 
Enhanced hypertext categorization using hyperlinks.   SIGMOD 1998.
[CDGKRR97] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan and S. Rajagopalan. 
Automatic resource compilation by analyzing hyperlink structure and associated text.  WWW7, 1998.
[DDLFH90] S. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas and R. A. Harshman. 
Indexing by latent semantic analysis. 
Journal of the American Society for Information Science, 41(6), pp. 391-407, 1990. 
http://www.cs.utk.edu/~lsi/
[Dumais98] S. Dumais.  Using SVMs for text categorization. 
In: Support Vector Learning, IEEE Intelligent Systems, 13(4), July-August 1998. 
http://www.research.microsoft.com/~sdumais/ieee-tc98.doc
[Excite96] Excite search engine at www.excite.com
[FrakesB92] W. B. Frakes and R. Baeza-Yates. 
Information retrieval: Data structures and algorithms.  Prentice-Hall, 1992.
[Googol98] Googol search engine at backrub.stanford.edu
[Kleinberg98] J. Kleinberg. 
Finding authoritative sources in a hyperlinked environment.  SODA 1998.
[Larson96] R. Larson.  Bibliometrics of the World Wide Web: An Exploratory Analysis of the Intellectual Structure of Cyberspace.
Proceedings of the 1996 Annual ASIS Meeting.
http://sherlock.berkeley.edu/asis96/asis96.html
[PirolliPR96] P. Pirolli, J. Pitkow, R. Rao.
Silk from a sow's ear: Extracting usable structures from the Web.
Proceedings of ACM SIGCHI Conference on Human Factors in Computing. 1996.
http://www.acm.org:82/sigs/sigchi/chi96/proceedings/papers/Pirolli_2/pp2.html 
[SaltonM83] G. Salton and M. J. McGill. 
Introduction to Modern Information Retrieval.  McGraw-Hill, 1983.
[Savoy97] J. Savoy.  Ranking Schemes in Hybrid Boolean Systems: A New Approach. 
Journal of the American Society for Information Sciences, 48(3), 1997, pp. 235-253
[SM95] U. Shardanand and P. Maes. 
Social information filtering: Algorithms for automating word of mouth. 
Computer Human Interaction (CHI) 1995.
[Small73] Henry Small. Co-citation in the scientific literature: a new measure of the relationship between two documents.  Journal of the American Society for Information Sciences 24, pp.265-269, Jul-Aug 1973.
[WebSQL] http://www.cs.toronto.edu/~websql/
[WeissVSNSG96] R. Weiss, B. Velez, M. Sheldon, C. Nemprempre, P. Szilagyi, D. K. Gifford.
HyPursuit: A Hierarchical Network Search Engine that Exploits Content-Link Hypertext Clustering. Proceedings of the 7th ACM Conference on Hypertext. 1996. 
[WordNet93] WordNet: A Lexical Database for English 
http://www.cogsci.princeton.edu/~wn/