Tentative title:
Hypertext retrieval and mining

Format:
Graduate elective; fourth-year undergraduates (UG) should also be allowed
to take it for credit.
2-3 hours of lectures per week, weekly lab assignments, a project, and a
final exam.

Prerequisites:
* UG probability and statistics (required)
* UG design and analysis of algorithms (required)
* UG databases (preferred)

Related courses:
* UG and PG AI, pattern recognition, machine learning
* PG Data Mining course in SIT

Tentative syllabus:

Traditional information retrieval, inverted indices, vector space
model.  Recall and precision.  The Internet, Web, HTTP and HTML.  Need
for further research.

Models for hypertext and semistructured data.  Distinctions between
data-centric and document-centric views.

Supervised learning for text.  Kinds of classifiers, strengths and
weaknesses.  Feature selection, motivations, models, and algorithms.
Bayesian learners.  Modeling feature dependence.  Exploiting topic
hierarchies.  Parameter estimation, smoothing and shrinkage.  Maximum
entropy and support-vector classifiers.

Unsupervised learning for text.  Definitions of clustering problems.
Distance measures and their weaknesses.  Top-down and bottom-up
methods.  Problems of high dimensionality.  Projection-based speedup
techniques.  Probabilistic formulations, expectation maximization (EM)
approach.  Incorporating hyperlink information and user feedback.

Semi-supervised learning.  Application of EM.  Exploiting hyperlink
information for relaxation labeling and co-training.  Performance
implications.  Reinforcement learning.

Social network analysis.  Citation indexing.  Notions of prestige and
centrality.  Applications to the Web.  Incorporating textual
information and hypertext tag structure.  Generalized proximity
search.  Models of evolution of social networks.  Sampling and
measurement techniques for the Web.

Resource discovery.  Content-based goal-directed crawling.  Topical
locality and its use in resource discovery.  Learning context graphs.

Information extraction.  Pattern matching vs. probabilistic models.
Markov models.  Hierarchical models.  Record segmentation.  Use of
dictionaries, lexical networks, and knowledge bases for better
accuracy.

References:

Papers to be distributed.
Books below for occasional reference.

Data on the Web: From Relations to Semistructured Data and XML
(Morgan Kaufmann Series in Data Management Systems)
Serge Abiteboul, Peter Buneman, Dan Suciu
Hardcover - 257 pages (October 1999)
Morgan Kaufmann Publishers; ISBN: 155860622X

Modern Information Retrieval (ACM Press Series)
Ricardo Baeza-Yates, Berthier Ribeiro-Neto
Textbook Binding - 513 pages 1st edition (May 1999)
Addison-Wesley Pub Co; ISBN: 020139829X

Managing Gigabytes: Compressing and Indexing Documents and Images
(Morgan Kaufmann Series in Multimedia Information and Systems)
Ian H. Witten, Alistair Moffat, Timothy C. Bell
Hardcover - 600 pages 2nd edition (May 1999)
Morgan Kaufmann Publishers; ISBN: 1558605703

Readings in Information Retrieval
(Morgan Kaufmann Series in Multimedia Information and Systems)
Karen Sparck Jones (Editor), Peter Willett (Editor)
Paperback - 600 pages (July 1997)
Morgan Kaufmann Publishers; ISBN: 1558604545