Tentative title: Hypertext retrieval and mining

Format:
Graduate elective, would like to allow 4th year UG to take for credit.
2-3h lecture per week, weekly lab assignments, project and final exam.

Prerequisites:
* UG probability and statistics (required)
* UG design and analysis of algorithms (required)
* UG databases (preferred)

Related courses:
* UG and PG AI, pattern recognition, machine learning
* PG Data Mining course in SIT

Tentative syllabus:

Traditional information retrieval, inverted indices, vector space model. Recall and precision. The Internet, Web, HTTP and HTML. Need for further research.

Models for hypertext and semistructured data. Distinctions between data-centric and document-centric views.

Supervised learning for text. Kinds of classifiers, strengths and weaknesses. Feature selection, motivations, models, and algorithms. Bayesian learners. Modeling feature dependence. Exploiting topic hierarchies. Parameter estimation, smoothing and shrinkage. Maximum entropy and support-vector classifiers.

Unsupervised learning for text. Definitions of clustering problems. Distance measures and their weaknesses. Top-down and bottom-up methods. Problems of high dimensionality. Projection-based speedup techniques. Probabilistic formulations, expectation maximization (EM) approach. Incorporating hyperlink information and user feedback.

Semi-supervised learning. Application of EM. Exploiting hyperlink information for relaxation labeling and co-training. Performance implications. Reinforcement learning.

Social network analysis. Citation indexing. Notions of prestige and centrality. Applications to the Web. Incorporating textual information and hypertext tag structure. Generalized proximity search. Models of evolution of social networks. Sampling and measurement techniques for the Web.

Resource discovery. Content-based goal-directed crawling. Topical locality and its use in resource discovery. Learning context graphs.

Information extraction. Pattern matching vs. probabilistic models. Markov models. Hierarchical models. Record segmentation. Use of dictionaries, lexical networks, and knowledge bases for better accuracy.

References:
Papers to be distributed. Books below for occasional reference.

Data on the Web: From Relations to Semistructured Data and XML (Morgan Kaufmann Series in Data Management Systems)
Serge Abiteboul, Peter Buneman, Dan Suciu
Hardcover, 257 pages (October 1999)
Morgan Kaufmann Publishers; ISBN: 155860622X

Modern Information Retrieval (ACM Press Series)
Berthier Ribeiro-Neto, Ricardo Baeza-Yates
Textbook binding, 513 pages, 1st edition (May 1999)
Addison-Wesley Pub Co; ISBN: 020139829X

Managing Gigabytes: Compressing and Indexing Documents and Images (Morgan Kaufmann Series in Multimedia Information and Systems)
Ian H. Witten, Alistair Moffat, Timothy C. Bell
Hardcover, 600 pages, 2nd edition (May 1999)
Morgan Kaufmann Publishers; ISBN: 1558605703

Readings in Information Retrieval (Morgan Kaufmann Series in Multimedia Information and Systems)
Karen Sparck Jones (Editor), Peter Willett (Editor)
Paperback, 600 pages (July 1997)
Morgan Kaufmann Publishers; ISBN: 1558604545