Course Information

CS 635: Information Retrieval & Mining for Hypertext & the Web

Traditional information retrieval, inverted indices, vector space model. Recall and precision. The Internet, Web, HTTP and HTML. Need for further research. Models for hypertext and semistructured data. Distinctions between data-centric and document-centric views.

Supervised learning for text. Kinds of classifiers, strengths and weaknesses. Feature selection, motivations, models, and algorithms. Bayesian learners. Modelling feature dependence. Exploiting topic hierarchies. Parameter estimation, smoothing and shrinkage. Maximum entropy and support-vector classifiers.

Unsupervised learning for text. Definitions of clustering problems. Distance measures and their weaknesses. Top-down and bottom-up methods. Problems of high dimensionality. Projection-based speedup techniques. Probabilistic formulations, expectation maximization (EM) approach. Incorporating hyperlink information and user feedback.

Semi-supervised learning. Application of EM. Exploiting hyperlink information for relaxation labelling and co-training. Performance implications. Reinforcement learning.

Social network analysis. Citation indexing. Notions of prestige and centrality. Applications to the Web. Incorporating textual information and hypertext tag structure. Generalized proximity search. Models of evolution of social networks. Sampling and measurement techniques for the Web.

Resource discovery. Content-based goal-directed crawling. Topical locality and its use in resource discovery. Learning context graphs.

Information extraction. Pattern matching vs. probabilistic models. Markov models. Hierarchical models. Record segmentation. Use of dictionaries, lexical networks, and knowledge bases for better accuracy.

Serge Abiteboul, Peter Buneman, Dan Suciu, Data on the Web: From Relations to Semistructured Data and XML, Morgan Kaufmann Publishers, October 1999.

Berthier Ribeiro-Neto, Ricardo Baeza-Yates, Modern Information Retrieval (ACM Press Series), Addison-Wesley Pub Co, May 1999.

Ian H. Witten, Alistair Moffat, Timothy C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, Morgan Kaufmann Publishers, May 1999.

Karen Sparck Jones (Editor), Peter Willett (Editor), Readings in Information Retrieval, Morgan Kaufmann Publishers, July 1997.


MA 212 or equivalent, CS 301 or equivalent, CS 317 or equivalent
Other Details

Duration: Full Semester
Total Credit: 6
Type: Theory
Autumn Semester 2019-20

Status: Offered
Instructor: Prof. Soumen Chakrabarti
Spring Semester 2019-20

Status: Not Offered
Instructor: ---

Last Modified Date: 15-Jul-2013

