I am SOUMEN CHAKRABARTI, anagram for ANARCHISM
OUTBREAK, a faculty member in the Department of Computer Science.
If you are from industry looking for
consultation, please read the section
titled Consultative practice rules and norms (1996)
and my informal notes.
If you are looking to join
CSE@IITB as a
PhD scholar, please
read about the PhD
Qualifier model being adopted by the department,
and contact the department office directly.
PhD admissions are centrally coordinated at the department.
I do not offer short-term projects
or summer internships to students not enrolled at IIT
Bombay. Such emails will be discarded.
If you are an IIT student looking for a
project or seminar within the scope of
your program (BTech, DD, MTech), please read
these guidelines first.
You can check my calendar for
free slots and, if you have permission,
propose a meeting here
or by email.
The best way to contact me is to send mail to
(please note that I am on a
low-spam diet). Please use
only email to initiate a conversation with me if we
haven't communicated before. Only in case of an emergency, you can
call me at +91-22-2576-7716 or fax me at +91-22-2572-0022. If you are
visiting, here are directions to my
Education and career
- Don Bosco School, Park Circus, Calcutta, 1975--1987
- Indian Institute of Technology,
- University of California,
- Research Center, 1996--1999
- IIT Bombay,
- University, Spring 2004
- Searching the annotated Web with entities, types and relations
- We are building CSAW, a new search system
that integrates type and role annotations with keyword matches,
thereby exploiting lexical ontologies and entity taggers. Supported
by Yahoo!, HP Labs, Google, Microsoft, SAP and NetApp.
- Graph conductance search
- Rich connections between random walks, graph eigensystems, and
electrical networks make it attractive to apply them for ranking
nodes. PageRank is a prominent example of the paradigm. In PageRank,
the edge weights are fixed and we have to compute steady state
probabilities of nodes. What if we have
something like the opposite problem?
And how can this be made fast at query time?
and Microsoft (2007, 2008).
- Integrating IR with databases
- In the BANKS project,
we proposed new paradigms of keyword search in graphs that can
represent text embedded in relational or XML-like data.
- The effect of search engines on the Web graph and page popularity
- Search engines are influenced by the (in)degree of Web pages, but
their ranked lists modulate page popularity and eventually their
(in)degree, setting up a feedback loop. Might the evolution
of the Web graph be influenced substantially by the existence of
search engines? Is there a need to regulate monopolies? What are
healthy economic objectives, and how to optimize them?
- Focused crawlers to build topic-specific portals
- A focused crawler collects a topic-specific
subgraph of the Web by coupling classifiers and reinforcement learners
with crawlers. An open-source focused crawler project was started at
the Lab. for Intelligent
Internet Research and is available.
- Mining hypertext to estimate topics and popularity
- I built a hypertext
classifier that uses the text in and links around a given Web
page to label it with a topic. This was an early application of
Markov networks to Web analysis. As a member of the
Project, I worked on
to analyze the links around a web page and the text in pages that
cite the given page to assign it a measure of popularity.
- Compiling and running parallel scientific programs
- In a previous life, my PhD thesis
was on the design and implementation of compilers and
systems for distributed memory multiprocessors.
Seems like distributed parallel computing is hot again,
thanks to "Big Data"!
- Journal editorship
- Conference organization
- CIKM 2014, area chair for text and Web data mining.
- EMNLP 2013,
area chair for information retrieval and question answering.
- WWW 2013, track chair for
search, systems and applications.
- SIGIR 2011,
area chair for Web IR and social media search.
- WWW 2010, program co-chair with
2010, senior PC member.
- Web Search APIs:
The Next Generation --- A panel discussion at
- SIGIR 2009,
Area Chair, Machine Learning for IR.
- WSDM 2008 ("wisdom"),
Program Co-chair with
- VLDB 2007,
- ECML-PKDD 2006,
Area Chair, Track for mining links, graphs, trees and
- WWW 2006,
Deputy Chair, Data Mining track.
- COMAD 2005b,
Associate Program Chair.
- WWW 2003,
Vice Chair, Searching and Mining track.
- ICDE 2003,
Vice Chair, Data, Text and Web Mining track.
- WWW 2002, Deputy Chair,
Searching, Querying and Indexing
- Conference committee/reviewing
WSDM 2014 (senior PC),
SIGKDD 2013 (senior PC),
WSDM 2013 (senior PC and
SIGKDD 2012 (senior PC),
(PC and invited applications talks committee),
WSDM 2009 (senior PC),
SIGKDD 2008 (senior PC),
SIGIR 2008 (senior PC),
SIGKDD 2006 (senior PC),
VLDB 2003 (IIS),
- Web Search and Data Mining (WSDM) steering committee member, 2008--2013.
- ACM SIGKDD
Curriculum Committee Member.
But the power of instruction is seldom of much efficacy,
except in those happy dispositions where it is almost superfluous.
The Decline And Fall Of The Roman Empire
Volume 1, Chapter 4.
- Web Search and Mining has been expanded to a two-semester
sequence, shorthanded WMa (Autumn) and WMb (Spring). WMa retains the
old course code, but has been planned from scratch. WMb will be
largely about information extraction and integration, and querying
over semistructured and graphical data representations.
WMa Autumn 2009,
WMb Spring 2010,
WMa Autumn 2010,
WMb Spring 2011,
WMa Autumn 2011,
WMa Spring 2013,
WMa Autumn 2013,
WMb Spring 2014.
- Statistical Foundations of Machine Learning:
Autumn 2005, Autumn 2006, Autumn 2007, Autumn 2008.
- Web Search and Mining (earlier called
Information Retrieval and Mining for Hypertext and the Web):
- Undergraduate Programming Languages,
- Computer programming and utilization aka
- Graduate Software Lab:
... your work is to keep cranking the flywheel that turns the gears
that spin the belt in the engine of belief that keeps you and your desk in midair.
The Writing Life.
- Quantity Queries on Web Tables: Annotation, Response and Consensus Models. With Sunita Sarawagi. SIGKDD 2014.
- Discriminative Link Prediction using Local Links, Node Features and Community Structure.
With Abir De and Niloy Ganguly. ICDM 2013.
- Bootstrapping of Corpus Annotations and Entity Types.
With Siddhanth Jain and Hrushikesh Mohapatra.
- Web-scale Entity Annotation Using MapReduce.
With Shashank Gupta and Varun Chandramouli.
- Learning Joint Query Interpretation
and Response Ranking. With Uma Sawant.
- Compressed Data Structures for Annotated
Web Search. With
Sasidhar Kasturi, Bharath Balakrishnan,
Ganesh Ramakrishnan, and Rohit Saraf. WWW 2012.
- Diversity in ranking via
resistive graph centers.
With Avinava Dubey and Chiru Bhattacharyya.
SIGKDD 2011. (Source code
is available, contact Avinava Dubey for usage details.)
- SCAD: Collective Discovery of Attribute Values.
With Anton Bakalov, Ariel Fuxman, and Partha Talukdar.
- Index Design and Query Processing for Graph Conductance Search.
With Amit Pathak and Manish Gupta.
VLDB Journal, 2010.
- Annotating and Searching Web Tables Using Entities, Types and
Relationships. With Girija Limaye and Sunita Sarawagi.
- Conditional Models for
Non-smooth Ranking Loss Functions.
With Avinava Dubey, Jinesh Machchhar, and Chiru Bhattacharyya.
ICDM 2009, Miami.
- Learning to rank for
quantity consensus queries.
With Somnath Banerjee and Ganesh Ramakrishnan.
SIGIR 2009, Boston.
- Collective annotation of
Wikipedia entities in Web text.
With Sayali Kulkarni, Amit Singh and Ganesh Ramakrishnan.
SIGKDD 2009, Paris.
- Text search enhanced with types and entities. Chapter in
Text Mining: Theory, Application, and Visualization,
Srivastava and Sahami, eds., 2008.
- closed form bounds on the partition function.
With Dvijotham Krishnamurthy and Subhasis Chaudhuri.
ECML/PKDD 2008, Antwerp.
Winner of the best
student paper award.
- Structured Learning
for Non-Smooth Ranking Losses.
With Rajiv Khanna, Uma Sawant and Chiru Bhattacharyya.
SIGKDD 2008, Las Vegas.
- Learning to rank in vector spaces and social networks.
- Focused Web Crawling. Entry in the
Database Systems, 2008.
- influence of search engines on preferential attachment.
With Alan Frieze and Juan Vera.
Internet Mathematics, volume 3, number 3 (2006--2007), pages 361--381.
A preliminary version
appeared in SODA 2005.
- Learning Random Walks to Rank
Nodes in Graphs. With Alekh Agarwal.
- Dynamic Personalized Pagerank
in Entity-Relation Graphs.
WWW 2007, Banff.
- Accelerating Newton optimization for
log-linear models through feature redundancy. With Arpit Mathur.
IEEE ICDM 2006,
- Learning parameters in entity-relationship
graphs from ranking preferences. With Alekh Agarwal.
- Learning to rank networked entities.
With Alekh Agarwal and Sunny Aggarwal.
SIGKDD Conference 2006,
- Scoring Functions and Indexes for Proximity Search in Type-annotated
Corpora. With Kriti Puniyani and Sujatha Das.
WWW 2006, Edinburgh.
- Answer Type Inference from Questions using Sequential Models.
With Vijay Krishnan and Sujatha Das.
- Bidirectional Expansion For Keyword Search on Graph Databases.
With Varun Kacholia, Shashank Pandit, S. Sudarshan,
Rushi Desai and Hrishikesh Karambelkar. VLDB 2005.
- Shuffling a Stacked Deck: The Case for Partially Randomized
Ranking of Search Engine Results.
With Sandeep Pandey, Sourashis Roy, Chris Olston, and Junghoo Cho.
- Is question answering an
With Ganesh Ramakrishnan, Deepa Paranjpe, and
New York City.
- Fast and accurate text classification
via multiple linear discriminant projections.
With Shourya Roy and Mahesh Soundalgekar.
VLDB Journal, 12(2), pages 170--185
[conference version, talk slides].
- Learning Probabilistic Mappings Between Topics.
With Sunita Sarawagi and Shantanu Godbole.
SIGKDD Conference 2003,
- Monitoring the Dynamic Web
to respond to Continuous Queries.
With Sandeep Pandey and Krithi Ramamritham.
Budapest, Hungary, May 2003.
- Accelerated focused
crawling through online relevance feedback.
With Kunal Punera and Mallela Subramanyam.
WWW 2002, Hawaii.
- The structure of
broad topics on the Web.
With Mukul Joshi, Kunal Punera, and David M. Pennock.
WWW 2002, Hawaii.
- Searching and Browsing in Databases using BANKS.
With Gaurav Bhalotia, Charuta Nakhe, Arvind Hulgeri, and S. Sudarshan.
In ICDE 2002. Also see the BANKS
home page. Winner of
the ICDE 2012
influential paper award.
- topic distillation using text, markup tags, and hyperlinks.
With Mukul M. Joshi and Vivek B. Tawde.
In SIGIR 2001
- Document Object Model with hyperlinks for enhanced
topic distillation and information extraction.
In the 10th International World Wide Web
Conference, Hong Kong, May 2001.
- Memex: A browsing assistant
for collaborative archiving and mining of surf trails.
With Sandeep Srivastava, Mallela Subramanyam and Mitul Tiwari.
Demo at VLDB 2000.
- Data mining for hypertext:
A tutorial survey.
Explorations, 1(2), pages 1--11, 2000.
- Memex to archive and mine community Web browsing experience.
With Sandeep Srivastava, Mallela Subramanyam and Mitul Tiwari.
In the 9th International World Wide Web
Conference, Amsterdam, May 2000.
bookmarking companies founded long after this paper:
Furl, Simpy, Citeulike, etc.
- the Web's Link Structure. With Byron E. Dom, S. Ravi Kumar,
Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, David Gibson,
and Jon Kleinberg. In
vol. 32, no. 8, August 1999
- Distributed Hypertext Resource
Discovery Through Examples.
With Martin van den Berg and Byron Dom.
VLDB 1999, Edinburgh, Scotland.
- the Web. With
Byron Dom, S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan,
Andrew Tomkins, Jon M. Kleinberg, and David Gibson.
Invited paper in Scientific American,
- the Web Backwards. With D. A. Gibson and K. S. McCurley.
In WWW 1999.
- Focused crawling: A
new approach to topic-specific Web resource discovery. With M. van
den Berg and B. Dom. WWW 1999,
Toronto, May 1999. Winner of the
best paper award. Also see the
Upcoming and recent talks and travel
- Keynote talk at
- Tutorial on Query Interpretation and
Representation for Searching the Web of Objects at
WWW 2013, Rio de Janeiro.
- WWW 2010 Conference, NC, April 2010.
talk at WSDM 2010, NYC, February 2010.
- WWW 2010 PC meeting, Salt Lake City, Utah,
- WWW 2009
- SIGIR 2008 PC meeting, University of Maryland, March 2008.
- WSDM 2008, Stanford University, February 2008.
- Tutorial on Learning to rank in vector
spaces and social networks at WWW
- Keynote talk at WAW
and a short
course at Banff, Nov 2006.
- Invited talk at the
on Intelligent Information Access, Helsinki, July 2006.
- Invited talk at the ICML 2005 workshop on Learning in Web Search.
- Invited talk at the ICML 2005 workshop on
Learning and Extending Lexical Ontologies
by using Machine Learning Methods.
- Panel discussion on exploiting dynamic
networking effects in Web advertising at
- Invited talk and position paper at
in Pisa, Sept. 2004.
- Short course on
machine learning for hypertext applications at
in Saarbrücken, Sept. 2004.
structures in data mining. A tutorial presented at
2004 with Christos
- Text search for
fine-grained semi-structured data.
A tutorial presented at VLDB 2002.
- Beyond hubs and authorities: spreading out and zooming in.
Invited talk at
ICDT International Workshop
on Web Dynamics, London, Jan. 2001.
- Data Mining and Learning on the Web. NIPS Workshop, Denver,
Dec. 2000. By invitation.
content-based collaborative communities on the Web.
Invited talk at the Joint
Conference on Empirical Methods in Natural Language Processing and
Very Large Corpora
(EMNLP/VLC), Hong Kong, Oct. 7--8, 2000.
- Hypertext data mining:
A tutorial presented at the
Conference, Boston, August 2000.
- Hypertext databases and hypertext data mining.
SIGMOD 1999 Tutorial.
- Method and system for searching unstructured
textual data for quantitative answers to queries.
- System and method for focussed web crawling.
- Enhanced hypertext categorization using hyperlinks.
- System and method for scheduling web servers with a
quality-of-service guarantee for each user.
- Method for interactively creating an information database including
preferred information elements, such as, preferred-authority,
world wide web pages.
- Method for cataloging, filtering, and relevance ranking frame-based
hierarchical information structures.
- Multilevel taxonomy based on features derived from training documents
classification using fisher values as discrimination values.
- System and method for mining surprising temporal patterns.
- Feature diffusion across hyperlinks.
Links in areas of interest