I am SOUMEN CHAKRABARTI, anagram for ANARCHISM
OUTBREAK, a faculty member in the Department of Computer Science.
If you are from industry looking for consultation,
our research and
development site, my informal
notes, and a sample
If you are looking to join
CSE@IITB as a
PhD scholar, please
read about the
procedure and the
Qualifier model being adopted by the department,
and contact the department office directly.
PhD admissions is centrally coordinated at the department
I do not offer short-term projects
or summer internships to students not enrolled at IIT
Bombay. Such emails will be discarded.
If you are an IIT student looking for a
project or seminar within the scope of
your program (BTech, DD, MTech), please read
these guidelines first.
You can check my calendar for
free slots and, if you have permission,
propose a meeting here
or by email.
The best way to contact me is to send mail to
(please note that I am on a
low-spam diet). Please use
only email to initiate a conversation with me if we
haven't communicated before. Only in case of an emergency, you can
call me at +91-22-2576-7716 or fax me at +91-22-2572-0022. If you are
visiting, here are directions to my
Education and career
Don Bosco School,
Park Circus, Calcutta, 1975–1987
Indian Institute of Technology,
- University of California,
Research Center, 1996–1999
- IIT Bombay,
University, Spring 2004
- Google, Mountain View,
Current research interests
- Better embedding representation for passages, entities, types
- We are studying how to embed entities, types and relations to
infer new edges in knowledge graphs. We are investigating how to
represent passages to compare them.
- Complex multi-modal question answering
- With IBM Research, I am exploring how to translate complex
queries involving knowledge base access, arithmetic and logical
operations into structured programs with memory.
- Code-switched text analysis
- Indian languages borrow heavily from English, resulting in
"code switching" languages like Hinglish, Benglish, etc., the
lingua franca of social media. We are investigating how to
improve standard NLP tasks by generating synthetic code-switched
text, and designing multi-task low-supervision recurrent
- Searching the annotated Web with entities, types and relations
- We built CSAW, a search system
that integrates type and role annotations with keyword matches,
thereby exploiting lexical ontologies and entity taggers.
- Graph conductance search
- Rich connections between random walks, graph eigensystems, and
electrical networks make it attractive to apply them for ranking
nodes. PageRank is a prominent example of the paradigm. In PageRank,
the edge weights are fixed and we have to compute steady state
probabilities of nodes. What if we have
something like the opposite problem?
And how to make this fast at query time?
and Microsoft (2007, 2008).
- Integrating IR with databases
- In the BANKS project,
we proposed new paradigms of keyword search in graphs that can
represent text embedded in relational or XML-like data.
- The effect of search engines on the Web graph and page popularity
- Search engines are influenced by the (in)degree of Web pages, but
their ranked lists modulate page popularity and eventually their
(in)degree, setting up a partial feedback loop. Might the evolution
of the Web graph be influenced substantially by the existence of
search engines? Is there a need to regulate monopolies? What are
healthy economic objectives, and how to optimize them?
- Focused crawlers to build topic-specific portals
- A focused crawler collects a topic-specific
subgraph of the Web by coupling classifiers and reinforcement learners
with crawlers. An open-source focused crawler project was started at
the Lab. for Intelligent
Internet Research and is available.
- Mining hypertext to estimate topics and popularity
- I built a hypertext
classifier that uses the text in and links around a given Web
page to label it with a topic. This was an early application of
Markov networks to Web analysis. As a member of the
Project, I worked on
to analyze the links around a web page and the text in pages that
cite the given page to assign it a measure of popularity.
- Compiling and running parallel scientific programs
- In a previous life, my PhD thesis
was on the design and implementation of compilers and
systems for distributed memory multiprocessors.
Seems like distributed parallel computing is hot again,
thanks to "Big Data"!
- Journal editorship
- Conference/workshop organization
- ACL 2020, Area Chair, Information Retrieval and Text Mining.
- Linguistics Meets Image and Video Retrieval. Workshop
at ICCV 2019.
- IJCAI 2019, area chair.
- WWW 2017, poster track co-chair with Mounia Lalmas
and Wei Chen.
- CIKM 2014, area chair for text and Web data mining.
- EMNLP 2013,
area chair for information retrieval and question answering.
- WWW 2013, track chair for
search, systems and applications.
- SIGIR 2011,
area chair for Web IR and social media search.
- WWW 2010, program co-chair with
2010, senior PC member.
- Web Search APIs:
The Next Generation — A panel discussion at
- SIGIR 2009,
Area Chair, Machine Learning for IR.
- WSDM 2008 ("wisdom"),
Program Co-chair with
- VLDB 2007,
- ECML-PKDD 2006,
Area Chair, Track for mining links, graphs, trees and
- WWW 2006,
Deputy Chair, Data Mining track.
- COMAD 2005b,
Associate Program Chair.
- WWW 2003,
Vice Chair, Searching and Mining track.
- ICDE 2003,
Vice Chair, Data, Text and Web Mining track.
- WWW 2002, Deputy Chair,
Searching, Querying and Indexing
- Conference committee/reviewing
IJCAI 2020 (senior PC),
WSDM 2018 (test of time awards),
SIGIR 2017 (awards),
SIGKDD 2017 (awards),
WSDM 2017 (awards),
WSDM 2014 (senior PC);
SIGKDD 2013 (senior PC),
WSDM 2013 (senior PC and
SIGKDD 2012 (senior PC),
(PC and invited applications talks committee),
WSDM 2009 (senior PC);
SIGKDD 2008 (senior PC),
SIGIR 2008 (senior PC),
SIGKDD 2006 (senior PC);
VLDB 2003 (IIS),
- Web Search and Data Mining (WSDM) steering committee member, 2008–2013.
- ACM SIGKDD
Curriculum Committee Member.
But the power of instruction is seldom of much efficacy,
except in those happy dispositions where it is
The Decline And Fall Of The Roman Empire
Volume 1, Chapter 4.
- Web Search and Mining has been expanded to a two-semester
sequence, shorthanded WMa (Autumn) and WMb (Spring). WMa retains the
old course code, but has been planned from scratch. WMb will be
largely about information extraction and integration, and querying
over semistructured and graphical data representations.
WMa Autumn 2009,
WMb Spring 2010,
WMa Autumn 2010,
WMb Spring 2011,
WMa Autumn 2011,
WMa Spring 2013,
WMa Autumn 2013,
WMb Spring 2014,
WMa Autumn 2016,
WMb Spring 2017,
WMa Autumn 2017,
WMb Spring 2018,
WMb Spring 2019,
WMa Autumn 2019,
WMb Spring 2020 (partly online),
WMa Autumn 2020 (online).
- Statistical Foundations of Machine Learning:
- Web Search and Mining (earlier called
Information Retrieval and Mining for Hypertext and the Web):
- Undergraduate Programming Languages,
- Computer programming and utilization aka
- Undergrad software lab: Autumn 2018.
- Graduate software lab:
... your work is to keep cranking the flywheel that turns the gears
that spin the belt in the engine of belief that keeps you and your desk
The Writing Life.
- Deep Exogenous and
Endogenous Influence Combination for Social
Chatter Intensity Prediction.
With Subhabrata Dutta, Sarah Masud and Tanmoy Chakraborty.
- Deep Neural Matching Models for Graph Retrieval.
With Utkarsh Gupta, Kunal Goyal, and Abir De.
- IMOJIE: Iterative
Memory-Based Joint Open Information Extraction.
With Keshav Kolluru, Samarth Aggarwal,
Vipul Rathore, and Mausam. ACL 2020.
- Neural Architecture
for Question Answering Using a Knowledge Graph and Web Corpus.
With Uma Sawant, Saurabh Garg, and Ganesh Ramakrishnan.
Information Retrieval Journal, 2019. Presented at ECIR 2020.
- Analysis of reference and citation copying in evolving bibliographic networks. With Pradumn Kumar Pandey, Mayank Singh, Pawan Goyal and Animesh Mukherjee. Journal of Informetrics, 2020.
Computing Entity Relatedness in Wikipedia, with Applications.
With Marco Ponza and Paolo Ferragina. Knowledge-Based Systems,
Linear Influence Models in Social Networks from Transient Opinion
Dynamics. With Abir De, Sourangshu Bhattacharya, Parantapa
Bhattacharya, and Niloy Ganguly. ACM TWEB
version in CIKM 2014.
Program Induction for KBQA Without Gold Programs or Query
Annotations. With Ghulam Ahmed Ansari, Amrita Saha, Vishwajeet
Kumar, Mohan Bhambhani and Karthik Sankaranarayanan. IJCAI
- A Deep Generative
Model for Code-Switched Text. With Bidisha Samanta, Sharmila
Reddy, Hussain Jagirdar and Niloy Ganguly. IJCAI 2019.
Sentiment Detection via Label Transfer from Monolingual
to Synthetic Code-Switched Text.
With Bidisha Samanta and Niloy Ganguly. ACL 2019.
Sensitive Attention on Generic Corpora Corrects Sense Bias
in Pretrained Embeddings.
With Vihari Piratla and Sunita Sarawagi. ACL 2019.
- Complex Program Induction for Querying Knowledge
Bases in the Absence of Gold Programs. With Amrita Saha,
Ahmed Ansari, Abhishek Laddha and Karthik Sankaranarayanan.
Learning for Target-dependent Sentiment Classification.
With Divam Gupta, Kushagra Singh, and Tanmoy
Chakraborty. PAKDD 2019.
- Automated Early
Leaderboard Generation From Comparative Tables. With Mayank
Singh, Rajdeep Sarkar, Atharva Vyas, Pawan Goyal, and Animesh
Mukherjee. ECIR 2019.
- GIRNet: Interleaved
Multi-Task Recurrent State Sequence Models. With Divam
Gupta and Tanmoy Chakraborty. AAAI 2019.
Knowledge Base Inference Without Explicit Type Supervision.
With Prachi Jain, Pankaj Kumar, and Mausam. ACL 2018.
the Effect of Out-of-Vocabulary Entity Pairs in Matrix Factorization
for KB Inference. With Prachi Jain, Shikhar Murty, and Mausam.
- New Embedded
Representations and Evaluation Protocols for Inferring
Transitive Relations. With Sandeep
question answering using a knowledge graph and Web corpus.
With Uma Sawant and Ganesh Ramakrishnan. ACM SIGWEB Newsletter
Across Domains via Cross-Gradient Training. With Shiv
Shankar, Vihari Piratla, Siddhartha Chaudhuri, Preethi Jyothi,
and Sunita Sarawagi. ICLR
Representation Learning for Web-scale Entity Disambiguation.
With Rijula Kar, Susmija Reddy, Sourangshu Bhattacharya and
- A Two-Stage
Framework for Computing Entity Relatedness in Wikipedia. With
Marco Ponza and Paolo Ferragina.
- Relay-Linking Models for Prominence and Obsolescence in Evolving
Networks [paper, video].
With Mayank Singh, Rajdeep Sarkar, Pawan Goyal, and
Animesh Mukherjee. SIGKDD 2017.
- Earth Mover Distance Pooling
over Siamese LSTMs for Automatic Short Answer Grading.
With Sachin Kumar and Shourya Roy. IJCAI 2017.
Entity Resolution with Multi-Focal Attention.
With Amir Globerson, Nevena Lazic, Amarnag Subramanya, Michael Ringgaard
and Fernando Pereira. ACL 2016.
- Discriminative Link Prediction using Local, Community, and Global Signals.
With Abir De, Sourangshu Bhattacharya, Sourav Sarkar and Niloy Ganguly.
IEEE TKDE Journal, 2016.
Graph and Corpus Driven Segmentation and
Answer Inference for Telegraphic Entity-seeking Queries.
With Mandar Joshi and Uma Sawant.
Queries on Web Tables: Annotation, Response and Consensus Models.
With Sunita Sarawagi.
- Discriminative Link Prediction using Local Links, Node Features and Community Structure.
With Abir De and Niloy Ganguly. ICDM 2013.
Bootstrapping of Corpus Annotations and Entity Types.
With Siddhanth Jain and Hrushikesh Mohapatra.
- Web-scale Entity Annotation Using MapReduce.
With Shashank Gupta and Varun Chandramouli.
- Learning Joint Query Interpretation
and Response Ranking. With Uma Sawant.
- Compressed Data Structures for Annotated
Web Search. With
Sasidhar Kasturi, Bharath Balakrishnan,
Ganesh Ramakrishnan, and Rohit Saraf. WWW 2012.
- Diversity in ranking via
resistive graph centers.
With Avinava Dubey and Chiru Bhattacharyya.
SIGKDD 2011. (Source code
is available, contact Avinava Dubey for usage details.)
- SCAD: Collective Discovery of Attribute Values.
With Anton Bakalov, Ariel Fuxman, and Partha Talukdar.
- Index Design and Query Processing for Graph Conductance Search. With Amit Pathak and Manish Gupta.
VLDB Journal, 2010.
- Annotating and Searching Web Tables Using Entities, Types and
Relationships. With Girija Limaye and Sunita Sarawagi.
- Conditional Models for
Non-smooth Ranking Loss Functions.
With Avinava Dubey, Jinesh Machchhar, and Chiru Bhattacharyya.
ICDM 2009, Miami.
- Learning to rank for
quantity consensus queries.
With Somnath Banerjee and Ganesh Ramakrishnan.
SIGIR 2009, Boston.
- Collective annotation of
Wikipedia entities in Web text.
With Sayali Kulkarni, Amit Singh and Ganesh Ramakrishnan.
SIGKDD 2009, Paris.
- Text search enhanced with types and entities. Chapter in
Text Mining: Theory, Application, and Visualization,
Srivastava and Sahami, eds., 2008.
closed form bounds on the partition function.
With Dvijotham Krishnamurthy and Subhasis Chaudhuri.
ECML/PKDD 2008, Antwerp.
Winner of the best
student paper award.
- Structured Learning
for Non-Smooth Ranking Losses.
With Rajiv Khanna, Uma Sawant and Chiru Bhattacharyya.
SIGKDD 2008, Las Vegas.
- Learning to rank in vector spaces and social networks.
- Focused Web Crawling. Entry in the
Database Systems, 2008.
influence of search engines on preferential attachment.
With Alan Frieze and Juan Vera.
Internet Mathematics, volume 3, number 3 (2006–2007), pages 361–381.
A preliminary version
appeared in SODA 2005.
- Learning Random Walks to Rank
Nodes in Graphs. With Alekh Agarwal.
- Dynamic Personalized Pagerank
in Entity-Relation Graphs.
WWW 2007, Banff.
- Accelerating Newton optimization for
log-linear models through feature redundancy. With Arpit Mathur.
IEEE ICDM 2006,
- Learning parameters in entity-relationship
graphs from ranking preferences. With Alekh Agarwal.
- Learning to rank networked entities.
With Alekh Agarwal and Sunny Aggarwal.
SIGKDD Conference 2006,
Scoring Functions and Indexes for Proximity Search in Type-annotated
Corpora. With Kriti Puniyani and Sujatha Das.
WWW 2006, Edinburgh.
Answer Type Inference from Questions using Sequential Models.
With Vijay Krishnan and Sujatha Das.
- Bidirectional Expansion For Keyword Search on Graph Databases.
With Varun Kacholia, Shashank Pandit, S. Sudarshan,
Rushi Desai and Hrishikesh Karambelkar. VLDB 2005.
- Shuffling a Stacked Deck: The Case for Partially Randomized
Ranking of Search Engine Results.
With Sandeep Pandey, Sourashis Roy, Chris Olston, and Junghoo Cho.
- Is question answering an
With Ganesh Ramakrishnan, Deepa Paranjpe, and
New York City.
- Fast and accurate text classification
via multiple linear discriminant projections.
With Shourya Roy and Mahesh Soundalgekar.
VLDB Journal, 12(2), pages 170–185
[conference version, talk slides].
Learning Probabilistic Mappings Between Topics.
With Sunita Sarawagi and Shantanu Godbole.
SIGKDD Conference 2003,
- Monitoring the Dynamic Web
to respond to Continuous Queries.
With Sandeep Pandey and Krithi Ramamritham.
Budapest, Hungary, May 2003.
- Accelerated focused
crawling through online relevance feedback.
With Kunal Punera and Mallela Subramanyam.
WWW 2002, Hawaii.
- The structure of
broad topics on the Web.
With Mukul Joshi, Kunal Punera, and David M. Pennock.
WWW 2002, Hawaii.
Searching and Browsing in Databases using BANKS.
With Gaurav Bhalotia, Charuta Nakhe, Arvind Hulgeri, and S. Sudarshan.
In ICDE 2002. Also see the BANKS
home page. Winner of
the ICDE 2012
influential paper award.
topic distillation using text, markup tags, and hyperlinks.
With Mukul M. Joshi and Vivek B. Tawde.
In SIGIR 2001
Document Object Model with hyperlinks for enhanced
topic distillation and information extraction.
In the 10th International World Wide Web
Conference, Hong Kong, May 2001.
- Memex: A browsing assistant
for collaborative archiving and mining of surf trails.
With Sandeep Srivastava, Mallela Subramanyam and Mitul Tiwari.
Demo at VLDB 2000.
Data mining for hypertext:
A tutorial survey.
Explorations, 1(2), pages 1–11, 2000.
Memex to archive and mine community Web browsing experience.
With Sandeep Srivastava, Mallela Subramanyam and Mitul Tiwari.
In the 9th International World Wide Web
Conference, Amsterdam, May 2000.
bookmarking companies founded long after this paper:
Furl, Simpy, Citeulike, etc.
the Web's Link Structure. With Byron E. Dom, S. Ravi Kumar,
Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, David Gibson,
and Jon Kleinberg. In
vol. 32, no. 8, August 1999
Distributed Hypertext Resource
Discovery Through Examples.
With Martin van den Berg and Byron Dom.
VLDB 1999, Edinburgh, Scotland.
the Web. With
Byron Dom, S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan,
Andrew Tomkins, Jon M. Kleinberg, and David Gibson.
Invited paper in Scientific American,
the Web Backwards. With D. A. Gibson and K. S. McCurley.
In WWW 1999.
Focused crawling: A
new approach to topic-specific Web resource discovery. With M. van
den Berg and B. Dom. WWW 1999,
Toronto, May 1999. Winner of the
best paper award. Also see the
Upcoming and past talks and meetings
- Learning New Type Representations from Knowledge Graphs.
Keynote talk at KG4IR
- Tutorial on knowledge extraction and inference from text.
Subset of CIKM 2017 tutorial, at
- Answering questions: The shallow and the deep.
TIFR STCS seminar.
Blue Sky seminar, June 2018. Interview.
with Partha Talukdar
at CIKM 2017 on Knowledge
Extraction and Inference from Text.
- Keynote talk at CoDS
2017, Chennai, March 2017.
- Keynote talk at
Industry Track, Nov 2014.
- Keynote talk at
COMSNETS 2014, Bangalore, Jan 2014.
- Tutorial on Query Interpretation and
Representation for Searching the Web of Objects at
WWW 2013, Rio de Janeiro.
- WWW 2010 Conference, NC, April 2010.
talk at WSDM 2010, NYC, February 2010.
- WWW 2010 PC meeting, Salt Lake City, Utah,
- WWW 2009
- SIGIR 2008 PC meeting, University of Maryland, March 2008.
- WSDM 2008, Stanford University, February 2008.
- Tutorial on Learning to rank in vector
spaces and social networks at WWW
- Keynote talk at WAW
and a short
course at Banff, Nov 2006.
- Invited talk at the
on Intelligent Information Access, Helsinki, July 2006.
- Invited talk at the ICML 2005 workshop on Learning in Web Search.
- Invited talk at the ICML 2005 workshop on
Learning and Extending Lexical Ontologies
by using Machine Learning Methods.
- Panel discussion on exploiting dynamic
networking effects in Web advertising at
- Invited talk and position paper at
in Pisa, Sept. 2004.
- Short course on
machine learning for hypertext applications at
in Saarbrücken, Sept. 2004.
structures in data mining. A tutorial presented at
2004 with Christos
Text search for
fine-grained semi-structured data.
A tutorial presented at VLDB 2002.
Beyond hubs and authorities: spreading out and zooming in.
Invited talk at
ICDT International Workshop
on Web Dynamics, London, Jan. 2001.
Data Mining and Learning on the Web. NIPS Workshop, Denver,
Dec. 2000. By invitation.
content-based collaborative communities on the Web.
Invited talk at the Joint
Conference on Empirical Methods in Natural Language Processing and
Very Large Corpora
(EMNLP/VLC), Hong Kong, Oct. 7–8, 2000.
Hypertext data mining:
A tutorial presented at the
Conference, Boston, August 2000.
- Hypertext databases and hypertext data mining.
SIGMOD 1999 Tutorial.
- Determining NCCs
and/or using the NCCs to adapt performance of computer-based
Method and system for searching unstructured
textual data for quantitative answers to queries.
System and method for scheduling web servers with a
quality-of-service guarantee for each user.
System and method for focussed web crawling.
Enhanced hypertext categorization using hyperlinks.
Method for interactively creating an information database including
preferred information elements, such as, preferred-authority,
world wide web pages.
Method for cataloging, filtering, and relevance ranking frame-based
hierarchical information structures.
Method and system for filtering of information entities.
- Method and system
for distributed autonomous maintenance of bidirectional hyperlink
metadata on the web and similar hypermedia repository.
Feature diffusion across hyperlinks.
System and method for mining surprising temporal patterns.
- System and method
for dynamic index-probe optimizations for high-dimensional
Multilevel taxonomy based on features derived from training documents
classification using fisher values as discrimination values.
Links in areas of interest