Soumen Chakrabarti

Contact information

I am SOUMEN CHAKRABARTI, anagram for ANARCHISM OUTBREAK, a faculty member in the Department of Computer Science.

If you are from industry looking for consultation, please visit our research and development site, my informal notes, and a sample mutual NDA.

If you are looking to join CSE@IITB as a PhD scholar, please read about the standard operating procedure and the PhD Qualifier model being adopted by the department, and contact the department office directly. PhD admissions are centrally coordinated at the department level.

I do not offer short-term projects or summer internships to students not enrolled at IIT Bombay. Such emails will be discarded.

If you are an IIT student looking for a project or seminar within the scope of your program (BTech, DD, MTech), please read these guidelines first. You can check my calendar for free slots and, if you have permission, propose a meeting here or by email.

The best way to contact me is to send mail to (please note that I am on a low-spam diet). Please use only email to initiate a conversation with me if we haven't communicated before. Only in case of an emergency, you can call me at +91-22-2576-7716 or fax me at +91-22-2572-0022. If you are visiting, here are directions to my office.

Education and career

Don Bosco School, Park Circus, Calcutta, 1975–1987 (memoirs).
Indian Institute of Technology, Kharagpur, 1987–1991.
University of California, Berkeley, 1991–1996.
IBM Almaden Research Center, 1996–1999.
IIT Bombay, 1999–present.
Carnegie Mellon University, Spring 2004.
Google, Mountain View, 2014–2016.

Current research interests

Representation learning for graph search: We are exploring how to go beyond shallow graph neural networks to represent nodes, edges and graphs for better link prediction and searching a corpus of graphs with a query graph with trainable notions of subgraph isomorphism.
Better embedding representation for entities, types, relations, and time: We are studying how to embed entities, types, relations and time to infer new edges in regular and temporal knowledge graphs, and their application to (temporal) question answering.
Complex multi-modal question answering: With IBM Research, I am exploring how to translate complex queries involving knowledge base access, arithmetic and logical operations into structured programs with memory.
Code-switched text analysis: Indian languages borrow heavily from English, resulting in "code switching" languages like Hindlish, Benglish, etc., the lingua franca of social media. We are investigating how to improve standard NLP tasks by generating synthetic code-switched text, and designing multi-task low-supervision recurrent networks.

[The World Wide Web is] the only thing I know of whose shortened form — WWW — takes three times longer to say than what it's short for.
—Douglas Adams

Past projects

Searching the annotated Web with entities, types and relations: We built CSAW, a search system that integrates type and role annotations with keyword matches, thereby exploiting lexical ontologies and entity taggers within an information retrieval system.
Graph conductance search: Rich connections between random walks, graph eigensystems, and electrical networks make it attractive to apply them for ranking nodes. PageRank is a prominent example of the paradigm. In PageRank, the edge weights are fixed and we have to compute steady state probabilities of nodes. What if we have something like the opposite problem? And how to make this fast at query time? Supported by IBM and Microsoft (2007, 2008).
Integrating IR with databases: In the BANKS project, we proposed new paradigms of keyword search in graphs that can represent text embedded in relational or XML-like data.
The effect of search engines on the Web graph and page popularity: Search engines are influenced by the (in)degree of Web pages, but their ranked lists modulate page popularity and eventually their (in)degree, setting up a feedback to some degree. Might the evolution of the Web graph be influenced substantially by the existence of search engines? Is there a need to regulate monopolies? What are healthy economic objectives, and how to optimize them?
Focused crawlers to build topic-specific portals: A focused crawler collects a topic-specific subgraph of the Web by coupling classifiers and reinforcement learners with crawlers. An open-source focused crawler project was started at the Lab. for Intelligent Internet Research and is available.
Mining hypertext to estimate topics and popularity: I built a hypertext classifier that uses the text in and links around a given Web page to label it with a topic. This was an early application of Markov networks to Web analysis. As a member of the IBM Clever Project, I worked on algorithms to analyze the links around a web page and the text in pages that cite the given page to assign it a measure of popularity.
Compiling and running parallel scientific programs: In a previous life, my PhD thesis was on the design and implementation of compilers and runtime systems for distributed memory multiprocessors. Seems like distributed parallel computing is hot again, thanks to "Big Data"!
Downloads: Recent papers are listed below with accompanying git repo links. Some older software can be found here.
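
As an aside on the PageRank setting mentioned under graph conductance search above (fixed edge weights, steady-state node probabilities), here is a minimal power-iteration sketch. It is an illustration only, not code from any of the projects listed. The function name `pagerank`, the toy graph, the damping value 0.85 and the uniform handling of dangling nodes are all assumptions made for the example.

```python
def pagerank(edges, damping=0.85, tol=1e-10, max_iter=1000):
    """Approximate steady-state visit probabilities on a weighted digraph.

    edges: dict mapping node -> {successor: fixed positive edge weight}.
    damping: probability of following an edge rather than teleporting
             uniformly (0.85 is a conventional choice, assumed here).
    """
    nodes = sorted(set(edges) | {v for nbrs in edges.values() for v in nbrs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}  # start from the uniform distribution
    for _ in range(max_iter):
        # Teleportation mass is shared equally by all nodes.
        new = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            out = edges.get(u, {})
            total = sum(out.values())
            if total == 0:
                # Dangling node: spread its mass uniformly (an assumption).
                for v in nodes:
                    new[v] += damping * rank[u] / n
            else:
                # Follow each out-edge in proportion to its fixed weight.
                for v, w in out.items():
                    new[v] += damping * rank[u] * w / total
        delta = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical three-node graph: a -> b, a -> c, b -> c, c -> a.
toy = {"a": {"b": 1.0, "c": 1.0}, "b": {"c": 1.0}, "c": {"a": 1.0}}
scores = pagerank(toy)
```

In this toy graph, node c receives mass from both a and b and ends up with the highest score. The scores always sum to one because each iteration redistributes the full probability mass.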

Professional activity

Journal editorship

Foundations and Trends in Information Retrieval (a review journal started in 2006) 2006–2022.
ACM Transactions on the Web, 2006–2011.
Data Mining and Knowledge Discovery (DMKD) Journal, Area Action Editor for Text and Web Mining, 2003–2005.
IEEE Transactions on Knowledge and Data Engineering (TKDE), Special Issue on Mining and Searching the Web, Guest Editor.

Conference/workshop organization

Linguistics Meets Image and Video Retrieval. Workshop at ICCV 2019. Co-organizer.
IJCAI 2019, area chair.
WWW 2017, poster track co-chair with Mounia Lalmas and Wei Chen.
CIKM 2014, area chair for text and Web data mining.
EMNLP 2013, area chair for information retrieval and question answering.
WWW 2013, track chair for search, systems and applications.
SIGIR 2011, area chair for Web IR and social media search.
WWW 2010, program co-chair with Juliana Freire.
SIGIR 2010, senior PC member.
Web Search APIs: The Next Generation — A panel discussion at WWW 2009. Panel slides.
SIGIR 2009, Area Chair, Machine Learning for IR.
WSDM 2008 ("wisdom"), Program Co-chair with Andrei Broder.
VLDB 2007, Tutorial Co-Chair.
ECML-PKDD 2006, Area Chair, Track for mining links, graphs, trees and high-dimensional data.
WWW 2006, Deputy Chair, Data Mining track.
COMAD 2005b, Associate Program Chair.
WWW 2003, Vice Chair, Searching and Mining track.
ICDE 2003, Vice Chair, Data, Text and Web Mining track.
WWW 2002, Deputy Chair, Searching, Querying and Indexing track (CFP).

Conference/journal committee/reviewing

NeurIPS 2025 (area chair), NeurIPS 2024 (area chair), NeurIPS 2023 (area chair), EMNLP 2022, ARR 2020-, TACL 2020-2022, WSDM 2021 (senior PC), NeurIPS 2020, EMNLP 2020, ACL 2020, IJCAI 2020 (senior PC), AAAI 2020, EMNLP 2019, IJCAI 2019, ICML 2019, NeurIPS 2018, ICML 2018, NAACL 2018, WSDM 2018 (test of time awards), SIGIR 2017 (awards), SIGKDD 2017 (awards), WSDM 2017 (awards), NIPS 2017, ACL 2017; NIPS 2016, SIGIR 2016; CIKM 2014, ISWC 2014, SIGIR 2014, ACL 2014, WSDM 2014 (senior PC); SIGKDD 2013 (senior PC), WSDM 2013 (senior PC and awards committee); EMNLP 2012, SIGKDD 2012 (senior PC), WWW 2012; NIPS 2011, ICML 2011 (PC and invited applications talks committee), WWW 2011; SIGKDD 2010; NIPS 2009, WWW 2009, WSDM 2009 (senior PC); SIGKDD 2008 (senior PC), SIGIR 2008 (senior PC), WWW 2008; WWW 2007, SIGMOD 2007; SIGKDD 2006 (senior PC); EMNLP/HLT 2005, SIGKDD 2005, WWW 2005 (panel), SIGMOD 2005; SIGKDD 2004, SIGIR 2004, VLDB 2004, WWW 2004, ICDE 2004; SIGIR 2003, SIGKDD 2003, VLDB 2003 (IIS), SODA 2003; SIGIR 2002, ICDE 2002; SIGIR 2001, WWW 2001; WWW 2000; SIGKDD 1999; SIGKDD 1998.

Other

Web Search and Data Mining (WSDM) steering committee member, 2008–2013.
ACM SIGKDD Curriculum Committee Member.

Courses

But the power of instruction is seldom of much efficacy, except in those happy dispositions where it is almost superfluous.
—Edward Gibbon,
The Decline And Fall Of The Roman Empire
Volume 1, Chapter 4.

To reduce administrative overhead, we will continue to use the existing course codes CS635 (Autumn) and CS728 (Spring), but since ChatGPT came out in late 2022, we have again revamped the courses with new content and removed some outdated material. CS635 is a soft prerequisite for CS728, but this is not enforced. Offerings: 2023.1S.CS728, 2023.2A.CS635, 2024.1S.CS728.
Web Search and Mining has been expanded to a two-semester sequence, shorthanded WMa (CS635, Autumn) and WMb (CS728, Spring). WMa retains the old course code, but has been planned from scratch. WMb will be largely about information extraction and integration, and querying over semistructured and graphical data representations. WMa Autumn 2009, WMb Spring 2010, WMa Autumn 2010, WMb Spring 2011, WMa Autumn 2011, WMa Spring 2013, WMa Autumn 2013, WMb Spring 2014, WMa Autumn 2016, WMb Spring 2017, WMa Autumn 2017, WMb Spring 2018, WMb Spring 2019, WMa Autumn 2019, WMb Spring 2020 (partly online), WMa Autumn 2020 (online), WMb Spring 2021 (online), WMa Autumn 2021 (online), WMb Spring 2022 prereq reading (online/hybrid), WMa Autumn 2022 (in-person), WMb Spring 2023.
Statistical Foundations of Machine Learning: Autumn 2005, Autumn 2006, Autumn 2007, Autumn 2008.
Web Search and Mining (earlier called Information Retrieval and Mining for Hypertext and the Web): Spring 2001, Spring 2002, Spring 2003, Spring 2005, Spring 2006 (new improved), Spring 2007, Spring 2008, Spring 2009.
Undergraduate Programming Languages, Spring 2000, Autumn 2000, Autumn 2001, Autumn 2002, Autumn 2003, Autumn 2004.
Computer programming and utilization aka CS101, Spring 2012.
Undergrad software lab: Autumn 2018.
Graduate software lab: Autumn 1999, Autumn 2000.

... your work is to keep cranking the flywheel that turns the gears
that spin the belt in the engine of belief that keeps you and your desk in midair
—Annie Dillard,
The Writing Life.

Publications

Google Scholar, DBLP, arXiv, ResearchGate, SemanticScholar.

Position: Neural Approximation Is Rarely Justified for Hard Combinatorial Problems. With Pritish Chakraborty, Indradyumna Roy, and Abir De. ICML 2026.
A Dense Subset Index for Collective Query Coverage. With Kartik Nair, Pritish Chakraborty, Atharva Abhijit Tambat, Indradyumna Roy, Anirban Dasgupta, and Abir De. ICLR 2026.
Exchangeability of GNN Representations with Applications to Graph Retrieval. With Kartik Nair, Indradyumna Roy, Anirban Dasgupta, and Abir De. ICLR 2026. Oral paper.
Position: Graph Matching Systems Deserve Better Benchmarks. With Indradyumna Roy, Saswat Meher, Eeshaan Jain, Abir De. ICML 2025.
Contextual Tokenization for Graph Inverted Indices. With Pritish Chakraborty, Indradyumna Roy, and Abir De. NeurIPS 2025.
Dense Retrieval with Quantity Comparison Intent. With Prayas Agrawal, Nandeesh Kumar K M, Muthusamy Chelliah, Surender Kumar. ACL Findings 2025.
MutantPrompt: Prompt Optimization via Mutation Under a Budget on Modest-sized LMs. With Arijit Nag, Animesh Mukherjee, Niloy Ganguly. ACL Findings 2025.
Efficient Continual Pre-training of LLMs for Low-resource Languages. With Arijit Nag, Animesh Mukherjee, Niloy Ganguly. NAACL 2025 (Industry Track).
Diverse In-Context Example Selection After Decomposing Programs and Aligned Utterances Improves Semantic Parsing. With Mayank Kothyari, Sunita Sarawagi, Srujana Merugu, Gaurav Arora. NAACL 2025. Oral paper.
Charting the Design Space of Neural Graph Representations for Subgraph Matching. With Vaibhav Raj, Indradyumna Roy, Ashwin Ramachandran, Abir De. ICLR 2025.
Clique Number Estimation via Differentiable Functions of Adjacency Matrix Permutations. With Indradyumna Roy, Eeshaan Jain, Abir De. ICLR 2025.
Graph Regularized Encoder Training for Extreme Classification. With Anshul Mittal, Shikhar Mohan, Deepak Saini, Siddarth Asokan, Suchith C. Prabhu, Lakshya Kumar, Pankaj Malhotra, Jian Jiao, Amit Singh, Sumeet Agarwal, Purushottam Kar and Manik Varma. WebConf (industry track) 2025. Oral paper.
Graph Representation of Tables+Text and Compact Subgraph Retrieval for QA Tasks. With Vishwajeet Kumar, Jaydeep Sen, Bhawna Chelani. ECIR 2025.
Graph Edit Distance with General Costs Using Neural Set Divergence. With Eeshaan Jain, Indradyumna Roy, Saswat Meher and Abir De. NeurIPS 2024. A preliminary version appears in LoG 2024.
Graph Edit Distance Evaluation Datasets: Pitfalls and Mitigation. With Eeshaan Jain, Indradyumna Roy, Saswat Meher, Abir De. LoG 2024.
Iteratively Refined Early Interaction Alignment for Subgraph Matching based Graph Retrieval. With Ashwin Ramachandran, Vaibhav Raj, Indradyumna Roy and Abir De. NeurIPS 2024.
Cost-Performance Optimization for Processing Low-Resource Language Tasks Using Commercial LLMs. With Arijit Nag, Animesh Mukherjee, and Niloy Ganguly. EMNLP Findings, 2024.
How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning. With Subhabrata Dutta, Joykirat Singh, Tanmoy Chakraborty. TMLR 2024.
Frugal LMs Trained to Invoke Symbolic Solvers Achieve Parameter-Efficient Arithmetic Reasoning. With Subhabrata Dutta, Joykirat Singh, Ishan Pandey, Sunny Manchanda and Tanmoy Chakraborty. AAAI 2024. Oral paper. code
CRUSH4SQL: Collective Retrieval Using Schema Hallucination For Text2SQL. With Mayank Kothyari, Dhruva Dhingra, and Sunita Sarawagi. EMNLP 2023. code
Small Language Models Fine-Tuned for Decomposition and Solution Improve Complex Reasoning. With Gurusha Juneja, Subhabrata Dutta, Sunny Manchanda and Tanmoy Chakraborty. EMNLP 2023.
Locality Sensitive Hashing in Fourier Frequency Domain For Soft Set Containment Search. With Indradyumna Roy, Rishi Agarwal, Anirban Dasgupta, and Abir De. NeurIPS 2023. Spotlight paper.
mOKB6: A Multilingual Open Knowledge Base Completion Benchmark. With Shubham Mittal, Keshav Kolluru and Mausam. ACL 2023.
Entropy-guided Vocabulary Augmentation of Multilingual Language Models for Low-resource Tasks. With Arijit Nag, Bidisha Samanta, Animesh Mukherjee, and Niloy Ganguly. ACL Findings, 2023.
Multi-Row, Multi-Span Distant Supervision For Table+Text Question Answering. With Vishwajeet Kumar, Saneem Chemmengath, Yash Gupta, Jaydeep Sen, Samarth Bharadwaj and Feifei Pan. ACL 2023.
TwiRGCN: Temporally Weighted Graph Convolution for Question Answering over Temporal Knowledge Graphs. With Aditya Sharma, Apoorv Saxena, Chitrank Gupta, Seyed Mehran Kazemi, and Partha Talukdar. EACL 2023.
Structured Case-based Reasoning for Inference-time Adaptation of Text-to-SQL parsers. With Abhijeet Awasthi and Sunita Sarawagi. AAAI 2023.
Joint Completion and Alignment of Multilingual Knowledge Graphs. With Harkanwar Singh, Shubham Lohiya, Prachi Jain and Mausam. EMNLP 2022. A preliminary version appeared in AKBC 2021. arXiv version. code
Maximum Common Subgraph Guided Graph Retrieval: Late and Early Interaction Networks. With Indradyumna Roy and Abir De. NeurIPS 2022.
Neural Estimation of Submodular Functions with Applications to Differentiable Subset Selection. With Abir De. NeurIPS 2022.
Transfer Learning for Low Resource Multilingual Relation Classification. With Arijit Nag, Bidisha Samanta, Animesh Mukherjee and Niloy Ganguly. TALLIP 2022. A preliminary version appeared in CoNLL 2021. data
VarScene: A Deep Generative Model for Realistic Scene Graph Synthesis. With Tathagat Verma, Abir De, Yateesh Agrawal, and Vishwa Vinay. ICML 2022.
Incomplete Gamma Integrals for Deep Cascade Prediction using Content, Network, and Exogenous Signals. With Subhabrata Dutta, Shravika Mittal, Dipankar Das, and Tanmoy Chakraborty. IEEE TKDE 2022.
AIT-QA: Question Answering Dataset over Complex Tables in the Airline Industry. With Yannis Katsis, Saneem Ahmed Chemmengath, Vishwajeet Kumar, Samarth Bharadwaj, Mustafa Canim, Michael Glass, Alfio Gliozzo, Feifei Pan, Jaydeep Sen, and Karthik Sankaranarayanan. NAACL 2022. data
Alignment-Augmented Consistent Translation for Multilingual Open Information Extraction. With Keshav Kolluru, Muqeeth M, Shubham Mittal, and Mausam. ACL 2022.
Interpretable Neural Subgraph Matching for Graph Retrieval. With Indradyumna Roy, Venkata Sai Velugoti and Abir De. AAAI 2022.
Semi-supervised stance detection of tweets via distant network supervision. With Subhabrata Dutta, Samiya Caur, and Tanmoy Chakraborty. WSDM 2022.
Active Assessment of Prediction Services as Accuracy Surface Over Attribute Combinations. With Vihari Piratla and Sunita Sarawagi. NeurIPS 2021. code
Redesigning the Transformer Architecture with Insights from Multi-particle Dynamical Systems. With Subhabrata Dutta, Tanya Gautam, and Tanmoy Chakraborty. NeurIPS 2021. Spotlight paper.
T3QA: Topic Transferable Table Question Answering. With Saneem Chemmengath, Vishwajeet Kumar, Samarth Bharadwaj, Jaydeep Sen, Mustafa Canim, Alfio Gliozzo and Karthik Sankaranarayanan. EMNLP 2021.
A Data Bootstrapping Recipe for Low-Resource Multilingual Relation Classification. With Arijit Nag, Bidisha Samanta, Animesh Mukherjee and Niloy Ganguly. CoNLL 2021.
Multilingual Knowledge Graph Completion With Joint Relation and Entity Alignment. With Harkanwar Singh, Prachi Jain, Sharod Roy Choudhury, and Mausam. AKBC 2021.
Integrating Transductive and Inductive Embeddings Improves Link Prediction Accuracy. With Chitrank Gupta, Yash Jain, and Abir De. CIKM 2021.
Question Answering over Temporal Knowledge Graphs. With Apoorv Saxena and Partha Talukdar. ACL 2021. code trackback
Select, Substitute, Search: A New Benchmark for Knowledge-Augmented Visual Question Answering. With Aman Jain, Mayank Kothyari, Vishwajeet Kumar, Preethi Jyothi, and Ganesh Ramakrishnan. SIGIR 2021. code
Joint Autoregressive and Graph Models for Software and Developer Social Networks. With Rima Hazra, Hardik Aggarwal, Pawan Goyal, and Animesh Mukherjee. ECIR 2021. (Data.)
Adversarial Permutation Guided Node Representations for Link Prediction. With Indradyumna Roy and Abir De. AAAI 2021.
Differentially Private Link Prediction With Protected Connections. With Abir De. AAAI 2021.
Temporal Knowledge Base Completion: New Algorithms and Evaluation Protocols. With Prachi Jain, Sushant Rathi, and Mausam. EMNLP 2020. code
OpenIE6: Iterative Grid Labeling and Coordination Analysis for Open Information Extraction. With Keshav Kolluru, Vaibhav Adlakha, Samarth Aggarwal and Mausam. EMNLP 2020. code
NLP Service APIs and Models for Efficient Registration of New Clients. With Sahil Shah, Vihari Piratla, and Sunita Sarawagi. EMNLP Findings 2020.
Deep Exogenous and Endogenous Influence Combination for Social Chatter Intensity Prediction. With Subhabrata Dutta, Sarah Masud and Tanmoy Chakraborty. SIGKDD 2020.
Deep Neural Matching Models for Graph Retrieval. With Utkarsh Gupta, Kunal Goyal, and Abir De. SIGIR 2020.
Interpretable complex question answering. WebConf 2020.
IMOJIE: Iterative Memory-Based Joint Open Information Extraction. With Keshav Kolluru, Samarth Aggarwal, Vipul Rathore, and Mausam. ACL 2020. code
Neural Architecture for Question Answering Using a Knowledge Graph and Web Corpus. With Uma Sawant, Saurabh Garg, and Ganesh Ramakrishnan. Information Retrieval Journal, 2019. Presented at ECIR 2020.
Analysis of reference and citation copying in evolving bibliographic networks. With Pradumn Kumar Pandey, Mayank Singh, Pawan Goyal and Animesh Mukherjee. Journal of Informetrics, 2020.
On Computing Entity Relatedness in Wikipedia, with Applications. With Marco Ponza and Paolo Ferragina. Knowledge-Based Systems, 2020. code, data
Learning Linear Influence Models in Social Networks from Transient Opinion Dynamics. With Abir De, Sourangshu Bhattacharya, Parantapa Bhattacharya, and Niloy Ganguly. ACM TWEB 2019. Preliminary version in CIKM 2014.
Neural Program Induction for KBQA Without Gold Programs or Query Annotations. With Ghulam Ahmed Ansari, Amrita Saha, Vishwajeet Kumar, Mohan Bhambhani and Karthik Sankaranarayanan. IJCAI 2019. code
A Deep Generative Model for Code-Switched Text. With Bidisha Samanta, Sharmila Reddy, Hussain Jagirdar and Niloy Ganguly. IJCAI 2019. code
Improved Sentiment Detection via Label Transfer from Monolingual to Synthetic Code-Switched Text. With Bidisha Samanta and Niloy Ganguly. ACL 2019.
Topic Sensitive Attention on Generic Corpora Corrects Sense Bias in Pretrained Embeddings. With Vihari Piratla and Sunita Sarawagi. ACL 2019.
Complex Program Induction for Querying Knowledge Bases in the Absence of Gold Programs. With Amrita Saha, Ahmed Ansari, Abhishek Laddha and Karthik Sankaranarayanan. TACL 2019.
Multi-task Learning for Target-dependent Sentiment Classification. With Divam Gupta, Kushagra Singh, and Tanmoy Chakraborty. PAKDD 2019.
Automated Early Leaderboard Generation From Comparative Tables. With Mayank Singh, Rajdeep Sarkar, Atharva Vyas, Pawan Goyal, and Animesh Mukherjee. ECIR 2019.
GIRNet: Interleaved Multi-Task Recurrent State Sequence Models. With Divam Gupta and Tanmoy Chakraborty. AAAI 2019. code
Type-Sensitive Knowledge Base Inference Without Explicit Type Supervision. With Prachi Jain, Pankaj Kumar, and Mausam. ACL 2018. code
Mitigating the Effect of Out-of-Vocabulary Entity Pairs in Matrix Factorization for KB Inference. With Prachi Jain, Shikhar Murty, and Mausam. IJCAI 2018. code
New Embedded Representations and Evaluation Protocols for Inferring Transitive Relations. With Sandeep Subramanian. SIGIR 2018.
Open-domain question answering using a knowledge graph and Web corpus. With Uma Sawant and Ganesh Ramakrishnan. ACM SIGWEB Newsletter (invited), 2018.
Generalizing Across Domains via Cross-Gradient Training. With Shiv Shankar, Vihari Piratla, Siddhartha Chaudhuri, Preethi Jyothi, and Sunita Sarawagi. ICLR 2018.
Task-Specific Representation Learning for Web-scale Entity Disambiguation. With Rijula Kar, Susmija Reddy, Sourangshu Bhattacharya and Anirban Dasgupta. AAAI 2018. code
A Two-Stage Framework for Computing Entity Relatedness in Wikipedia. With Marco Ponza and Paolo Ferragina. CIKM 2017. code, data
Relay-Linking Models for Prominence and Obsolescence in Evolving Networks. With Mayank Singh, Rajdeep Sarkar, Pawan Goyal, and Animesh Mukherjee. SIGKDD 2017. video
Earth Mover Distance Pooling over Siamese LSTMs for Automatic Short Answer Grading. With Sachin Kumar and Shourya Roy. IJCAI 2017.
Collective Entity Resolution with Multi-Focal Attention. With Amir Globerson, Nevena Lazic, Amarnag Subramanya, Michael Ringgaard and Fernando Pereira. ACL 2016.
Discriminative Link Prediction using Local, Community, and Global Signals. With Abir De, Sourangshu Bhattacharya, Sourav Sarkar and Niloy Ganguly. IEEE TKDE Journal, 2016.
Knowledge Graph and Corpus Driven Segmentation and Answer Inference for Telegraphic Entity-seeking Queries. With Mandar Joshi and Uma Sawant. EMNLP 2014.
Quantity Queries on Web Tables: Annotation, Response and Consensus Models. With Sunita Sarawagi. SIGKDD 2014. code
Discriminative Link Prediction using Local Links, Node Features and Community Structure. With Abir De and Niloy Ganguly. ICDM 2013.
Joint Bootstrapping of Corpus Annotations and Entity Types. With Siddhanth Jain and Hrushikesh Mohapatra. EMNLP 2013.
Web-scale Entity Annotation Using MapReduce. With Shashank Gupta and Varun Chandramouli. HiPC 2013.
Learning Joint Query Interpretation and Response Ranking. With Uma Sawant. WWW 2013.
Compressed Data Structures for Annotated Web Search. With Sasidhar Kasturi, Bharath Balakrishnan, Ganesh Ramakrishnan, and Rohit Saraf. WWW 2012.
Diversity in ranking via resistive graph centers. With Avinava Dubey and Chiru Bhattacharyya. SIGKDD 2011. (Source code is available, contact Avinava Dubey for usage details.)
SCAD: Collective Discovery of Attribute Values. With Anton Bakalov, Ariel Fuxman, and Partha Talukdar. WWW 2011.
Index Design and Query Processing for Graph Conductance Search. With Amit Pathak and Manish Gupta. VLDB Journal, 2010.
Annotating and Searching Web Tables Using Entities, Types and Relationships. With Girija Limaye and Sunita Sarawagi. VLDB 2010.
Conditional Models for Non-smooth Ranking Loss Functions. With Avinava Dubey, Jinesh Machchhar, and Chiru Bhattacharyya. ICDM 2009, Miami.
Learning to rank for quantity consensus queries. With Somnath Banerjee and Ganesh Ramakrishnan. SIGIR 2009, Boston.
Collective annotation of Wikipedia entities in Web text. With Sayali Kulkarni, Amit Singh and Ganesh Ramakrishnan. SIGKDD 2009, Paris.
Text search enhanced with types and entities. Chapter in Text Mining: Theory, Application, and Visualization, Srivastava and Sahami, eds., 2008.
New closed form bounds on the partition function. With Dvijotham Krishnamurthy and Subhasis Chaudhuri. ECML/PKDD 2008, Antwerp. Winner of the best student paper award.
Structured Learning for Non-Smooth Ranking Losses. With Rajiv Khanna, Uma Sawant and Chiru Bhattacharyya. SIGKDD 2008, Las Vegas.
Learning to rank in vector spaces and social networks. Internet Mathematics, 2008.
Focused Web Crawling. Entry in the Encyclopedia of Database Systems, 2008.
The influence of search engines on preferential attachment. With Alan Frieze and Juan Vera. Internet Mathematics, volume 3, number 3 (2006–2007), pages 361–381. A preliminary version appeared in SODA 2005.
Learning Random Walks to Rank Nodes in Graphs. With Alekh Agarwal. ICML 2007, Oregon.
Dynamic Personalized Pagerank in Entity-Relation Graphs. WWW 2007, Banff.
Accelerating Newton optimization for log-linear models through feature redundancy. With Arpit Mathur. IEEE ICDM 2006, Hong Kong.
Learning parameters in entity-relationship graphs from ranking preferences. With Alekh Agarwal. ECML-PKDD 2006, Berlin.
Learning to rank networked entities. With Alekh Agarwal and Sunny Aggarwal. SIGKDD Conference 2006, Philadelphia.
Optimizing Scoring Functions and Indexes for Proximity Search in Type-annotated Corpora. With Kriti Puniyani and Sujatha Das. WWW 2006, Edinburgh.
Enhanced Answer Type Inference from Questions using Sequential Models. With Vijay Krishnan and Sujatha Das. EMNLP/HLT 2005, Vancouver.
Bidirectional Expansion For Keyword Search on Graph Databases. With Varun Kacholia, Shashank Pandit, S. Sudarshan, Rushi Desai and Hrishikesh Karambelkar. VLDB 2005.
Shuffling a Stacked Deck: The Case for Partially Randomized Ranking of Search Engine Results. With Sandeep Pandey, Sourashis Roy, Chris Olston, and Junghoo Cho. VLDB 2005.
Is question answering an acquired skill? With Ganesh Ramakrishnan, Deepa Paranjpe, and Pushpak Bhattacharyya. WWW 2004, New York City.
Fast and accurate text classification via multiple linear discriminant projections. With Shourya Roy and Mahesh Soundalgekar. VLDB Journal, 12(2), pages 170–185 [conference version, talk slides].
Cross-Training: Learning Probabilistic Mappings Between Topics. With Sunita Sarawagi and Shantanu Godbole. SIGKDD Conference 2003, Washington D.C.
Monitoring the Dynamic Web to respond to Continuous Queries. With Sandeep Pandey and Krithi Ramamritham. WWW 2003, Budapest, Hungary, May 2003. (talk slides.)
Accelerated focused crawling through online relevance feedback. With Kunal Punera and Mallela Subramanyam. WWW 2002, Hawaii. (Local copy.)
The structure of broad topics on the Web. With Mukul Joshi, Kunal Punera, and David M. Pennock. WWW 2002, Hawaii. (Local copy.)
Keyword Searching and Browsing in Databases using BANKS. With Gaurav Bhalotia, Charuta Nakhe, Arvind Hulgeri, and S. Sudarshan. In ICDE 2002. Also see the BANKS home page. Winner of the ICDE 2012 influential paper award.
Enhanced topic distillation using text, markup tags, and hyperlinks. With Mukul M. Joshi and Vivek B. Tawde. In SIGIR 2001 (talk slides).
Integrating the Document Object Model with hyperlinks for enhanced topic distillation and information extraction. In the 10th International World Wide Web Conference, Hong Kong, May 2001.
Memex: A browsing assistant for collaborative archiving and mining of surf trails. With Sandeep Srivastava, Mallela Subramanyam and Mitul Tiwari. Demo at VLDB 2000.
Data mining for hypertext: A tutorial survey. SIGKDD Explorations, 1(2), pages 1–11, 2000.
Using Memex to archive and mine community Web browsing experience. With Sandeep Srivastava, Mallela Subramanyam and Mitul Tiwari. In the 9th International World Wide Web Conference, Amsterdam, May 2000. Talk slides. Social bookmarking companies founded long after this paper: HistorySE, Delicious, Digg, StumbleUpon, Reddit, Furl, Simpy, Citeulike, etc., and finally, Mozilla Pocket!
Mining the Link Structure of the World Wide Web. With Byron E. Dom, S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, David Gibson, and Jon Kleinberg. In IEEE Computer, vol. 32, no. 8, August 1999.
Distributed Hypertext Resource Discovery Through Examples. With Martin van den Berg and Byron Dom. VLDB 1999, Edinburgh, Scotland. Talk slides.
Hypersearching the Web. With Byron Dom, S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, Jon M. Kleinberg, and David Gibson. Invited paper in Scientific American, June 1999.
Surfing the Web Backwards. With D. A. Gibson and K. S. McCurley. In WWW 1999.
Focused crawling: A new approach to topic-specific Web resource discovery. With M. van den Berg and B. Dom. WWW 1999, Toronto, May 1999. Winner of the best paper award.

Talks and meetings

On Soft Permutation, Set Alignment, and Graph Matching. Keynote talk at FIRE 2024.
Deep Knowledge Graph Representation Learning for Completion, Alignment, and Question Answering. Tutorial at SIGIR 2022.
Temporal Knowledge Graph Representation and Question Answering. ACSS 2021, FIRE 2021.
The future of search and recommendation: Beyond web search: panel discussion at Microsoft Research Summit 2021.
A brief history of question answering. Invited talk at The Future of the Web track, WebConf 2021.
Graph Neural Networks and Knowledge Graph Completion. Distinguished lecture at the Kohli Center on Intelligent Systems, IIIT Hyderabad, 2021/03/30. Distinguished seminar at ConcertAI, 2021/03/10.
Knowledge Base Completion: The Role of Types and Time. Amazon Research Days, 2020.
Learning New Type Representations from Knowledge Graphs. Keynote talk at KG4IR 2018 (video).
Tutorial on knowledge extraction and inference from text. Subset of CIKM 2017 tutorial, at SIGIR 2018.
Answering questions: The shallow and the deep. TIFR STCS seminar. April 2018. Flipkart Blue Sky seminar, June 2018. Interview.
Tutorial with Partha Talukdar at CIKM 2017 on Knowledge Extraction and Inference from Text.
Keynote talk at CoDS 2017, Chennai, March 2017.
Keynote talk at CIKM 2014 Industry Track, Nov 2014.
Keynote talk at COMSNETS 2014, Bangalore, Jan 2014.
Tutorial on Query Interpretation and Representation for Searching the Web of Objects at WWW 2013, Rio de Janeiro.
WWW 2010 Conference, NC, April 2010.
Keynote talk at WSDM 2010, NYC, February 2010. [Talk slides.]
WWW 2010 PC meeting, Salt Lake City, Utah, January 2010.
WWW 2009 tutorial and panel, April 2009.
SIGIR 2008 PC meeting, University of Maryland, March 2008.
WSDM 2008, Stanford University, February 2008.
Tutorial on Learning to rank in vector spaces and social networks at WWW 2007, Banff.
Keynote talk at WAW and a short course at Banff, Nov 2006.
Invited talk at the International Workshop on Intelligent Information Access, Helsinki, July 2006.
Invited talk at the ICML 2005 workshop on Learning in Web Search.
Invited talk at the ICML 2005 workshop on Learning and Extending Lexical Ontologies by using Machine Learning Methods.
Panel discussion on exploiting dynamic networking effects in Web advertising at WWW 2005.
Invited talk and position paper at ECML/PKDD in Pisa, Sept. 2004.
Short course on machine learning for hypertext applications at ADFOCS in Saarbrücken, Sept. 2004.
Graph structures in data mining. A tutorial presented at SIGKDD 2004 with Christos Faloutsos.
Text search for fine-grained semi-structured data. A tutorial presented at VLDB 2002.
Beyond hubs and authorities: spreading out and zooming in. Invited talk at ICDT International Workshop on Web Dynamics, London, Jan. 2001.
Data Mining and Learning on the Web. NIPS Workshop, Denver, Dec. 2000. By invitation.
Nurturing content-based collaborative communities on the Web. Invited talk at the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC), Hong Kong, Oct. 7–8, 2000.
Hypertext data mining: A tutorial presented at the SIGKDD Conference, Boston, August 2000.
Hypertext databases and hypertext data mining. SIGMOD 1999 Tutorial.

Patents

Determining NCCs and/or using the NCCs to adapt performance of computer-based action(s).
US8447766: Method and system for searching unstructured textual data for quantitative answers to queries.
US6112221: System and method for scheduling web servers with a quality-of-service guarantee for each user.
US6418433: System and method for focussed web crawling.
US6389436: Enhanced hypertext categorization using hyperlinks.
US6336112B2: Method for interactively creating an information database including preferred information elements, such as, preferred-authority, world wide web pages.
US6334131: Method for cataloging, filtering, and relevance ranking frame-based hierarchical information structures.
Method and system for filtering of information entities.
Method and system for distributed autonomous maintenance of bidirectional hyperlink metadata on the web and similar hypermedia repository.
Feature diffusion across hyperlinks.
US6189005: System and method for mining surprising temporal patterns.
System and method for dynamic index-probe optimizations for high-dimensional similarity search.
US6233575: Multilevel taxonomy based on features derived from training documents classification using fisher values as discrimination values.