Contact information
I am SOUMEN CHAKRABARTI, anagram for ANARCHISM
OUTBREAK, a faculty member in the Department of Computer Science.
If you are from industry looking for
consultation, please read the section
titled Consultative practice rules and norms (1996)
herein,
and my informal notes.
If you are looking to join as a PhD scholar, please
read about the PhD
Qualifier model being adopted by the department.
At the moment I am not offering short-term projects to
students not enrolled in a regular program at IIT Bombay. Send email
with an ASCII resume only if you are looking for a position lasting at
least a year.
If you are an IIT student looking for a
project or seminar within the scope of
your program (BTech, DD, MTech) please read
these guidelines first.
You can check my calendar for
free slots and, if you have permission,
propose a meeting here.
The best way to contact me is to send mail to
(please note that I am on a
low-spam diet). Or you can
call me at +91-22-2576-7716 or fax me at +91-22-2572-0022. Please use
only email to initiate a conversation with me if we
haven't communicated before, unless it is an emergency. If you are
visiting, here are directions to my
office.
This page is lazily mirrored at
IITB and
UC Berkeley.
The IITB version is usually up-to-date.
For legalese scroll to the end.
Education and career
- Don Bosco School, Park Circus, Calcutta, 1975--1987
- Indian Institute of Technology, Kharagpur, 1987--1991
- University of California, Berkeley, 1991--1996
- IBM Almaden Research Center, 1996--1999
- IIT Bombay, 1999--present
- Carnegie-Mellon University, Spring 2004
Research interests
- Searching graph data models using entities and relations
- I am interested in building new search systems that integrate
type and role annotations with keyword matches, thereby exploiting
lexical ontologies and entity taggers.
This research is supported by
IBM
and Microsoft (2007, 2008).
- Integrating IR with databases
- In the BANKS project,
we added two broad paradigms of keyword search in graphs that can
represent text embedded in relational or XML-like data. Watch
this space for updates on our SPIN project.
- The effect of search engines on the Web graph and page popularity
- Search engines rank Web pages partly by their (in)degree, but
their ranked lists in turn modulate page popularity and eventually
(in)degree, setting up a feedback loop. Might the evolution
of the Web graph be influenced substantially by the existence of
search engines? Is there a need to regulate monopolies? What are
healthy economic objectives, and how can they be optimized?
- Focused crawlers to build topic-specific portals
- A focused crawler collects a topic-specific
subgraph of the Web by coupling classifiers and reinforcement learners
with crawlers. An open-source focused crawler project was started at
the Lab. for Intelligent
Internet Research and is available
now.
- Mining hypertext to estimate topics and popularity
- I built a hypertext
classifier that uses the text in and links around a given Web
page to label it with a topic. This was an early application of
Markov networks to Web analysis. As a member of the
IBM Clever
Project, I worked on
algorithms
to analyze the links around a web page and the text in pages that
cite the given page to assign it a measure of popularity.
- Compiling and running parallel scientific programs
- My PhD thesis was on the
design and implementation of compilers and
runtime
systems for distributed memory multiprocessors.
Professional activity
- Journal editorship
- Conference organization
- EMNLP 2013,
area chair for information retrieval and question answering.
- WWW 2013, track chair for
search, systems and applications.
- SIGIR 2011,
area chair for Web IR and social media search.
- WWW 2010, program co-chair with
Juliana Freire.
- SIGIR
2010, senior PC member.
- Web Search APIs:
The Next Generation --- A panel discussion at
WWW 2009.
Panel slides.
- SIGIR 2009,
Area Chair, Machine Learning for IR.
- WSDM 2008 ("wisdom"),
Program Co-chair with
Andrei Broder.
- VLDB 2007,
Tutorial Co-Chair.
- ECML-PKDD 2006,
Area Chair, Track for mining links, graphs, trees and
high-dimensional data.
- WWW 2006,
Deputy Chair, Data Mining track.
- COMAD 2005b,
Associate Program Chair.
- WWW 2003,
Vice Chair, Searching and Mining track.
- ICDE 2003,
Vice Chair, Data, Text and Web Mining track.
- WWW 2002, Deputy Chair,
Searching, Querying and Indexing
track (CFP).
- Conference committee/reviewing
- WSDM 2014 (senior PC),
SIGKDD 2013 (senior PC),
WSDM 2013 (senior PC and
awards committee),
EMNLP 2012,
SIGKDD 2012 (senior PC),
WWW 2012,
NIPS 2011,
ICML 2011
(PC and invited applications talks committee),
WWW 2011,
SIGKDD 2010,
NIPS 2009,
WWW 2009,
WSDM 2009 (senior PC),
SIGKDD 2008 (senior PC),
SIGIR 2008 (senior PC),
WWW 2008,
WWW 2007,
SIGMOD 2007,
SIGKDD 2006 (senior PC),
EMNLP/HLT 2005,
SIGKDD 2005,
WWW 2005
(panel),
SIGMOD 2005,
SIGKDD 2004,
SIGIR 2004,
VLDB 2004,
WWW 2004,
ICDE 2004,
SIGIR 2003,
SIGKDD 2003,
VLDB 2003 (IIS),
SODA 2003,
SIGIR 2002,
ICDE 2002,
SIGIR 2001,
WWW 2001,
WWW 2000,
SIGKDD 1999,
AAAI
SIGKDD 1998.
- Other
- Web Search and Data Mining (WSDM) steering committee member.
- ACM SIGKDD
Curriculum Committee Member.
Courses
- Web Search and Mining has been expanded to a two-semester
sequence, shorthanded WMa (Autumn) and WMb (Spring). WMa retains the
old course code, but has been planned from scratch. WMb will be
largely about information extraction and integration, and querying
over semistructured and graphical data representations.
WMa Autumn 2009,
WMb Spring 2010,
WMa Autumn 2010,
WMb Spring 2011,
WMa Autumn 2011.
- Statistical Foundations of Machine Learning:
Autumn 2005, Autumn 2006, Autumn 2007, Autumn 2008.
- Web Search and Mining (earlier called
Information Retrieval and Mining for Hypertext and the Web):
Spring 2001,
Spring 2002,
Spring 2003,
Spring 2005,
Spring 2006
(new improved),
Spring 2007,
Spring 2008,
Spring 2009.
- Undergraduate Programming Languages:
Spring 2000,
Autumn 2000,
Autumn 2001,
Autumn 2002,
Autumn 2003,
Autumn 2004.
- Graduate Software Lab:
Autumn 1999,
Autumn 2000.
... your work is to keep cranking the flywheel that turns the gears
that spin the belt in the engine of belief that keeps you and your desk
in midair
---Annie Dillard, in The Writing Life.
Representative publications
- Learning Joint Query Interpretation and Response Ranking.
With Uma Sawant. WWW 2013.
- Compressed Data Structures for Annotated
Web Search. With
Sasidhar Kasturi, Bharath Balakrishnan,
Ganesh Ramakrishnan, and Rohit Saraf. WWW 2012.
- Diversity in ranking via
resistive graph centers.
With Avinava Dubey and Chiru Bhattacharyya.
SIGKDD 2011. (Source code
is available; contact Avinava Dubey for usage details.)
- SCAD: Collective Discovery of Attribute Values.
With Anton Bakalov, Ariel Fuxman, and Partha Talukdar.
WWW 2011.
- Index Design and Query Processing for Graph Conductance Search.
With Amit Pathak and Manish Gupta.
VLDB Journal, 2010.
- Annotating and Searching Web Tables Using Entities, Types and
Relationships. With Girija Limaye and Sunita Sarawagi.
VLDB 2010.
- Conditional Models for
Non-smooth Ranking Loss Functions.
With Avinava Dubey, Jinesh Machchhar, and Chiru Bhattacharyya.
ICDM 2009, Miami.
- Learning to rank for
quantity consensus queries.
With Somnath Banerjee and Ganesh Ramakrishnan.
SIGIR 2009, Boston.
- Collective annotation of
Wikipedia entities in Web text.
With Sayali Kulkarni, Amit Singh and Ganesh Ramakrishnan.
SIGKDD 2009, Paris.
- Text search enhanced with types and entities. Chapter in
Text Mining: Theory, Application, and Visualization,
Srivastava and Sahami, eds., 2008.
- New
closed form bounds on the partition function.
With Dvijotham Krishnamurthy and Subhasis Chaudhuri.
ECML/PKDD 2008, Antwerp.
Winner of the best
student paper award.
- Structured Learning
for Non-Smooth Ranking Losses.
With Rajiv Khanna, Uma Sawant and Chiru Bhattacharyya.
SIGKDD 2008, Las Vegas.
- Learning to rank in vector spaces and social networks.
Internet
Mathematics, 2008.
- Focused Web Crawling. Entry in the
Encyclopedia of
Database Systems, 2008.
- The
influence of search engines on preferential attachment.
With Alan Frieze and Juan Vera.
Internet Mathematics, volume 3, number 3 (2006--2007), pages 361--381.
A preliminary version
appeared in SODA 2005.
- Learning Random Walks to Rank
Nodes in Graphs. With Alekh Agarwal.
ICML 2007,
Oregon.
- Dynamic Personalized Pagerank
in Entity-Relation Graphs.
WWW 2007, Banff.
- Accelerating Newton optimization for
log-linear models through feature redundancy. With Arpit Mathur.
IEEE ICDM 2006,
Hong Kong.
- Learning parameters in entity-relationship
graphs from ranking preferences. With Alekh Agarwal.
ECML-PKDD 2006,
Berlin.
- Learning to rank networked entities.
With Alekh Agarwal and Sunny Aggarwal.
SIGKDD Conference 2006,
Philadelphia.
- Optimizing
Scoring Functions and Indexes for Proximity Search in Type-annotated
Corpora. With Kriti Puniyani and Sujatha Das.
WWW 2006, Edinburgh.
- Enhanced
Answer Type Inference from Questions using Sequential Models.
With Vijay Krishnan and Sujatha Das.
EMNLP/HLT 2005,
Vancouver.
- Bidirectional Expansion For Keyword Search on Graph Databases.
With Varun Kacholia, Shashank Pandit, S. Sudarshan,
Rushi Desai and Hrishikesh Karambelkar. VLDB 2005.
- Shuffling a Stacked Deck: The Case for Partially Randomized
Ranking of Search Engine Results.
With Sandeep Pandey, Sourashis Roy, Chris Olston, and Junghoo Cho.
VLDB 2005.
- Is question answering an
acquired skill?
With Ganesh Ramakrishnan, Deepa Paranjpe, and
Pushpak Bhattacharyya.
WWW 2004,
New York City.
- Fast and accurate text classification
via multiple linear discriminant projections.
With Shourya Roy and Mahesh Soundalgekar.
VLDB Journal, 12(2), pages 170--185
[conference version, talk slides].
- Cross-Training:
Learning Probabilistic Mappings Between Topics.
With Sunita Sarawagi and Shantanu Godbole.
SIGKDD Conference 2003,
Washington D.C.
- Monitoring the Dynamic Web
to respond to Continuous Queries.
With Sandeep Pandey and Krithi Ramamritham.
WWW 2003,
Budapest, Hungary, May 2003.
(talk slides.)
- Accelerated focused
crawling through online relevance feedback.
With Kunal Punera and Mallela Subramanyam.
WWW 2002, Hawaii.
(Local copy.)
- The structure of
broad topics on the Web.
With Mukul Joshi, Kunal Punera, and David M. Pennock.
WWW 2002, Hawaii.
(Local copy.)
- Keyword
Searching and Browsing in Databases using BANKS.
With Gaurav Bhalotia, Charuta Nakhe, Arvind Hulgeri, and S. Sudarshan.
In ICDE 2002. Also see the BANKS
home page. Winner of
the ICDE 2012
influential paper award.
- Enhanced
topic distillation using text, markup tags, and hyperlinks.
With Mukul M. Joshi and Vivek B. Tawde.
In SIGIR 2001
(talk slides).
- Integrating the
Document Object Model with hyperlinks for enhanced
topic distillation and information extraction.
In the 10th International World Wide Web
Conference, Hong Kong, May 2001.
- Memex: A browsing assistant
for collaborative archiving and mining of surf trails.
With Sandeep Srivastava, Mallela Subramanyam and Mitul Tiwari.
Demo at VLDB 2000.
- Data mining for hypertext:
A tutorial survey.
SIGKDD
Explorations, 1(2), pages 1--11, 2000.
- Using
Memex to archive and mine community Web browsing experience.
With Sandeep Srivastava, Mallela Subramanyam and Mitul Tiwari.
In the 9th International World Wide Web
Conference, Amsterdam, May 2000.
Talk slides.
Social
bookmarking companies founded long after this paper:
HistorySE,
Delicious,
Digg,
StumbleUpon,
Reddit,
Furl, Simpy, Citeulike, etc.
- Mining
the Web's Link Structure. With Byron E. Dom, S. Ravi Kumar,
Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, David Gibson,
and Jon Kleinberg. In
IEEE Computer,
vol. 32, no. 8, August 1999
(IEEE
copy).
- Distributed Hypertext Resource
Discovery Through Examples.
With Martin van den Berg and Byron Dom.
VLDB 1999, Edinburgh, Scotland.
Talk slides.
- Hypersearching
the Web. With
Byron Dom, S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan,
Andrew Tomkins, Jon M. Kleinberg, and David Gibson.
Invited paper in Scientific American,
June 1999.
- Surfing
the Web Backwards. With D. A. Gibson and K. S. McCurley.
In WWW 1999.
- Focused crawling: A
new approach to topic-specific Web resource discovery. With M. van
den Berg and B. Dom. WWW 1999,
Toronto, May 1999. Winner of the
best paper award. Also see the
project page.
Upcoming and recent talks and travel
- Tutorial on Query Interpretation and Representation
for Searching the Web of Objects at
WWW 2013, Rio de Janeiro.
- WWW 2010 Conference, NC, April 2010.
- Keynote
talk at WSDM 2010, NYC, February 2010.
[Talk slides.]
- WWW 2010 PC meeting, Salt Lake City, Utah,
January 2010.
- WWW 2009
tutorial
and panel,
April 2009.
- SIGIR 2008 PC meeting, University of Maryland, March 2008.
- WSDM 2008, Stanford University, February 2008.
- Tutorial on Learning to rank in vector
spaces and social networks at WWW
2007, Banff.
- Keynote talk at WAW
and a short
course at Banff, Nov 2006.
- Invited talk at the
International Workshop
on Intelligent Information Access, Helsinki, July 2006.
- Invited talk at the ICML 2005 workshop on Learning in Web Search.
- Invited talk at the ICML 2005 workshop on
Learning and Extending Lexical Ontologies
by using Machine Learning Methods.
- Panel discussion on exploiting dynamic
networking effects in Web advertising at
WWW 2005.
- Invited talk and position paper at
ECML/PKDD
in Pisa, Sept. 2004.
- Short course on
machine learning for hypertext applications at
ADFOCS
in Saarbrücken, Sept. 2004.
- Graph
structures in data mining. A tutorial presented at
SIGKDD
2004 with Christos
Faloutsos.
- Text search for
fine-grained semi-structured data.
A tutorial presented at VLDB 2002.
- Beyond hubs and authorities: spreading out and zooming in.
Invited talk at
ICDT International Workshop
on Web Dynamics, London, Jan. 2001.
- Data Mining and Learning on the Web. NIPS Workshop, Denver,
Dec. 2000. By invitation.
- Nurturing
content-based collaborative communities on the Web.
Invited talk at the Joint
SIGDAT
Conference on Empirical Methods in Natural Language Processing and
Very Large Corpora
(EMNLP/VLC), Hong Kong, Oct. 7--8, 2000.
- Hypertext data mining:
A tutorial presented at the
SIGKDD
Conference, Boston, August 2000.
- Hypertext databases and hypertext data mining.
SIGMOD 1999 Tutorial.
Patents
- System and method for focussed web crawling.
- Enhanced hypertext categorization using hyperlinks.
- System and method for scheduling web servers with a
quality-of-service guarantee for each user.
- Method for interactively creating an information database including
preferred information elements, such as, preferred-authority,
world wide web pages.
- Method for cataloging, filtering, and relevance ranking frame-based
hierarchical information structures.
- Multilevel taxonomy based on features derived from training documents
classification using fisher values as discrimination values.
- System and method for mining surprising temporal patterns.
- Feature diffusion across hyperlinks.
Links in areas of interest