COMAD 2005 TUTORIALS

Tutorial T1
Querying and Mining Data Streams:
You Only Get One Look
Rajeev Rastogi (Lucent Bell Labs)

DURATION: 1.5 Hours

ABSTRACT:
Continuous data streams arise naturally, for example, in the network installations of large telecom and Internet service providers, where detailed usage information from different parts of the network needs to be continuously collected and analyzed for interesting trends. This tutorial will provide a comprehensive overview of the key research results in data stream processing to date.

Our discussion will be structured as follows.

* Introduction: Basic stream-processing models and architectures; motivating applications.

* Basic Stream Summarization Algorithms: Samples, quantiles/histograms, sketches, wavelets over streaming data.

* Processing Queries on Streams: Using sketches for self-joins, binary joins, and complex joins over data streams; estimating correlated aggregates; using histogram and wavelet synopses for approximate-query processing.

* Mining High-speed Data Streams: Single-pass algorithms for rule discovery, clustering, and decision-tree construction over streams.

* Advanced Topics and Future Research Directions: Hot-list maintenance; distinct-value estimation; multi-dimensional synopses; content-based filtering of streaming XML documents.

This tutorial is targeted at researchers and practitioners who want to obtain a solid understanding of the state of the art in stream query processing and analysis.

PRESENTER BIO:
Rajeev Rastogi (PhD 1993, U.Texas-Austin) is the Executive Director of Bell Labs Research in Bangalore, India. He is a Bell Labs Fellow, and currently serves on the editorial boards of IEEE Transactions on Knowledge and Data Engineering, and Knowledge and Information Systems. His research interests include database systems, network management and knowledge discovery. His most recent research has focused on the areas of data stream analysis, data mining, XML publishing systems, and network discovery, monitoring and configuration.

Tutorial T2
Approximate Query Processing Techniques
Gautam Das (U.Texas - Arlington)

DURATION: 1.5 Hours

ABSTRACT:
In recent years, advances in data collection and management technologies have led to a proliferation of very large databases. However, effective data analysis such as data mining and decision support on such multi-gigabyte repositories has proven difficult to achieve. This is primarily because most analysis queries, by their nature, require aggregation or summarization of large portions of the data. Processing even a single analysis query involves accessing enormous amounts of data, leading to prohibitively expensive running times.

While keeping query response times short is very important in such applications, exactness in query results is frequently less important. In many cases, "ballpark estimates" are adequate to provide the desired insights about the data. The acceptability of inexact query answers coupled with the necessity for fast query response times has led researchers to investigate Approximate Query Processing techniques that sacrifice accuracy to improve running time, typically through some sort of lossy data compression.

In this tutorial we survey many of the approximate query processing techniques that have been developed in recent years, such as online aggregation, systems based on precomputed random samples, and systems based on wavelet and histogram data structures. Our focus is to illustrate the fundamental principles that underlie these various approaches, rather than attempt an exhaustive survey.

PRESENTER BIO:
Gautam Das (PhD 1990, U.Wisconsin-Madison) has been a Professor at the University of Texas at Arlington since Fall 2004. Prior to joining UTA, Dr. Das was a Researcher in the Data Management, Exploration and Mining (DMX) Group at Microsoft Research for five years. Dr. Das has also held positions at Compaq Corporation and the University of Memphis. Dr. Das graduated with a B.Tech. in computer science from IIT Kanpur, India, and with a Ph.D. in computer science from the University of Wisconsin, Madison. Dr. Das's research interests span data mining, information retrieval, databases, approximate query processing, heterogeneous data sources, applied graph and network algorithms, and computational geometry. While at Microsoft Research he was extensively involved in the design and implementation of association mining algorithms, sampling-based approximate query processing, and research in database/IR integration. Prior to Microsoft, he worked on clustering of categorical data and time series similarity, and also in the classical algorithms areas of shortest paths, spanning networks and geometric visibility. He has published over 60 papers, many of which have appeared in premier data mining, database and algorithms conferences (e.g., SIGMOD, VLDB, ICDE, KDD, STOC, SODA, SoCG) as well as in several leading journals and invited book chapters. Dr. Das has served as the Program Chair of CIT 2004 and SIGMOD-DMKD 2004, and on the program committees of premier conferences such as SIGMOD, ICDE, KDD, and ICML.

Tutorial T3a+T3b
The Continued Saga of DB-IR Integration
Ricardo Baeza-Yates (U. of Chile)

DURATION: 3 Hours

ABSTRACT:
The world of data has been developed from two main points of view: the structured relational data model and the unstructured text model. The two distinct cultures of database and information retrieval now have a natural meeting place in the Web with its semi-structured XML model. As web-style searching becomes a ubiquitous tool, the need for integrating these two viewpoints becomes even more important. In this tutorial we explore the differences, the problems and the techniques for DB-IR integration for a range of applications. The tutorial will provide an overview of the different approaches put forward by the IR and DB communities and survey the DB-IR integration efforts. Both earlier proposals and recent ones (in the context of XML in particular) will be discussed. A variety of application scenarios for DB-IR integration will be covered. The target audience of this tutorial includes researchers in database systems, as well as developers of Web and database/information retrieval applications.

PRESENTER BIO:
Ricardo Baeza-Yates (PhD 1989, U.Waterloo) is a professor at the Computer Science Department of the University of Chile, where he was the chair from 1993 to 1995. He is also the director of the Center for Web Research. His research interests include information retrieval, algorithms, and information visualization. He is co-author of the book Modern Information Retrieval, published in 1999 by Addison-Wesley. In 2002 he was appointed to the Chilean Academy of Sciences, the first person from computer science to achieve this position in Chile.

Tutorial T4
Web Information Retrieval
Krishna Bharat (Google India)

DURATION: 1.5 Hours

ABSTRACT:
Search engines have changed the way people access information in their daily lives. They find needles in massive electronic haystacks in the blink of an eye. Web information retrieval is a new field that extends the science of search to terabyte-scale online collections.

This tutorial will cover:
- The similarities and differences between classical and web information retrieval
- The development of link-based retrieval methods in conjunction with text-based methods for searching the web
- Search engine measurement
- Infrastructure for processing the web graph and computing on web-scale collections

PRESENTER BIO:
Krishna Bharat (PhD 1996, Georgia Tech) heads Google's new R&D Centre in Bangalore, India. Google Engineering is expanding globally. Recently, Google R&D Centres were established in Zurich, Bangalore and Tokyo to tap local engineering and linguistic expertise, transplant Google's culture of innovation, and meet the needs of a global user base. Krishna is a Principal Scientist at Google Inc, working in the area of UI and algorithmic support for Web search and content analysis (Web Information Retrieval). He graduated with a Ph.D. in Computer Science from Georgia Tech in 1996 and a BTech from IIT-Madras in 1991. Before joining Google in 1999, he was a member of the research staff at DEC Systems Research Center in Palo Alto, CA. Recently, he created Google News (http://news.google.com/), a computer-generated newspaper that unifies news from online newspapers worldwide with an emphasis on diversity and balance.