COMAD 2005 TUTORIALS
Querying and Mining Data Streams:
You Only Get One Look
Rajeev Rastogi (Lucent
DURATION: 1.5 Hours
ABSTRACT:
Continuous data streams arise naturally, for
example, in the network installations of large Telecom and Internet service
providers where detailed usage information from different parts of the network
needs to be continuously collected and analyzed for interesting trends. This
tutorial will provide a comprehensive and clear overview of the key research
results in data stream processing to date.
Our discussion will be structured as follows.
* Introduction: Basic stream-processing models and architectures; motivating
applications.
* Basic Stream Summarization Algorithms: Samples, quantiles/histograms,
sketches, wavelets over streaming data.
* Processing Queries on Streams: Using sketches for self-joins, binary joins,
and complex joins over data streams; estimating correlated aggregates; using
histogram and wavelet synopses for approximate-query processing.
* Mining High-speed Data Streams: Single-pass algorithms for rule discovery,
clustering, and decision-tree construction over streams.
* Advanced Topics and Future Research Directions: Hot-list maintenance;
distinct-value estimation; multi-dimensional synopses; content-based filtering
of streaming XML documents.
This tutorial is targeted at researchers and practitioners who want to obtain a
solid understanding of the state-of-the-art in stream query processing and
analysis.
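To make the "Samples" item in the outline above concrete, here is a minimal sketch of reservoir sampling, a standard single-pass technique for keeping a uniform random sample of a stream whose length is not known in advance. It is an illustration only, not taken from the tutorial material; the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from a stream.

    Each item is inspected exactly once and only k items are kept in
    memory. The name and signature are illustrative assumptions.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Simulated stream of 1000 usage records, summarized by 10 of them.
    print(reservoir_sample(range(1000), 10, random.Random(0)))
```

The sample can then stand in for the full stream when estimating aggregates, at the cost of sampling error.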
PRESENTER BIO:
Rajeev Rastogi
(PhD 1993, U.Texas-Austin) is the Executive Director
of Bell Labs Research in
Tutorial T2
Approximate Query Processing Techniques
Gautam Das (U.Texas -
DURATION: 1.5 Hours
ABSTRACT:
In recent years, advances in data collection
and management technologies have led to a proliferation of very large
databases. However, effective data analysis such as data mining and decision
support on such multi-gigabyte repositories has proven difficult to achieve.
This is primarily because most analysis queries, by their nature, require
aggregation or summarization of large portions of the data. Processing even a
single analysis query involves accessing enormous amounts of data, leading to
prohibitively expensive running times.
While keeping query response times short is very important in such
applications, exactness in query results is frequently less important. In many
cases, "ballpark estimates" are adequate to provide the desired insights
about the data. The acceptability of inexact query answers coupled with the
necessity for fast query response times has led researchers to investigate
Approximate Query Processing techniques that sacrifice accuracy to improve
running time, typically through some sort of lossy data compression.
In this tutorial we survey many of the approximate query processing
techniques developed in recent years, such as online aggregation, systems
based on precomputed random samples, and systems based on wavelet and
histogram data structures. Our focus is to illustrate the fundamental
principles that underlie these approaches, rather than to attempt an
exhaustive survey.
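As a rough illustration of the precomputed-sample idea, the sketch below draws a uniform random sample of a table once and then answers a SUM query from the sample alone, scaling the result up to the full table size. It is a simplified sketch under stated assumptions, not the design of any particular system; `build_sample` and `estimate_sum` are hypothetical names, and the table is modelled as a plain list of numbers.

```python
import random


def build_sample(table, sample_size, rng=None):
    """Precompute a uniform random sample (without replacement) of the table."""
    rng = rng or random.Random()
    return rng.sample(table, min(sample_size, len(table)))


def estimate_sum(sample, table_size):
    """Estimate SUM over the full table by scaling the sample sum.

    The estimate is table_size * (mean of the sample); it is exact only
    when the sample happens to mirror the table perfectly.
    """
    if not sample:
        return 0.0
    return table_size * sum(sample) / len(sample)


if __name__ == "__main__":
    rng = random.Random(42)
    table = [rng.uniform(0, 100) for _ in range(100_000)]
    sample = build_sample(table, 1_000, rng)
    print("exact sum:     ", sum(table))
    print("estimated sum: ", estimate_sum(sample, len(table)))
```

The query touches 1,000 values instead of 100,000, which is the trade of accuracy for running time that the abstract describes.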
PRESENTER BIO:
Gautam Das (PhD 1990, U.Wisconsin-Madison) is a Professor at
Tutorial T3a+T3b
The Continued Saga of DB-IR Integration
Ricardo Baeza-Yates (U. of
DURATION: 3 Hours
ABSTRACT:
The world of data has been developed from two
main points of view: the structured relational data model and the unstructured
text model. The two distinct cultures of database and information retrieval now
have a natural meeting place in the Web with its semi-structured XML model. As
web-style searching becomes the ubiquitous tool, the need for integrating these
two viewpoints becomes even more important. In this tutorial we explore the
differences, the problems and the techniques for DB-IR integration for a range
of applications. The tutorial will provide an overview of the different
approaches put forward by the IR and DB communities and survey the DB-IR
integration efforts. Both earlier proposals and recent ones (in the
context of XML in particular) will be discussed. A variety of application
scenarios for DB-IR integration will be covered. The objective of this tutorial
is to provide an overview of the issues and approaches developed for the
integration of database and information retrieval systems. The target audience
of this tutorial includes researchers in database systems, as well as
developers of Web and database/information retrieval applications.
PRESENTER BIOS:
Ricardo Baeza-Yates (PhD 1989, U.Waterloo) is a professor at the Computer
Science Department of the
Tutorial T4
Web Information Retrieval
Krishna Bharat (Google
India)
DURATION: 1.5 Hours
ABSTRACT:
Search engines have changed the way people
access information in their daily lives. They find needles in massive,
electronic haystacks in the blink of an eye. Web information retrieval is a new
field that extends the science of search to terabyte-scale online collections.
This tutorial will cover:
- The similarities and differences between classical and web information
retrieval
- The development of link-based retrieval methods in conjunction with
text-based methods for searching the web
- Search engine measurement
- Infrastructure for processing the web graph and computing on web-scale
collections
PRESENTER BIO:
Krishna Bharat (PhD 1996, Georgia Tech) heads Google's new R&D Centre in
Bangalore, India. Google Engineering is expanding globally. Recently, Google
R&D Centres were
established in