CS 632 Advanced Data Base(d) Management Systems Home Page

CS 632: Advanced DBMS

S. Sudarshan

Spring 2011

Previous offerings: 2010, 2009, 2007, 2006, 2004, 2003, 2002, 2001, 2000, 1999.

End paper and Midsem paper from 2009

About The Course

Reading material will consist primarily of research papers. All students will have to present a research paper of their choice, either from the list below or other papers subject to instructors approval. There will also be two exams (midsem/endsem), assignments, and a course project.

Anyone who does an exceptional course project that has the potential to be a publishable paper is eligible for a straight AA grade. Otherwise the grading breakup would be midsem 25, endsem 40, project 20 and assignments plus seminar presentation 15 (the breakup of these will depend on whether we have individual or joint seminars, which depends on the final enrollment).

Assignments The first set of assignments are on setting up Hadoop and running some simple map-reduce jobs. These will be in groups of two. Watch the moodle page for assignment details.

Project This year the project is mandatorily an implementation oriented project, unlike previous years where a literature survey was acceptable as a project. (You may still need to do some literature survey to figure out your project though.) Projects should be done in groups of 2.

A basic project will take any of the papers we study in the course, or other related papers, and implement the algorithms in the paper, and do a very basic performance study. However, I would expect most projects to improve upon existing techniques.

A more advanced project would take a problem specification for which no solution is publicly available, figure out how to solve it, and implement the solution.

Textbook (for background material only)

Database System Concepts, 6th Ed.
Avi Silberschatz, Hank Korth, and S. Sudarshan. McGraw Hill, 2010.
(book home page)

Query Optimization

Rule-Based Query Optimization using the Volcano Framework.
Chapter 2 from Multiquery Optimization and Applications,
Prasan Roy, PhD thesis, IIT Bombay, 2000.
Talk (ppt) (Jan 4, 2011)
Efficient and Extensible Algorithms for Multi-Query Optimization,
Prasan Roy, S. Seshadri, S. Sudarshan, and Siddhesh Bhobhe,
In ACM SIGMOD Conf. on the Management of Data., 2000.
Talk (ppt) (Jan 7/11, 2011)
Rewriting Procedures for Batched Bindings
Ravindra Guravannavar and S. Sudarshan, VLDB 2008
Talk (ppt) (11/18 Jan 2011)
Related papers, not required reading:
- Optimizing Nested Queries with Parameter Sort Orders
  Ravindra Guravannavar, Ramanujam H.S., S. Sudarshan
  Talk (ppt)
- Reducing Order Enforcement Cost in Complex Query Plans
  Ravindra Guravannavar and S. Sudarshan, ICDE 2007
  Tech Report (@arxiv.org)
  Talk (ppt)
Query Processing for SQL Updates
Cesar A. Galindo-Legaria, Stefano Stefani, Florian Waas
SIGMOD Conference 2004: 844-849
talk (ppt) (21 Jan 2011)
Introduction to Hadoop, to get you started on the first assignment. Map Reduce Talk and Cloud Data Stores Talk
Adaptive Query Processing
Eddies: Continuously Adaptive Query Processing,
Avnur and Hellerstein, SIGMOD 2000.
(Talk (ppt)) (taken from http://web.cs.wpi.edu/~cs561/s05/talks/eddy-sigmod00-cs561.ppt)
(Adaptive Query Processing using Eddies (ppt) by Amol Deshpande) (Jan 25, 2011)
Robust Query Processing through Progressive Optimization,
Volker Markl, Vijayshankar Raman, David E. Simmen, Guy M. Lohman, Hamid Pirahesh, SIGMOD 2004: 659-670
Talk (ppt) (Jan 28, 2011)
Massively Parallel Database/Storage Systems
Background reading: The parallel database chapter and the distributed database chapter from DB Concepts.
Slides: Chapter 18: Parallel Databases, and Chapter 19: Distributed Databases (plus 3PC, not available on book site) (Feb 1, 2011)
Hyracks: A Flexible and Extensible Foundation for Data-Intensive Computing
Vinayak Borkar, Michael Carey, Raman Grover, Nicola Onose, Rares Vernica
(Talk (pptx)) by Vinayak Borkar (Feb 4, 2011)
Related papers, not required reading:
- MapReduce: Simplifed Data Processing on Large Clusters
  Jeffrey Dean and Sanjay Ghemawat, OSDI 2004,
  Talk
Bigtable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, OSDI 06)
Video of talk by Jeff Dean: Local mp4 copy OR on video.google.com
Talk (ppt) (Feb 8, 2011)
Related papers, not required reading:
- Megastore: Providing Scalable, Highly Available Storage for Interactive Services,
  Jason Baker, Chris Bond, James C. Corbett, JJ Furman, Andrey Khorlin, James Larson, Jean-Michel Leon, Yawei Li, Alexander Lloyd, and Vadim Yushprakh
  CIDR 2011
- You can also read about the Google AppEngines DataStore API, an API in Python.
PNUTS: Yahoo!'s Hosted Data Serving Platform,
Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni.
VLDB (industry track) 2008.
VLDB Talk by Brian Cooper (ppt) (Feb 11, 2010)
Related papers, not required reading:
- database implementation on S3 (Brantner et al SIGMOD 2008)
Asynchronous view maintenance for VLSD databases
Parag Agrawal, Adam Silberstein, Brian F. Cooper, Utkarsh Srivastava, Raghu Ramakrishnan, SIGMOD 2009
Talk from 2011 (odp) and (pdf) Prashant Jaiswal and Ketan Mav (Feb 15, 2011)
Old talk from 2010 (ppt)
Related papers, not required reading:
- The Megastore paper (see above), to understand how it does asynchronous maintenance of indices.
Pregel: a system for large-scale graph processing
Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. Dehnert, Ilan Horn, Naty Leiser, Grzegorz Czajkowski, SIGMOD 2010.
Talk (pptx) by (Gaurav Malpani and Mayank Singhal) (Feb 18, 2011)
Related papers, not required reading:
- Keyword Search on External Memory Data Graphs
  Bhavana Dalvi, Meghana Kshirsagar and S. Sudarshan,
  Proceedings of the VLDB 1(1), 1189-1204, 2008 ( ppt, with a minor correction)
Week of 21-26 Feb: Midsemester Exam (on Feb 26)

The list of papers below will get refined as we go along in 2011.
SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets
Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou,
VLDB 2008
Talk (pptx) by Sapna Jain and R. Gokilavani (Mar 4, 2011)
Related papers, not required reading:
- Hive - a petabyte scale data warehouse using Hadoop,
  A. Thusoo, J. S. Sarma, N. Jain, Shao Zheng, P. Chakka, Zhang Ning, S. Antony, Liu Hao, and R. Murthy,
  ICDE 2010
- Pig Latin: A Not-So-Foreign Language for Data Processing
  Chris Olston, Brian Reed, Utarsh Srivastava, Ravi Kumar and Andrew Tomkins
  SIGMOD 2008, Talk (ppt)
- Incorporating partitioning and parallel plans into the SCOPE optimizer. Jingren Zhou, Per-Åke Larson, Ronnie Chaiken ICDE 2010: 1060-1071
IR and DB
Keyword Searching and Browsing in Databases using BANKS
Gaurav Bhalotia, Charuta Nakhe, Arvind Hulgeri, Soumen Chakrabarti and S. Sudarshan, ICDE 2002
(Long talk by Sudarshan, ppt), (Short talk by Ramdas, pdf) (Mar 1, 2011)

Related papers, not required reading:
- Bidirectional Expansion For Keyword Search on Graph Databases,
  Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S Sudarshan, Rushi Desai and Hrishikesh Karambelkar, VLDB 2005
  (Talk: ppt)
Database Testing
Reverse Query Processing
Carsten Binnig, Donald Kossmann and Eric Lo, ICDE 2007,
Talk (pptx) by Ankit Shah and Bikash Chandra (Mar 8, 2011)
Old talk from 2010
Related papers, not required reading:
- Reverse Query Processing (Technical Report)
  Carsten Binnig, Donald Kossmann and Eric Lo, ETH Zurich, 2007
- QAGen: Generating Query-Aware Test Databases
  Carsten Binnig, Donald Kossmann, Eric Lo and M. Tamer. Ozsu, SIGMOD 2007
- Massive Stochastic Testing of SQL Donald R. Slutz
  VLDB 1998: 618-622
  Talk by Manan Shah
Generating Test Data for Killing SQL Mutants: A Constraint-based Approach,
Shetal Shah, S. Sudarshan, Suhas Kajbaje, Sandeep Patidar, Bhanu Pratap Gupta, Devang Vira, ICDE 2011 (to appear)
Talk (ppt) (Mar 11, 2011)
Related papers, not required reading:
- X-Data: Test Data Generation for Killing SQL Mutants
  Bhanu Pratap Gupta, Devang Vira and S. Sudarshan, ICDE 2010 (short paper). Talk at ICDE
Automating the Detection of Snapshot Isolation Anomalies
Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, S. Sudarshan
VLDB 2007: 1263-1274
Talk (in ppt, but created in OO) by Rupesh Bende and Viraj Churi (Mar 15, 2011)
Old talk from 2010
Related papers, not required reading:
- Making snapshot isolation serializable
  Alan Fekete, Dimitrios Liarokapis, Elizabeth O'Neil, Patrick O'Neil and Dennis Shasha
  ACM TODS, 30(2) June 2005
- Allocating Isolation Levels to Transactions, Alan Fekete, PODS 2005
Peer to Peer Systems
Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications,
I. Stoica, R. Morris, D. Karger, M. Frans Kaashoek, H. Balakrishnan,
In Proc. ACM SIGCOMM 2001. Expanded version appears in IEEE/ACM Trans. Networking, 11(1), February 2003.
P2P overview talk (in pdf) and
Talk (pptx) by Anirban Basumallik and Somdas Bandyopadhyay (Mar 18, 2011))
Old talk from 2010
Related papers, not required reading:
- A Scalable Content-Addressable Network,
  S. Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker, In Proc. ACM SIGCOMM 2001)
- Querying the Internet with PIER
  Ryan Huebsch, Joseph M. Hellerstein, Nick Lanham, Boon Thau Loo, Scott Shenker, and Ion Stoica, VLDB 03
  (Talk:ppt)
Applications of Declarative Programming
Ajax-based report pages as incrementally rendered views
Yupeng Fu, Keith Kowalczykowski, Kian Win Ong, Yannis Papakonstantinou, Kevin Keliang Zhao , SIGMOD 2010
Talk (pptx) by Anjali Singhal and Manisha Naik Gaonkar (Mar 22, 2011)
Related papers, not required reading:
- Y. Fu, K. W. Ong, Y. Papakonstantinou, M. Petropoulos The SQL-based all-declarative FORWARD web application development framework
  Proc. Conference on Innovative Data Systems Research (CIDR), 2011.
  (slides)
- Do-It-Yourself custom forms-driven workflow applications.
  Keith Kowalczykowski, Kian Win Ong, Kevin Keliang Zhao, Alin Deutsch, Yannis Papakonstantinou, Michalis Petropoulos: CIDR 2009.
Scalability for Virtual Worlds
Nitin Gupta, Alan J. Demers, Johannes Gehrke, Philipp Unterbrunner, Walker M. White
ICDE 2009
Talk (ppt) by Siddharth Chinoy and Zibran Shaikh (Mar 22, 2011)
Related papers, not required reading:
- SEMMO: A Scalable Engine for Massively Multiplayer Online Games (Demonstration Paper) Nitin Gupta, Alan Demers, and Johannes Gehrke, SIGMOD 2008
- Database Research Opportunities in Computer Games Walker White, Christoph Koch, Nitin Gupta, Johannes Gehrke, and Alan Demers, In SIGMOD Record, September 2007.
Data Streams and Publish/Subscribe
Monitoring Streams - A New Class of Data Management Applications,
Donald Carney, Ugur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Greg Seidman, Michael Stonebraker, Nesime Tatbul, Stanley B. Zdonik
VLDB 2002: 215-226
Talk (pptx) by Joydip Datta and Debarghya Majumdar (Mar 25, 2011)
You must also read this talk: (PODS 2002 talk by Motwani)
Related papers, not required reading
- Aurora: A New Model and Architecture for Data Stream Management.
  Abadi, D. J., Carney, D., Cetintemel, U., Cherniack, M., Convey, C., Lee, S., Stonebraker, M., Tatbul, N., and Zdonik, S.
  The VLDB Journal 12 (2003), 120-139.
- Abadi, D., Ahmad, Y., Balazinska, M., Cetintemel, U., Cherniack, M., Hwang, J.-H., Lindner, W., Maskey, A. S., Rasin, A., Ryvkina, E., Tatbul, N., Xing, Y., and Zdonik, S. The Design of the Borealis Stream Processing Engine. In Proceedings of the 2nd Conference on Innovative Databasee Research (CIDR) (Jan. 2005), pp. 277-289.
- Models and issues in data stream systems,
  Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Jennifer Widom PODS 2002
  (PODS 2002 talk by Motwani)
- Finding the Frequent Items in Streams of Data
  Graham Cormode and Marios Hadjielftheriou, VLDB 2008 and CACM 52(1) Oct 2009
  Talk from 2010
- Query Processing, Resource Management, and Approximation in a Data Stream Management System
  Motwani, Widom, Arasu, Babcock, Babu, Datar, Manku, Olston, Rosenstein and Varma, CIDR 2003
- Feeding Frenzy: Selectively Materializing Users’ Event Feeds
  Adam Silberstein , Jeff Terrace , Brian F. Cooper , Raghu Ramakrishnan, SIGMOD 2010
Column Stores
Column-stores vs. row-stores: how different are they really?
Daniel J. Abadi, Samuel Madden, Nabil Hachem:
SIGMOD Conference 2008: 967-980
Talk from 2010 and Talk from 2011 (ppt) by Paresh Modak and Souman Mandal (Mar 29, 2011).
See also VLDB 09 tutorial on column stores by Hariozopoulos, Abadi and Boncz
Security and Privacy
Redundancy and Information Leakage in Fine-Grained Access Control,
Govind Kabra, Ravishankar Ramamurthy and S. Sudarshan
Talk from 2010 and Talk from 2011 (ppt) by Sriharsa Mohapatra and Dileep Singh (Mar 29, 2011)
Also: SIGMOD Talk, Overview of database security and an Overview of Finegrained Authorization

Other interesting papers on privacy, not covered this year:
- l-Diversity: Privacy Beyond k-Anonymity,
  Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer and Muthuramakrishnan Venkitasubramaniam
  Talk: ppt
- Mondrian Multidimensional K-Anonymity K. LeFevre, D. DeWitt, and R. Ramakrishnan. ICDE 2006
- Incognito: Efficient Full-Domain K-Anonymity.
  K. LeFevre, D. DeWitt, and R. Ramakrishnan, SIGMOD 2005.
- Protecting Privacy when Disclosing Information: k-Anonymity and its Enforcement through Generalization and Suppression,
  Pierangela Samarati and Latanya Sweeney,
  Procs. of the IEEE Symposium on Research in Security and Privacy, 1998.)
  Talk (pdf)
XML Query Processing
Structural Joins: A Primitive for Efficient XML Query Pattern Matching,
D. Srivastava, S. Al-Khalifa, H.V. Jagadish, N. Koudas, J.M. Patel, Y.Wu, ICDE 2002.
Talk (odp) April 1, 2011
Related papers, not required reading:
- ORDPATHs: Insert-Friendly XML Node Labels,
  Patrick E. O'Neil, Elizabeth J. O'Neil, Shankar Pal, Istvan Cseri, Gideon Schaller, Nigel Westbury, SIGMOD 2004: 903-908.)
- Holistic Twig Joins: Optimal XML Pattern Matching Nicolas Bruno, Nick Koudas and D. Srivastava, SIGMOD 2002.
  ppt
Uncertain and Probabilistic Data
OLAP Over Uncertain and Imprecise Data Douglas Burdick, Prasad Deshpande, T. S. Jayram, Raghu Ramakrishnan and Shivakumar Vaithyanathan, VLDB 2005
(Talk: Olap basics(pdf) and OLAP on uncertain/imprecise data (Talk from 2011 in ppt),
Talk1 from 2010 in (pdf) ) and Talk 2 in odp and in ppt
and by T. S. Jayram April 5, 2011
(Related material if you are interested, but not part of CS632:
- Probablistic Databases talk by Amol Deshpande on 12 Jan 2006)
- Probabilistic databases: diamonds in the dirt (and an extended version)
  Nilesh Dalvi, Christopher Re and Dan Suciu, CACM 52(7) July 2009
Current research directions in data management: A discussion (8 April 2011)

About The Course

Textbook (for background material only)

Query Optimization

Adaptive Query Processing

Massively Parallel Database/Storage Systems

Week of 21-26 Feb: Midsemester Exam (on Feb 26)

The list of papers below will get refined as we go along in 2011.

IR and DB

Database Testing

Peer to Peer Systems

Applications of Declarative Programming

Data Streams and Publish/Subscribe

Column Stores

Security and Privacy

XML Query Processing

Uncertain and Probabilistic Data