CS 632: Advanced DBMS

S. Sudarshan

Spring 2017  

All students must sign up for CS 632 on Piazza; details are on the CS 632 Moodle page.

Previous offerings: 2016, 2015, 2014, 2013, 2011, 2010, 2009, 2007, 2006, 2004, 2003, 2002, 2001, 2000, 1999.

About The Course

Reading material will consist primarily of research papers. All students will have to present a research paper of their choice, either from the list below or another paper subject to the instructor's approval. There will also be two exams (midsem/endsem), assignments, and a course project.

Grading scheme: Anyone who does an exceptional course project that has the potential to be a publishable paper is eligible for a straight AA grade. Otherwise the grading breakup would be midsem 25, endsem 40, project 20 and assignments plus seminar presentation 15 (the breakup of these will depend on whether we have individual or joint seminars, which depends on the final enrollment).
Midsem paper and End sem paper from 2016

Assignments: To be decided.

Project: The project must be implementation-oriented, though you may still need to do some literature survey to decide on a topic. Projects should be done in groups of 2. A basic project would take any of the papers we study in the course, or other related papers, implement the algorithms in the paper, and do a very basic performance study. However, I would expect most projects to improve upon existing techniques. A more advanced project would take a problem specification for which no solution is publicly available, figure out how to solve it, and implement the solution.

Textbook (for background material only): Database System Concepts, 6th Ed.
Avi Silberschatz, Hank Korth, and S. Sudarshan. McGraw Hill, 2010.
(book home page)

Topic/Paper Schedule
Massively Parallel Data Management Systems (a.k.a. Big Data Systems)
0. Parallel and Distributed Databases. Will supply reading material. Slides: Chapter 18: Parallel Databases, and Chapter 19: Distributed Databases
1. Parallel and Distributed Data Storage Talk on massively parallel data storage ...
2. Bigtable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, OSDI 2006
Video of talk by Jeff Dean: Local mp4 copy OR on video.google.com

Related papers, not required reading:
Talk (ppt)
3. PNUTS: Yahoo!'s Hosted Data Serving Platform,
Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni.
VLDB (industry track) 2008.

Related papers, not required reading:
  • Database implementation on S3 (Brantner et al., SIGMOD 2008)
VLDB Talk by Brian Cooper (ppt)
4. Asynchronous view maintenance for VLSD databases
Parag Agrawal, Adam Silberstein, Brian F. Cooper, Utkarsh Srivastava, Raghu Ramakrishnan, SIGMOD 2009

Old talk (odp) and (pdf)
Related papers, not required reading:
  • The Megastore paper (see above), to understand how it does asynchronous maintenance of indices.
Talk (ppt)
5. Spanner: Google's Globally-Distributed Database
James C. Corbett et al., OSDI 2012
Talk (pptx)
6. Pregel: a system for large-scale graph processing
Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, Grzegorz Czajkowski, SIGMOD 2010.
Talk (pptx) by Gaurav Malpani and Mayank Singhal
Related papers, not required reading:
Talk (pptx)
7. Calvin: Fast Distributed Transactions for Partitioned Database Systems
Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi.
Talk (pptx)
Query Optimization
8. Rule-Based Query Optimization using the Volcano Framework.
Chapter 2 from Multiquery Optimization and Applications,
Prasan Roy, PhD thesis, IIT Bombay, 2000.

Related papers, not required reading:
Talk (ppt)
9. Efficient and Extensible Algorithms for Multi-Query Optimization,
Prasan Roy, S. Seshadri, S. Sudarshan, and Siddhesh Bhobhe,
In ACM SIGMOD Conf. on the Management of Data, 2000.
Talk (ppt)
10. Incorporating Partitioning and Parallel Plans into the SCOPE Optimizer
Jingren Zhou, Per-Ake (Paul) Larson, and Ronnie Chaiken,
in Proc. of the Int'l Conf. on Data Engineering (ICDE), 2010.
Related papers, not required reading:
Talk (pptx)
Adaptive Query Processing
11. Robust Query Processing through Progressive Optimization,
Volker Markl, Vijayshankar Raman, David E. Simmen, Guy M. Lohman, Hamid Pirahesh, SIGMOD 2004: 659-670
Talk (ppt)
12. Plan bouquets: query processing without selectivity estimation.
Anshuman Dutt, Jayant R. Haritsa, SIGMOD Conference 2014: 1039-1050
Talk (ppt)
Main-Memory Databases
13. Hekaton: SQL Server's memory-optimized OLTP engine.
Cristian Diaconu, Craig Freedman, Erik Ismert, Per-Åke Larson, Pravin Mittal, Ryan Stonecipher, Nitin Verma, Mike Zwilling
SIGMOD Conference 2013: 1243-1254
Talk has been put up on Moodle site
14. Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age
Viktor Leis, Peter Boncz, Alfons Kemper, Thomas Neumann, SIGMOD 2014
Papers after this will see some change in Spring 2017. (Papers before this may also see some minor changes.)
Streaming Data
15. Monitoring Streams - A New Class of Data Management Applications,
Donald Carney, Ugur Cetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Greg Seidman, Michael Stonebraker, Nesime Tatbul, Stanley B. Zdonik
VLDB 2002: 215-226
Talk in 2011 (pptx) by Joydip Datta and Debarghya Majumdar, Talk in 2013 (pptx) by Ajay Gupta and Vinit Deodhar, and Talk from 2015 (pptx) by Kuldeep Sharma

Related papers, not required reading:
  • Aurora: A New Model and Architecture for Data Stream Management.
    Abadi, D. J., Carney, D., Cetintemel, U., Cherniack, M., Convey, C., Lee, S., Stonebraker, M., Tatbul, N., and Zdonik, S.
    The VLDB Journal 12 (2003), 120-139.
  • Abadi, D., Ahmad, Y., Balazinska, M., Cetintemel, U., Cherniack, M., Hwang, J.-H., Lindner, W., Maskey, A. S., Rasin, A., Ryvkina, E., Tatbul, N., Xing, Y., and Zdonik, S. The Design of the Borealis Stream Processing Engine. In Proceedings of the 2nd Conference on Innovative Data Systems Research (CIDR) (Jan. 2005), pp. 277-289.
  • Models and issues in data stream systems,
    Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Jennifer Widom PODS 2002
  • Physically Independent Stream Merging
    Badrish Chandramouli, David Maier and Jonathan Goldstein, ICDE 2012
    Talk (pdf) by Amol Bhangdiya and Pushkar Khadilkar, Talk from 2014 (pptx) by Bharath Radhakrishnan
Talk (pptx)
Also an overview talk on data streams: PODS 2002 talk by Motwani
16. MillWheel: Fault-Tolerant Stream Processing at Internet Scale
Tyler Akidau, Alex Balikov, Kaya Bekiroglu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, Sam Whittle
VLDB 2013
Talk presented in 2014 by Mohit Sirohi
Related papers
Talk (pptx) by Mandar Pawar
RDF Processing
17. RDF-3X: a RISC-style Engine for RDF
Thomas Neumann, Gerhard Weikum, VLDB 2008
Talk in 2013 (pptx), Talk in 2014 (odp), Talk in 2015 (odp)
Talk by Indradyumna Roy (pptx)
Query Optimization: Beyond The Box
18. Program Transformations for Asynchronous and Batched Query Submission,
Karthik Ramachandra, Mahendra Chavan, Ravindra Guravannavar, and S. Sudarshan,
IEEE Trans. on Knowledge and Data Engineering (TKDE), pp. 531-544, Vol. 27, No. 2, Feb 2015
Related papers, not required reading:
Talk 1: Overview,
Talk 2: Detailed talk by Neha Garg in pdf and in odp
Streaming Data (Cont.)
19. The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
T. Akidau et al.
PVLDB 2015
Related material online (articles and video):
  1. https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
  2. https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
  3. YouTube video on Google Data Flow
Talk by Samkit Shah
IR and DB
20. Keyword Searching and Browsing in Databases using BANKS
Gaurav Bhalotia, Charuta Nakhe, Arvind Hulgeri, Soumen Chakrabarti and S. Sudarshan, ICDE 2002

Related papers, not required reading:
Overview talk (ppt), Talk by Pooja Agrawal (odp)
Declarative Data Processing (outside of databases)
21. Declarative Networking
Boon Thau Loo, Tyson Condie, Minos Garofalakis, David E. Gay, Joseph M. Hellerstein, Petros Maniatis, Raghu Ramakrishnan, Timothy Roscoe, and Ion Stoica
CACM 52(11), Nov 2009
Talk (pptx) by Harsh Vardhan and Sandeep Joshi
Talk 2014 (pdf) (Akshay Bapat)
Talk (Kuldeep Punjabi)
22. CrowdDB: answering queries with crowdsourcing
Michael J. Franklin, Donald Kossmann, Tim Kraska, Sukriti Ramesh and Reynold Xin
Talk 2014 (and 2014 talk sources) (Tarun Kathuria)
Talk by Pratyaksh Sharma
Test Data Generation
23. Generating Test Data for Killing SQL Mutants: A Constraint-based Approach,
Shetal Shah, S. Sudarshan, Suhas Kajbaje, Sandeep Patidar, Bhanu Pratap Gupta, Devang Vira, ICDE 2011
Talk (ppt) by Shetal Shah in 2013
Related papers, not required reading:
Talk (ppt) from 2014
Talk (pdf) by Saurabh Sarda
24. Discussion on future of data management.