CS631: ITDBMS Project

Requirements

Note: papers from ACM or IEEE sites can be accessed through the IITB library web site (library.iitb.ac.in)

Old Projects (from 2005): Project ideas from 2005, and Project titles/groups from 2005.

PostgreSQL Projects

Note: If you choose to do one of these projects, first set up a copy of PostgreSQL source on your computer, compile it and get it running. Then start modifying it to implement your chosen project.

This list is incomplete; more topics will be added shortly.

  1. Materialized view definition and maintenance
    Modify the code for defining views so that it also creates an actual table, and modify the insert/delete/update code paths so that they also perform view maintenance. (** tips on how to locate appropriate functions in pgsql will be added here).

    At a minimum you should be able to handle select/project/join queries with simple join conditions and two relations. Ideally you should be able to handle any number of relations and aggregation also.

    Don't worry too much about the efficiency of your incremental view maintenance implementation, but there are extra marks for taking care of join orders in view maintenance.
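    A minimal sketch of the insert case, on plain Python data rather than pgsql internals, may help fix ideas: for a view V = R JOIN S, an insertion delta_R into R contributes exactly delta_R JOIN S to V. All names below are invented for illustration.

```python
# Incremental maintenance sketch for a two-relation join view.
# Relations are lists of dicts; the "view table" is just another list.

def maintain_on_insert(delta_r, s, join_attr, view):
    """Apply delta_V = delta_R JOIN S to the materialized view (insert case)."""
    for r_tup in delta_r:
        for s_tup in s:
            if r_tup[join_attr] == s_tup[join_attr]:
                view.append({**r_tup, **s_tup})  # merged (joined) tuple

# View over R(id, x) JOIN S(id, y); insert one tuple into R.
S = [{"id": 1, "y": "a"}, {"id": 2, "y": "b"}]
view = []
maintain_on_insert([{"id": 1, "x": 10}], S, "id", view)
# view now holds the single joined tuple for id = 1
```

    Deletions are symmetric (remove delta_R JOIN S from the view), and an update can be modeled as a delete followed by an insert.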

  2. Query optimization with materialized views
    Modify the query optimizer code to consider rewriting the query using views. Look at Surajit Chaudhuri and Kyuseok Shim's paper on how to do this (**reference to be added).
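    As a toy illustration of the kind of containment test such a rewrite needs, the sketch below checks whether a view scans a subset of the query's tables, applies no predicate the query does not also apply, and exposes every column the query still needs. The query/view representation here is invented for illustration; it is not PostgreSQL's.

```python
# Toy containment check: can this materialized view replace its tables
# in the given query?  Sets of table names, predicates, and column names
# stand in for the optimizer's real structures.

def view_covers(query, view):
    return (view["tables"] <= query["tables"]              # subset of tables
            and view["preds"] <= query["preds"]            # no extra filtering
            and query["needed_cols"] & view["table_cols"]  # needed columns...
                <= view["out_cols"])                       # ...are retained

q = {"tables": {"R", "S"}, "preds": {("R.a", "=", "S.a")},
     "needed_cols": {"R.b", "S.c"}}
v = {"tables": {"R", "S"}, "preds": {("R.a", "=", "S.a")},
     "table_cols": {"R.a", "R.b", "S.a", "S.c"}, "out_cols": {"R.b", "S.c"}}
print(view_covers(q, v))  # True: the view can be substituted for R JOIN S
```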
  3. Adaptive join algorithm
    Modify the indexed nested loops join algorithm to do batch-wise processing if the number of results fetched increases beyond some preset limit. That is, instead of doing an index lookup for each outer tuple, collect a batch of outer tuples (of some predefined batch size), access the index to find the record ids, sort on record id, and then fetch the records. PostgreSQL does something like this in its bitmap-index-scan + bitmap-heap-scan optimization, but does not adapt. (**refs to work in this area to be added)
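    The batch-wise variant can be sketched as follows, assuming the index maps a key to a record id (rid) and the heap is a rid-to-record mapping; all names here are invented, not PostgreSQL's.

```python
# Batch-wise index nested loops join: defer heap fetches, then do them
# in rid order so heap pages are touched sequentially (like a bitmap
# heap scan).

def _flush(batch, heap):
    # Sort the deferred (rid, outer tuple) pairs on rid, then fetch and merge.
    return [{**o, **heap[rid]} for rid, o in sorted(batch, key=lambda p: p[0])]

def batched_index_join(outer, index, heap, key, batch_size=4):
    results, batch = [], []
    for o in outer:
        rid = index.get(o[key])      # index lookup only; no heap fetch yet
        if rid is not None:
            batch.append((rid, o))
        if len(batch) >= batch_size:
            results.extend(_flush(batch, heap))
            batch = []
    results.extend(_flush(batch, heap))  # flush the final partial batch
    return results

outer = [{"k": "a"}, {"k": "b"}]
joined = batched_index_join(outer, {"a": 1, "b": 0},
                            {0: {"name": "x"}, 1: {"name": "y"}},
                            "k", batch_size=2)
# results come out in rid order within each batch, not outer order
```

    The adaptive part of the project is switching from tuple-at-a-time lookups to this batched mode once the result count crosses the preset limit.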
  4. Publish-subscribe replication in PostgreSQL
    Implement a mechanism whereby an instance of PostgreSQL can publish certain relations, and other instances can subscribe to them. In practical terms, the publisher keeps a list of subscribers and what they are subscribed to. Whenever an update happens on a subscribed relation, the update is sent to the subscribers.

    You can assume that the update can be sent over a socket connection (which the subscriber should be listening on), or even over HTTP (using a servlet at the subscriber).

    Make sure that if the subscriber is down temporarily, the backlog of updates is sent when it becomes reachable again (unless the backlog is too big). Note that this replication is not transactional. To find out when a relation is updated, see the tips in the materialized view maintenance project. Note that updates by uncommitted transactions may be sent as a result; extra credit for ensuring that only committed updates are sent to the subscriber.

    Tip: Slony-I is replication software designed for PostgreSQL. It is not full-featured compared to the pub-sub support in commercial databases, but take a look at it to see what features it supports. (You don't have to implement all of them!)
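    The publisher-side bookkeeping might be sketched as below, with a bounded per-subscriber backlog; the `send` callable stands in for the socket/HTTP delivery, and all names and the backlog policy are assumptions of this sketch.

```python
# Publisher-side sketch: subscriptions, per-subscriber backlog, and
# redelivery when a previously-down subscriber becomes reachable.

MAX_BACKLOG = 1000  # beyond this, drop the backlog (subscriber must resync)

class Publisher:
    def __init__(self):
        self.subs = {}     # subscriber id -> set of subscribed relations
        self.backlog = {}  # subscriber id -> list of pending updates

    def subscribe(self, sub, relation):
        self.subs.setdefault(sub, set()).add(relation)
        self.backlog.setdefault(sub, [])

    def on_update(self, relation, update, send):
        for sub, rels in self.subs.items():
            if relation not in rels:
                continue
            self.backlog[sub].append((relation, update))
            if len(self.backlog[sub]) > MAX_BACKLOG:
                self.backlog[sub].clear()  # backlog too big; give up on it
                continue
            try:
                while self.backlog[sub]:   # drain backlog, oldest first
                    send(sub, self.backlog[sub][0])
                    self.backlog[sub].pop(0)
            except ConnectionError:
                pass  # subscriber down; keep backlog for the next update

delivered = []
p = Publisher()
p.subscribe("s1", "emp")
p.on_update("emp", ("insert", 1), lambda s, m: delivered.append((s, m)))
```

    Delivery only being attempted when a new update arrives is a simplification; a real implementation would also retry on a timer.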

  5. Index Tuning
    Index tuning is a hard problem, but here you need to do only part of the job. First figure out how to create hypothetical indices (i.e. indices that are not actually present, but that you fool the optimizer into thinking are there). Then, given a query or a set of queries, you can find out the estimated benefit of adding an index. Write simple heuristics that look at a query and decide what indices may be useful. Then write a simple heuristic for choosing good indices (a greedy heuristic is a good option). (**refs to papers on index tuning will be provided)
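    The greedy step might look like the sketch below, where `benefit` stands in for costing the workload against a set of hypothetical indices; the toy benefit numbers and all names are invented for illustration.

```python
# Greedy index selection: repeatedly add the candidate index with the
# highest marginal benefit, until nothing helps or the budget is used up.

def greedy_pick(candidates, benefit, budget):
    chosen = set()
    while len(chosen) < budget:
        best, best_gain = None, 0
        for c in candidates - chosen:
            gain = benefit(chosen | {c}) - benefit(chosen)  # marginal benefit
            if gain > best_gain:
                best, best_gain = c, gain
        if best is None:   # no remaining candidate improves the workload
            break
        chosen.add(best)
    return chosen

# Toy benefit function: an index on 'a' saves 10, on 'b' saves 4,
# and together they save 12 (their benefits overlap).
def benefit(s):
    return {frozenset(): 0, frozenset({"a"}): 10, frozenset({"b"}): 4,
            frozenset({"a", "b"}): 12}[frozenset(s)]

print(greedy_pick({"a", "b"}, benefit, budget=2))  # picks 'a' first, then 'b'
```

    Note that the greedy choice uses marginal benefit relative to the indices already chosen, which is what makes hypothetical-index costing (rather than costing each index in isolation) worthwhile.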
  6. Query decorrelation for PGSQL (***)
    Queries with nested subqueries can be implemented using nested iteration. This is often very expensive, and many nested subqueries can be transformed into queries with joins/outerjoins. Read the Galindo-Legaria/Joshi paper (see below) and implement their decorrelation technique in PostgreSQL. You will have to understand how PostgreSQL parses and represents queries, and then add a stage just after parsing to perform decorrelation. Decorrelation can always be done; don't worry about whether it will provide a performance benefit.
    Reference: Cesar A. Galindo-Legaria, Milind Joshi: Orthogonal Optimization of Subqueries and Aggregation. SIGMOD Conference 2001.
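    To see what the transformation achieves, here is the same query evaluated both ways on plain Python data (not PostgreSQL structures): a correlated subquery computing each group's average is replaced by a one-pass aggregate plus a join-like lookup.

```python
# Query: employees earning more than their department's average salary.
emp = [("d1", "alice", 50), ("d1", "bob", 70), ("d2", "carol", 40)]

# Nested iteration: the "subquery" (the average) is re-evaluated per
# outer tuple -- quadratic in the size of emp.
nested = [(d, n, s) for (d, n, s) in emp
          if s > (sum(x[2] for x in emp if x[0] == d)
                  / len([x for x in emp if x[0] == d]))]

# Decorrelated: compute the aggregate once per group, then filter via
# a lookup (conceptually, a join with the aggregated relation).
totals = {}
for d, _, s in emp:
    tot, cnt = totals.get(d, (0, 0))
    totals[d] = (tot + s, cnt + 1)
avg = {d: tot / cnt for d, (tot, cnt) in totals.items()}
decorrelated = [(d, n, s) for (d, n, s) in emp if s > avg[d]]

assert nested == decorrelated  # same answer, one pass over the data
```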
  7. (new) Pivot Operation
    Implement the pivot operation. Given, for example, a table R(A, B, C), where (A, B) forms a key for R, the operation pivot[A][B](R) gives a table with attributes (A, V1, V2, ..., Vn), where V1..Vn are the distinct values that occur in column B. For each A value a, the result tuple for a contains the value c in column Vi if the tuple (a, Vi, c) occurs in R. In general, we can have multiple columns in place of A, but B and C can be assumed to be single columns. Such an operation can give a table such as
    Year   Q1   Q2   Q3  Q4
    ------------------------
    2005   11   22   15  25
    2006   10   20   18  27
    
    when we apply pivot[Year][Quarter](sales) on a relation sales of the form
    Year  Quarter  Sales
    2005     Q1     11
    2005     Q2     22
    2005     Q3     15
    ...
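    The operation itself can be sketched in a few lines, assuming R is a list of (a, b, c) tuples with (a, b) a key; missing (a, b) combinations come out as None, and all names are invented for illustration.

```python
# pivot[A][B](R) on C: B's distinct values become output columns.

def pivot(r):
    headers = sorted({b for _, b, _ in r})   # distinct B values, in order
    rows = {}
    for a, b, c in r:
        rows.setdefault(a, {})[b] = c        # (a, b) is a key, so no clashes
    return headers, [(a,) + tuple(vals.get(h) for h in headers)
                     for a, vals in sorted(rows.items())]

sales = [(2005, "Q1", 11), (2005, "Q2", 22), (2005, "Q3", 15),
         (2005, "Q4", 25), (2006, "Q1", 10)]
headers, table = pivot(sales)
# headers lists the quarters; the 2006 row has None for the missing quarters
```

    Note that the output schema depends on the data (the distinct B values), which is what makes pivot awkward to express in standard relational terms.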
    
  8. (new) Array Index
    Although B-tree indices are quite efficient, they cannot beat the speed of array indexing. For example, suppose you wish to find the seats available in a particular train on a particular day, from a relation avail(train, date, seats). If the data is organized as a 2-d array, with train numbers mapped to integers (using a separate lookup array) and dates similarly mapped to integers (assume only 60 days ahead are required), lookup will be very fast. This project is to build such an index for lookups on two columns (at least; make it more general if you can), to find the value of another column (or more generally, of multiple other columns). Extend the indexing mechanisms of PostgreSQL to implement such an index. When data is updated, the index must also be updated. For the purpose of this project, don't worry about (a) index concurrency control or (b) recovery/logging. In fact, you can simply build the index in memory, with nothing on disk, but make sure the index gets updated (and preferably, recreated if the database is restarted). If time permits, show how this index can be integrated with the optimizer and get used in query plans. If you are out of time, simply show a direct lookup on the index.
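    The in-memory data structure is simple; a sketch for avail(train, date, seats) follows, where the dict-based train lookup, the 60-day window, and all names are assumptions of this sketch rather than anything in PostgreSQL.

```python
from datetime import date

DAYS_AHEAD = 60  # only dates within 60 days of the base date are indexed

class ArrayIndex:
    """2-d array index on (train, date), storing the seats column."""

    def __init__(self, trains, base_date):
        self.train_no = {t: i for i, t in enumerate(trains)}  # lookup table
        self.base = base_date
        self.cells = [[None] * DAYS_AHEAD for _ in trains]

    def _slot(self, train, date_):
        return self.train_no[train], (date_ - self.base).days

    def update(self, train, date_, seats):  # keep the index in sync
        i, j = self._slot(train, date_)
        self.cells[i][j] = seats

    def lookup(self, train, date_):         # O(1): two array accesses
        i, j = self._slot(train, date_)
        return self.cells[i][j]

idx = ArrayIndex(["12951", "12301"], date(2006, 1, 1))
idx.update("12951", date(2006, 1, 5), 42)
```

    The real work in the project is on either side of this structure: hooking updates of avail into `update`, and (if time permits) teaching the optimizer to generate a plan that calls `lookup`.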

Generic Internals Projects