Project Ideas
- Project suggestions from the CS 631 offering in Spring 2012 are here.
- Some project suggestions from the CS 631 offering in Autumn 2010 are here.
- The PostgreSQL TODO list, available at
http://wiki.postgresql.org/wiki/Todo
is a great source of ideas for PostgreSQL extensions. It is a very detailed list, however, and it takes time to wade through; moreover, many of the ideas there are not suited to a course project. So here are a few ideas gleaned from that list which I think may be doable as a course project.
This list is being updated; visit again to find new suggestions.
- Temporal data support:
(new)
PostgreSQL 9.2 supports a new range type, allows queries to check for range overlap (using &&), and even allows primary key/foreign key constraints to be extended to check for temporal (non-)overlap.
Using this new feature, add support for syntax to declare relations as temporal; for such relations you will add a new valid-time attribute that stores a time range.
Also perform query rewriting so that joins between two temporal relations are turned into joins that take the ranges into account.
If time permits, extend this to other operations such as projection (where you may have to merge ranges) and group by (exercise: figure out how to handle the semantics of group by with time).
For more on temporal data, read the relevant section of the Database Concepts book,
and then explore more on the net.
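As a minimal sketch of the rewriting target, assuming hypothetical tables emp and emp_dept that each carry a tsrange valid-time column, a join between the two temporal relations could be rewritten to add an overlap condition and to intersect the ranges:

```sql
-- Hypothetical schema: "valid" is the valid-time attribute the project adds.
CREATE TABLE emp      (id int, salary int, valid tsrange);
CREATE TABLE emp_dept (id int, dept text,  valid tsrange);

-- The rewritten join requires overlapping valid times (&&) and outputs the
-- intersection (*) of the two ranges as the valid time of the result:
SELECT e.id, e.salary, d.dept, e.valid * d.valid AS valid
FROM emp e
JOIN emp_dept d ON e.id = d.id AND e.valid && d.valid;
```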
- Index tuning wizard Part A
(new)
The goal is to find which indices will best benefit a given workload (set) of queries.
Your program will take as input a set of queries/updates, and a list of potential indices,
and for each of the indices, find out its benefit. To do so, you will create a fake
index, by modifying the index creation code to add the index to the catalog, but not
actually create it. Then find the optimal plan for each query by invoking the optimizer.
The modification to the database itself is to add support (starting from the syntax)
for creating and dropping fake indices. Then, external to the database, you invoke
EXPLAIN to find the estimated cost of each query before and after adding each fake index;
the difference gives the benefit of the index for the query (which can be negative for updates).
Add up the benefits across all queries to find the overall benefit of the index.
Choose the index that gives maximum benefit, and if the benefit is positive,
the index is created (in fake mode). This process can be repeated to choose
further indices. At the end drop all the fake indices, and output
the benefits found for each index when it was chosen, and output a script
to create the actual indices.
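The driver loop might look as follows; note that the FAKE keyword is hypothetical syntax that this project would add, not something PostgreSQL supports:

```sql
-- Hypothetical syntax added by this project: the index is entered into the
-- catalog but never actually built, so the optimizer will consider it.
CREATE FAKE INDEX r_a_idx ON r(a);

-- Estimated cost with the fake index present, from the ordinary optimizer;
-- compare against the cost obtained before creating the fake index.
EXPLAIN SELECT * FROM r WHERE a = 42;

-- Once the winning indices are chosen, drop the fakes and emit a script of
-- real CREATE INDEX statements for them.
DROP FAKE INDEX r_a_idx;
```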
- Index tuning wizard Part B
(new)
This part goes over the queries in a workload and finds candidate indices,
i.e., indices that have the potential to benefit one or more queries in the workload.
Candidate indices are on attributes with selection predicates, and on
attributes that participate in join conditions. If selections/joins involve
multiple attributes, candidate indices can correspondingly be on multiple attributes.
Your code should also work with subqueries, and correlation conditions in subqueries.
Add intelligence to eliminate indices where the statistics show that there are too
many duplicate values, so the index is not likely to be beneficial.
Also add intelligence to merge indices: e.g., if query 1 needs an index on A,
and query 2 needs one on {A,B} (i.e., order is not relevant), then a single index
on (A,B) will suffice (if query 2 needs an index on (B,A) in that order, then
we cannot merge the indices).
You will have to deal primarily with the parser and the statistics module.
If time permits, also add indices that can support index-only plans.
E.g. given an index on attributes (A,B) a query SELECT B FROM r WHERE A = 4
can run without accessing the relation, using only the index.
Thus you should consider the index (A,B) for this query even though B is not
used in a selection or join.
Index-only plans were not available earlier in PostgreSQL, but have been added in PostgreSQL version 9.2
(see http://wiki.postgresql.org/wiki/Index-only_scans).
The problem with using index-only scans is that certain tuples may not be visible to certain transactions
due to snapshot isolation, and the check for visibility of a tuple requires accessing the tuple.
PostgreSQL 9.2, which was released on Sep 10, 2012, includes some clever tricks to make
index-only scans work, by keeping track of which pages contain only tuples that are visible to all transactions.
PostgreSQL 9.2 can be downloaded from the PostgreSQL web site.
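To see the effect, a short sketch (assuming a hypothetical table r(a, b)):

```sql
CREATE INDEX r_ab_idx ON r(a, b);

-- VACUUM updates the visibility map, which is what lets PostgreSQL 9.2
-- skip the heap visibility check for all-visible pages:
VACUUM r;

-- Both attributes are covered by the index, so the plan may show an
-- "Index Only Scan" that never touches the relation itself:
EXPLAIN SELECT b FROM r WHERE a = 4;
```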
- Optimize or-is-null join
(updated from last semester's version)
For a keyword querying paper, we had to generate queries of the form
SELECT * from R, S WHERE R.A=S.A or R.A is null or S.A is null.
Such a query can be broken up and executed as
SELECT * from R, S WHERE R.A=S.A UNION
SELECT * from R, S WHERE R.A is null UNION
SELECT * from R, S WHERE S.A is null.
but this requires multiple scans of the relations. Instead, modify one of the join operators
(say, merge join) to directly handle such a join condition efficiently.
You may assume that the tuples with null values will fit in memory.
This project was done by two groups last year, for single attributes. Extend their code
to handle multiple attributes, where some subset may be null.
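For the multi-attribute extension, one plausible form of the target condition (assuming join attributes A and B) applies the or-is-null pattern per attribute, so that any subset of the attributes may be null:

```sql
-- Sketch of a two-attribute or-is-null join; the modified join operator
-- should evaluate this in a single pass over R and S:
SELECT *
FROM R, S
WHERE (R.A = S.A OR R.A IS NULL OR S.A IS NULL)
  AND (R.B = S.B OR R.B IS NULL OR S.B IS NULL);
```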
- Row level security
(changed from last semester)
PostgreSQL 9.2 has added new features called security barriers and leakproof functions; see
http://archives.postgresql.org/pgsql-hackers/2009-10/msg01346.php
(section on security barriers and leakproof).
This solves a problem reported earlier in the post:
http://archives.postgresql.org/pgsql-hackers/2009-10/msg01346.php
The goal of this project is to implement some subset of the syntactic constructs we have
defined in an ICDE 2007 paper
http://www.cse.iitb.ac.in/~sudarsha/Pubs-dir/FGA-model-ICDE07.pdf
for grants with predicates, implemented via query rewriting using views. Use the
new security_barrier feature when defining the views.
(Side note: We have a paper on this topic in SIGMOD 2006, which you can find here, along with a talk here, if you are interested.)
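As a minimal sketch of the rewriting target, assuming a hypothetical accounts table where each user may see only their own rows, a predicate grant could be implemented as a security-barrier view:

```sql
-- The security_barrier option (new in PostgreSQL 9.2) prevents the planner
-- from pushing possibly-leaky user functions below the view's predicate:
CREATE VIEW my_accounts WITH (security_barrier) AS
    SELECT * FROM accounts
    WHERE owner = current_user;

-- Grant access to the view rather than the base table:
REVOKE ALL ON accounts FROM PUBLIC;
GRANT SELECT ON my_accounts TO PUBLIC;
```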
- Division operation
(new)
Implement the relational algebra division operator using sorting.
Also add syntactic support to SQL to allow users to specify the division operation
(think of it as a variation of the outer join operation, and follow the code
for outer join to figure out how to modify the parser and planner so that the
operation is added to the execution plan).
You may benefit from reading the code for the extended join operator (or-is-null, above)
from last year, which had to do similar modifications.
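For reference, the result your operator should compute can be specified in plain SQL (assuming relations r(a, b) and s(b)); the project is to produce the same result with a dedicated sort-based operator rather than this nested formulation:

```sql
-- r(a, b) divided by s(b): the a values paired in r with *every* b in s.
SELECT DISTINCT r1.a
FROM r r1
WHERE NOT EXISTS (
    SELECT b FROM s
    EXCEPT
    SELECT r2.b FROM r r2 WHERE r2.a = r1.a
);
```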
- Table partitioning
(new)
Many databases support table partitioning, where the rows are partitioned based on some
condition, such as year, and stored separately. A query that specifies a year
need only access one partition, and not others, and can thus run much more efficiently.
Indices are created separately on each partition. Such partitioning is particularly
useful for enterprise information systems, where most queries access only relatively
recent data (typically the past year or two at most). (Queries that span partitions
of course have to access all partitions, and thus pay a price.)
Some databases support specific syntax for declaring a table as partitioned
(see Oracle or SQL Server documentation for syntax).
PostgreSQL does not have direct syntax for this, but provides a way to implement
it using table inheritance. See
http://www.postgresql.org/docs/9.2/static/ddl-partitioning.html
for details. Queries that need to access only one partition (e.g. on year) get
rewritten to run on only that subtable, not the entire table.
The goal of this project is to provide syntactic support for partitioning along the lines
of Oracle or SQL Server, and to automatically create the required subtables and constraints,
and triggers to retarget inserts to the correct subtable. See the above web page for
details. You should also create new "system" relations to track the metadata
for partitioned tables, so one can check whether a table is partitioned, on
what conditions, and with which subtables.
Also, when indices are created on the main table, automatically create the
same index on subtables (to do this, modify the index creation code to check
if the table is partitioned, and if so create indices on the subtables
instead).
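Following the inheritance scheme described on the page above, the subtables and trigger your syntax would generate might look like this (table and partition names are illustrative):

```sql
CREATE TABLE measurement (city_id int, logdate date, peaktemp int);

-- One subtable per partition; the CHECK constraint lets the planner skip
-- partitions that cannot match a query's predicate:
CREATE TABLE measurement_y2012 (
    CHECK (logdate >= DATE '2012-01-01' AND logdate < DATE '2013-01-01')
) INHERITS (measurement);

-- A trigger retargets inserts on the parent to the correct subtable:
CREATE FUNCTION measurement_insert() RETURNS trigger AS $$
BEGIN
    IF NEW.logdate >= DATE '2012-01-01' AND NEW.logdate < DATE '2013-01-01' THEN
        INSERT INTO measurement_y2012 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'no partition for date %', NEW.logdate;
    END IF;
    RETURN NULL;  -- suppress the insert into the parent table
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER measurement_insert_trigger
    BEFORE INSERT ON measurement
    FOR EACH ROW EXECUTE PROCEDURE measurement_insert();
```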
- Monitor row and cost estimates
(dropped, since this is already available using the EXPLAIN (ANALYZE, BUFFERS) command,
but do try it out for fun)