PostgreSQL Project Suggestions

Project Ideas

Project suggestions from CS 631 offering in past semester: Autumn 2012 , Spring 2012 , and Autumn 2010 .
The PostgreSQL TODO list, available at http://wiki.postgresql.org/wiki/Todo is a great source of ideas for PostgreSQL extensions. It is a very detailed list however, and it will take time to wade through it in detail, plus many of the ideas there are not suited for a course project. So here are a few ideas gleaned from that list which I think may be doable as a course project:
This list is being updated as of Monday Sep 16, 2013, visit again to find new suggestions
1. Temporal data support: (new) PostgreSQL 9.2 supports a new range type, and allows queries to check for range overlap (using &&) and evem allows primary key/foreign key constraints to be extended to check for temporal (non)overlap. Using this new feature, add support for syntax to declare relations as temporal, upon which you will add a new valid-time attribute for storing time range. Also perform query rewriting so that joins between two temporal relations are turned into joins with the ranges taken into account. If time permits extend to other operations such as projection (where you may have to merge ranges) and group by (exercise: figure out how to handle semantics of group by with time). For more on temporal data, read the relevant section of the Database Concepts book, and then explore more on the net.
2. Optimize or-is-null join (updated from last semesters version) For a keyword querying paper, we had to generate queries of the form SELECT * from R, S WHERE R.A=S.A or R.A is null or S.A is null. Such a query can be broken up and executed as SELECT * from R, S WHERE R.A=S.A UNION SELECT * from R, S WHERE R.A is null UNION SELECT * from R, S WHERE S.A is null. but this requires multiple scans of the relations. Instead, modify the hash-join operator to directly handle such a join condition efficiently. This project was partly implemented as part of an MTech project last year. Extend the existing code to better handle multiple attributes, and remove some inefficiencies. An external research scholar, Pankaj Vanwari (pankajvanwari@gmail.com) can help with some of the details. This project was done earlier by grouping tuples into 2^N groups (where N is the number of attributes). A tuple that belongs to group 101 has attribute B as null whereas A and C are not null. Joining tuples is as per its group, say tuples of group 101 joins with tuples of group 110 on A (Since 101 “AND ” 110 = 100). It is expected that the project builds only the required indices on all groups (by possibly doing a sequential scans on both relations to first find all groups) and then join a tuple with all relevant groups of the other relation using appropriate indices. If number of tuples in a group are more then we may need to partition the group into pieces that fit into memory; this extension can be done if time permits, for now you can assume that each groups fit in memory.
3. Row level security (changed from last semester) PostgresQL 9.2 has added a new feature called security barrier and leakproof; see http://archives.postgresql.org/pgsql-hackers/2009-10/msg01346.php section on Security barriers and Leakproof. This solves a problem reported earlier in the post: http://archives.postgresql.org/pgsql-hackers/2009-10/msg01346.php The goal of this project is to implement some subset of the syntactic constructs we have defined in an ICDE 2007 paper http://www.cse.iitb.ac.in/~sudarsha/Pubs-dir/FGA-model-ICDE07.pdf for grants with predicates, implemented by with query rewriting using views. Use the new security_barrier feature when defining the views. (Side note: We have a paper on this topic in SIGMOD 2006, which you can find here, along with a talk here if you are interested.)
4. Division operation (new) Implement the relational algebra division operator using sorting. Also add syntactic support to SQL to allow users to specify the division operator (think of it as a variation of the outer join operation, and follow the code for outerjoin to figure out how to modify the parser and planner so that the operation is added to the execution plan). You may benefit from reading the code for the extended join operator (or-is-null, above) from last year, which had to do similar modifications.
5. Table partitioning A (new) Many database support table partitioning, where the rows are partitioned based on some condition, such as year, and stored separately. A query that specifies a year need only access one partition, and not others, and can thus run much more efficiently. Indices are created separately on each partition. Such partitioning is particularly useful for enterprise information systems, where most queries access only relatively recent data (typically the past year or two at most). (Queries that span partitions of course have to access all partitions, and thus pay a price.) Some databases support specific syntax for declaring a table as partitioned (see Oracle or SQL Server documentation for syntax). PostgreSQL does not have direct syntax for this, but provides a way to implement it using table inheritance. See http://www.postgresql.org/docs/9.2/static/ddl-partitioning.html for details. Queries that need to access only one partition (e.g. on year) get rewritten to run on only that subtable, not the entire table. The goal of this project is to provide syntactic support for partitioning along the lines of Oracle or SQL Server, and to automatically create the required subtables and constraints, and triggers to retarget inserts to the correct subtable. See the above web page for details. You should also create new "system" relations to track the metadata for partitioned tables, so one can check if a table is partitioned, and on what conditions, and what are the subtables. Also, when indices are created on the main table, automatically create the same index on subtables (to do this, modify the index creation code to check if the table is partitioned, and if so create indices on the subtables instead).
6. Table partitioning B: partitioning for parallelism New for this year, the goal is to extend table partitioning to handle multiple tables which are related, in the context of a parallel database. Partitioning is specified for one of the tables, lets call it the master table. Each partition is to be placed on a different server. Other tables that are linked to this table by foreign keys referencing the master table then get partitioned automatically, based on the partitioning of the master table. E.g. if the master table is user, a table photos which is linked to the user by a foreign key will get partitioned with the user. Tables referenced by the master table get logically replicated to all partitions. For example, a table user_category referenced by user can be assumed to be replicated at all nodes. The goal of this project could be any one of the following:
  1. to perform query rewriting, so a query only accesses partitions that it needs to access. You should be able to handle join queries.
  2. to implement the partitioning by sending insert/update/read requests to an appropriate copy of PostgreSQL. No need to do any query rewriting, you can restrict attention to single table selection queries, but you should have at least two copies of PostgreSQL working together, sending requests to each other.