Project Ideas

Project suggestions from past offerings of CS 631: Autumn 2014, Autumn 2013, Autumn 2012, Spring 2012, and Autumn 2010.

The PostgreSQL TODO list, available at http://wiki.postgresql.org/wiki/Todo, is a great source of ideas for PostgreSQL extensions. It is a very detailed list, however, and it will take time to wade through it all; moreover, many of the ideas there are not suited to a course project. So here are a few ideas gleaned from that list which I think may be doable as course projects:

To learn more about PostgreSQL, visit http://www.postgresql.org/docs/current/static/index.html; this site has a tremendous amount of detail, so you cannot read it all, but you should read the parts relevant to your project.
This list is being updated as of Aug 16, 2015; visit again to find new suggestions.

  1. Write-Optimized B+-trees (new)
    We have discussed write-optimized B+-trees in class. See the paper: Goetz Graefe, "Sorting and Indexing with Partitioned B-Trees", Conference on Innovative Data Systems Research, 2003 (download it from http://d-nb.info/99142929X/34), and in particular Section 3.4, which describes how to implement efficient insert operations by using an extra leading attribute that identifies the partition. Implement this idea in PostgreSQL by adding a new index technique, which acts as a layer on top of a normal B+-tree implementation.
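    As a minimal sketch, here is the partition trick expressed at the SQL level; the orders table and its columns are my own illustration, and a real implementation would maintain the artificial leading attribute inside the new index layer, invisible to the user:

    ```sql
    -- Sketch only: the table, its columns, and the artificial part_no
    -- column are assumptions, not part of the project specification.
    CREATE TABLE orders (
        part_no  smallint NOT NULL DEFAULT 0,  -- leading partition attribute
        order_id integer  NOT NULL,
        amount   numeric
    );

    -- One B+-tree, logically split into partitions by the leading key;
    -- inserts go into a fresh, small partition and therefore stay cheap.
    CREATE INDEX orders_pbtree ON orders (part_no, order_id);

    -- Lookups must probe every live partition; merging rows back into
    -- partition 0 later restores single-probe lookups.
    SELECT * FROM orders WHERE part_no IN (0, 1) AND order_id = 42;
    ```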
  2. Operating on HBase Tables (new)
    PostgreSQL, like other databases, requires data to be loaded into the database before it can be queried. See the project on Operating Directly on Text Files for more information on how to get access to text files from PostgreSQL. The goal of this project is more ambitious: it is to allow direct access to data in HBase from PostgreSQL. You will have to make some assumptions about the relations, just as in the case of operating directly on text files. But in addition to supporting scans, HBase provides an index on the stored relation, which you should ideally be able to use. Implement an index wrapper, which acts like a PostgreSQL index but in reality just uses HBase.
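    As a rough sketch of the interface you could aim for, here is what this might look like through PostgreSQL's foreign data wrapper machinery; the wrapper name hbase_fdw and all of its options are hypothetical, and the wrapper itself (including the index-style lookup path) is what you would build:

    ```sql
    CREATE EXTENSION hbase_fdw;  -- hypothetical wrapper, written by you

    CREATE SERVER hbase_server
        FOREIGN DATA WRAPPER hbase_fdw
        OPTIONS (host 'localhost', port '9090');

    -- Assume a fixed schema for the HBase table, as discussed above.
    CREATE FOREIGN TABLE users (
        row_key text,  -- the HBase row key
        name    text,
        email   text
    ) SERVER hbase_server OPTIONS (hbase_table 'users');

    -- A point lookup on the row key should be answered via HBase's own
    -- index (a Get), not by scanning the whole table.
    SELECT name FROM users WHERE row_key = 'u12345';
    ```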
  3. Operating Directly on Text Files (2014)
    There were some implementations of this last year; some did not modify the underlying scan operators, and thus did not meet the requirement, while others did. You are welcome to get code from your seniors (in consultation with your TAs), but do more on proper integration, including cost models for optimization and such. PostgreSQL, like other databases, requires data to be loaded into the database before it can be queried. There are many data sources with textual data, such as log files, which can be parsed and loaded into the database, but this increases the storage cost greatly, in addition to requiring significant load time. Your task here is to allow PostgreSQL to define a view/relation that does not store data, but instead reads data from a file (say a comma-separated values (CSV) file) when required. The papers on this topic describe a lot of optimizations; you do not have to implement all or even most of them: a basic working engine with a few simple optimizations, or even no optimizations, is what I expect out of a course project.
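    PostgreSQL's contrib module file_fdw already provides a bare-bones version of this idea and is worth studying as a starting point; a minimal example (the file path and schema here are my assumptions):

    ```sql
    CREATE EXTENSION file_fdw;

    CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;

    -- A "table" whose rows come from a CSV file on disk:
    CREATE FOREIGN TABLE weblog (
        ts     timestamp,
        url    text,
        status integer
    ) SERVER csv_files
      OPTIONS (filename '/var/log/weblog.csv', format 'csv');

    -- The file is re-read and re-parsed on every query; proper
    -- integration, including cost estimates for the optimizer, is
    -- what this project is about.
    SELECT url, count(*) FROM weblog WHERE status = 404 GROUP BY url;
    ```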
  4. Temporal data support:
    PostgreSQL 9.2 supports a new range type, allows queries to check for range overlap (using &&), and even allows primary-key/foreign-key constraints to be extended to check for temporal (non-)overlap. Using this new feature, add support for syntax to declare relations as temporal, upon which you will add a new valid-time attribute storing a time range. Also perform query rewriting so that joins between two temporal relations are turned into joins that take the ranges into account. If time permits, extend this to other operations such as projection (where you may have to merge ranges) and group by (exercise: figure out how to handle the semantics of group by with time). For more on temporal data, read the relevant section of the Database Concepts book, and then explore more on the net.
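    To get a feel for the building blocks, here is plain PostgreSQL 9.2 SQL using ranges; the tables and columns are my own illustration, and the join shows the kind of rewritten query your system would generate from a plain join of two relations declared temporal:

    ```sql
    CREATE EXTENSION btree_gist;  -- needed for = on scalars in EXCLUDE

    CREATE TABLE emp_salary (
        emp_id     integer,
        salary     numeric,
        valid_time tsrange,
        -- temporal primary key: no overlapping valid times per employee
        EXCLUDE USING gist (emp_id WITH =, valid_time WITH &&)
    );

    CREATE TABLE emp_dept (
        emp_id     integer,
        dept       text,
        valid_time tsrange
    );

    -- A temporal join: match tuples whose valid times overlap, and emit
    -- the intersection of the two ranges as the result's valid time.
    SELECT e.emp_id, e.salary, d.dept,
           e.valid_time * d.valid_time AS valid_time
    FROM   emp_salary e, emp_dept d
    WHERE  e.emp_id = d.emp_id
    AND    e.valid_time && d.valid_time;
    ```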
  5. Optimize or-is-null join (old, done by several teams, but you can still take code from earlier teams and add new features)
    For a keyword querying paper, we had to generate queries of the form SELECT * FROM R, S WHERE R.A = S.A OR R.A IS NULL OR S.A IS NULL. Such a query can be broken up and executed as SELECT * FROM R, S WHERE R.A = S.A UNION SELECT * FROM R, S WHERE R.A IS NULL UNION SELECT * FROM R, S WHERE S.A IS NULL, but this requires multiple scans of the relations. Instead, modify the hash-join operator to directly handle such a join condition efficiently. This project was partly implemented as part of an MTech project last year; extend the existing code to better handle multiple attributes, and remove some inefficiencies. An external research scholar, Pankaj Vanwari (pankajvanwari@gmail.com), can help with some of the details. The earlier project worked by grouping tuples into 2^N groups (where N is the number of join attributes): a tuple belonging to group 101 has attribute B null, whereas A and C are not null. Tuples are joined according to their groups; for example, tuples of group 101 join with tuples of group 110 on A (since 101 AND 110 = 100). The project is expected to build only the required indices on the groups (possibly doing a sequential scan on both relations first to find all groups) and then join each tuple with all relevant groups of the other relation using the appropriate indices. If the number of tuples in a group is too large, we may need to partition the group into pieces that fit into memory; this extension can be done if time permits, but for now you can assume that each group fits in memory.
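    To make the grouping scheme concrete, here is how group numbers could be computed for two join attributes; this SQL is purely illustrative (the real project does the grouping inside the hash-join operator, not by query rewriting):

    ```sql
    -- The two-attribute form of the join condition:
    SELECT * FROM R, S
    WHERE (R.A = S.A OR R.A IS NULL OR S.A IS NULL)
      AND (R.B = S.B OR R.B IS NULL OR S.B IS NULL);

    -- Group number: bit = 1 means the attribute is non-null, so a tuple
    -- with A non-null and B null falls in group 10 (binary).
    SELECT r.*,
           (CASE WHEN r.A IS NOT NULL THEN 2 ELSE 0 END) +
           (CASE WHEN r.B IS NOT NULL THEN 1 ELSE 0 END) AS grp
    FROM   R r;
    -- An R-tuple in group 10 joins with an S-tuple in group 11 on A
    -- alone, since 10 AND 11 = 10; a group-00 tuple joins with every
    -- tuple of S (all its conditions are satisfied via nulls).
    ```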
  6. Row-level security (was there last year, but no one did it, so it counts as new)
    PostgreSQL 9.2 has added new features called security barriers and leakproof functions; these solve a problem reported earlier in this post: http://archives.postgresql.org/pgsql-hackers/2009-10/msg01346.php. The goal of this project is to implement some subset of the syntactic constructs we defined in an ICDE 2007 paper (http://www.cse.iitb.ac.in/~sudarsha/Pubs-dir/FGA-model-ICDE07.pdf) for grants with predicates, implemented with query rewriting using views. Use the new security_barrier feature when defining the views. (Side note: we have a paper on this topic in SIGMOD 2006, along with a talk, if you are interested.)
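    A minimal example of the security_barrier feature itself, with a table, predicate, and policy of my own invention; your rewriting would generate views like this from the grant syntax in the paper:

    ```sql
    CREATE TABLE accounts (
        owner   text,
        balance numeric
    );

    -- Without security_barrier, a user-supplied function in a query over
    -- the view may be evaluated before the view's filter and leak rows;
    -- with it, the view's filter is always applied first.
    CREATE VIEW my_accounts WITH (security_barrier) AS
        SELECT owner, balance FROM accounts
        WHERE owner = current_user;

    GRANT SELECT ON my_accounts TO public;
    ```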

Negative Topic List (I don't want projects on these topics since they have been done recently)

  1. In-memory Column Store (Deleted)
    Columnar representations have become very popular for in-memory data due to the opportunities for compression that they offer. An in-memory columnar representation has been implemented as a patch on PostgreSQL already, and you can find more info at http://www.pgcon.org/2014/schedule/attachments/322_IMCS.pdf. But you can still implement a subset of the features described in the above presentation on your own.
  2. Division operation (Deleted)
    This has been done by many groups over the past few years, so I don't want yet another implementation.
  3. Cube/Rollup/Grouping Sets extensions to group by
    This has been done by many groups over the past few years, so I don't want yet another implementation.
  4. Table partitioning (Deleted)
    This has been done by many groups over the past few years, so I don't want yet another implementation.
  5. Table partitioning for parallelism
    Table partitioning allows a table to be split into multiple pieces, which can be stored as separate relations on a single machine (last year's project, now deleted since too many people did it), or on different nodes in a parallel system. For details on table partitioning in PostgreSQL, see https://wiki.postgresql.org/wiki/Table_partitioning. One goal is to extend table partitioning to handle multiple tables that are related, in the context of a parallel database. Partitioning is specified for one of the tables; let's call it the master table. Each partition is to be placed on a different server. Other tables that are linked to the master table by foreign keys then get partitioned automatically, based on the partitioning of the master table. For example, if the master table is user, a table photos that is linked to user by a foreign key will get partitioned along with user. Tables referenced by the master table get logically replicated to all partitions; for example, a table user_category referenced by user can be assumed to be replicated at all nodes. The second goal of this project could be any one of the following (a co-partitioning sketch in SQL follows the list):
    1. to perform query rewriting, so that a query accesses only the partitions it needs. You should be able to handle join queries.
    2. to implement the partitioning by sending insert/update/read requests to an appropriate copy of PostgreSQL. There is no need to do any query rewriting, and you can restrict attention to single-table selection queries, but you should have at least two copies of PostgreSQL working together, sending requests to each other.
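    Here is a sketch of co-partitioning at the SQL level using table inheritance; the user/photos names follow the example above, while the partition bounds and single-node setup are my simplifications (in the parallel setting each partition would live in a different PostgreSQL instance):

    ```sql
    CREATE TABLE users (user_id integer PRIMARY KEY, name text);

    -- One partition of the master table; the CHECK constraint tells
    -- the planner which rows this partition can hold.
    CREATE TABLE users_p1 (CHECK (user_id BETWEEN 1 AND 1000000))
        INHERITS (users);

    -- photos references users, so it is partitioned the same way:
    CREATE TABLE photos (photo_id integer,
                         user_id  integer REFERENCES users);
    CREATE TABLE photos_p1 (CHECK (user_id BETWEEN 1 AND 1000000))
        INHERITS (photos);

    -- With constraint exclusion enabled, partitions whose CHECK
    -- constraint contradicts the query's predicate are skipped:
    SET constraint_exclusion = on;
    SELECT * FROM users u, photos p
    WHERE  u.user_id = p.user_id AND u.user_id = 42;
    ```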