Project Ideas

  1. Some project suggestions from the CS 631 offering in Autumn 2010 are available here.
  2. The PostgreSQL TODO list, available at http://wiki.postgresql.org/wiki/Todo, is a great source of ideas for PostgreSQL extensions. It is a very detailed list, however, and it will take time to wade through it; moreover, many of the ideas there are not suited to a course project. So here are a few ideas gleaned from that list which I think may be doable as a course project:
    1. Materialized views: there have been implementations based on creating triggers, including past CS 631 projects that create the triggers (a minimal trigger-based sketch appears below). You can take a past project (contact your TAs) and extend it, or do a project that directly modifies the update-processing code to maintain views just as indices are maintained.

      Another alternative is to modify the optimizer to introduce the materialized view when applicable, even if it is not explicitly used in the query (we discussed this in the query optimization chapter).

      For more information, see http://wiki.postgresql.org/wiki/Materialized_Views and also http://archives.postgresql.org/pgsql-hackers/2010-04/msg00479.php
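
      As a starting point, here is a minimal sketch of trigger-based maintenance for an aggregate view; the account and branch_total tables are hypothetical, and deletes and updates would need analogous triggers:

        -- Summary "view" maintained by triggers on the base table.
        CREATE TABLE account (branch_name text, balance numeric);
        CREATE TABLE branch_total (branch_name text PRIMARY KEY, total numeric);

        CREATE FUNCTION branch_total_ins() RETURNS trigger AS $$
        BEGIN
          -- Fold the new row into the running total for its branch.
          UPDATE branch_total SET total = total + NEW.balance
            WHERE branch_name = NEW.branch_name;
          IF NOT FOUND THEN
            INSERT INTO branch_total VALUES (NEW.branch_name, NEW.balance);
          END IF;
          RETURN NEW;
        END;
        $$ LANGUAGE plpgsql;

        CREATE TRIGGER account_ins AFTER INSERT ON account
          FOR EACH ROW EXECUTE PROCEDURE branch_total_ins();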

    2. Simplify text search syntax. See http://archives.postgresql.org/pgsql-hackers/2007-11/msg00511.php and the follow-up replies. The syntax for text indexing with multiple columns is also clumsy and needs improvement (an example follows below). Further, when using PostgreSQL we found that an index was used only if the set of columns queried exactly matched the index. Check whether this is still the case in 9.1, and if the limitation remains, add the following feature if you have time: if the index contains a superset of the columns of the query, it can still be used, with an extra round of filtering.
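
      To see the clumsiness concretely, here is roughly what multi-column text search looks like today (the docs table with title and body columns is hypothetical). Note that the expression in the WHERE clause must match the index expression exactly for the index to be used, which is the limitation described above:

        CREATE INDEX docs_fts ON docs USING gin
          (to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,'')));

        -- The query must repeat the whole indexed expression verbatim.
        SELECT * FROM docs
          WHERE to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,''))
                @@ to_tsquery('english', 'database & index');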
    3. B+-tree file organization. Doing this right is actually non-trivial, but here is a way to cheat: add syntax to define a table as organized as a B+-tree clustered on specified attributes, but implement it not by actually creating a B+-tree file organization, but by creating an index whose key contains those attributes followed by other specified attributes of the table (a sketch follows below). Then make sure that queries that access only the attributes in the index, and can benefit from the clustering, use index-only plans. Index-only plans were not available in earlier versions of PostgreSQL, but are slated to be added in version 9.2 (see http://wiki.postgresql.org/wiki/Index-only_scans). (The problem with using index-only scans is that certain tuples may not be visible to certain transactions due to snapshot isolation, and the check for visibility of a tuple requires accessing the tuple.) You should be able to download a development copy of PostgreSQL 9.2 to get this functionality, since 9.2 is not yet a stable release.
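
      A sketch of the cheat, assuming a hypothetical takes(ID, course_id, year, grade) table clustered on (course_id, year); the ORGANIZED AS syntax is made up, the kind of thing your parser extension might accept:

        -- Hypothetical syntax the project could add:
        --   CREATE TABLE takes (...) ORGANIZED AS BTREE (course_id, year);
        -- Emulation: a covering index whose key starts with the clustering
        -- attributes, followed by the remaining attributes of the table.
        CREATE INDEX takes_btree ON takes (course_id, year, ID, grade);
        -- With index-only scans (9.2), this query should be answerable from
        -- the index alone, in clustering order:
        EXPLAIN SELECT course_id, year, ID FROM takes WHERE course_id = 'CS101';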
    4. Monitor row and cost estimates. Although statistics such as the number of rows in an intermediate result are reasonably accurate for simple queries, they can be quite wrong for more complex queries. Your goal is to modify the execution engine to keep track of actual row counts and report large deviations from the estimates (see the example below). A related problem is that cost estimation assumes nothing is in the buffer. Instrument the execution engine to gather, for each query, statistics on how often the query found required pages already in the buffer. Store this information in a database table, for potential future use by a query optimizer. For tips on how to do this, see http://archives.postgresql.org/pgsql-hackers/2011-06/msg01140.php.
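
      For example, the gap between estimated and actual row counts can already be seen manually with EXPLAIN ANALYZE; the project would, in effect, do this bookkeeping automatically inside the executor (the takes and student tables here are hypothetical):

        EXPLAIN ANALYZE
          SELECT * FROM takes NATURAL JOIN student WHERE grade = 'A';
        -- Each plan node prints the optimizer's "rows=<estimate>" next to the
        -- executor's "actual ... rows=<count>"; a large ratio between the two
        -- is the kind of deviation your instrumentation should log.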
    5. Optimize a special case of join. For a keyword-querying paper, we had to generate queries of the form SELECT * FROM R, S WHERE R.A = S.A OR R.A IS NULL OR S.A IS NULL. Such a query can be broken up and executed as SELECT * FROM R, S WHERE R.A = S.A UNION SELECT * FROM R, S WHERE R.A IS NULL UNION SELECT * FROM R, S WHERE S.A IS NULL, but this requires multiple scans of the relations (a test harness for the rewrite is sketched below). Instead, modify one of the join operators (say, merge join) to handle such a join condition directly and efficiently. You may assume that the tuples with null values will fit in memory.
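
      To check that a modified join preserves the original semantics, a tiny test harness like the following may help; note that UNION removes duplicates, so multiplicities can differ from the original OR form when the branches overlap:

        CREATE TABLE r (a int);
        CREATE TABLE s (a int);
        INSERT INTO r VALUES (1), (2), (NULL);
        INSERT INTO s VALUES (2), (NULL);

        -- Original form: a single join with a disjunctive condition.
        SELECT * FROM r, s WHERE r.a = s.a OR r.a IS NULL OR s.a IS NULL;

        -- Rewritten form: three scans of each relation.
        SELECT * FROM r, s WHERE r.a = s.a
        UNION
        SELECT * FROM r, s WHERE r.a IS NULL
        UNION
        SELECT * FROM r, s WHERE s.a IS NULL;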
    6. PostgreSQL query optimization for Flash. PostgreSQL has a single parameter that adjusts the relative cost of sequential and random I/O. With the increasing use of flash, it makes sense to set this to 1, since random and sequential I/O costs are roughly the same on flash. Your task is to automate this process by checking what disks each tablespace resides on, and setting the parameter to 1 for each tablespace on a flash disk (a one-line example follows below). See http://www.postgresql.org/docs/current/static/runtime-config-query.html#GUC-RANDOM-PAGE-COST for more information about setting this parameter using ALTER TABLESPACE. How do you find out whether a disk is a flash disk? Figure this out for yourself; any reasonable way, such as actually performing random I/O and timing it, or asking the OS, is acceptable, but for starters you may assume that the list of flash devices is provided to you. Also try to find some queries where this change results in a change of plan (I can give you access to a flash disk to test actual costs).
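
      The per-tablespace setting itself is a one-liner (the tablespace name fast_ssd is hypothetical):

        -- Make a random page fetch cost the same as a sequential one
        -- for everything stored in this tablespace.
        ALTER TABLESPACE fast_ssd SET (random_page_cost = 1.0);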
    7. (Hard Project) Row-level security. See the post http://archives.postgresql.org/pgsql-hackers/2009-10/msg01346.php (if you are interested, we have a paper on this topic from SIGMOD 2006, which you can find here, along with a talk here). A view-based illustration of the problem follows below.
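
      As an illustration of what is being asked for, here is the classic view-based workaround that row-level security aims to replace (table and column names are hypothetical); a real row-level security mechanism would enforce such predicates inside the engine on every access path, rather than through a view:

        CREATE TABLE messages (owner text, body text);
        CREATE VIEW my_messages AS
          SELECT * FROM messages WHERE owner = current_user;
        -- Users query the view, never the base table.
        GRANT SELECT ON my_messages TO PUBLIC;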