Project Ideas
- Project suggestions from the CS 631 offering in Spring 2012 are here.
- Some project suggestions from the CS 631 offering in Autumn 2010 are here.
- The PostgreSQL TODO list, available at
http://wiki.postgresql.org/wiki/Todo
is a great source of ideas for PostgreSQL extensions. It is a very detailed list, however, and it takes time to wade through; moreover, many of the ideas there are not suited to a course project. So here are a few ideas gleaned from that list which I think may be doable as a course project.
This list is being updated; visit again to find new suggestions.
- Temporal data support:
(new)
PostgreSQL 9.2 supports a new range type, allows queries to check for range overlap (using &&), and even allows primary key/foreign key constraints to be extended to check for temporal (non-)overlap.
Using this new feature, add support for syntax to declare relations as temporal; for such relations you will add a new valid-time attribute that stores a time range.
Also perform query rewriting so that joins between two temporal relations are turned into joins that take the ranges into account.
If time permits, extend this to other operations such as projection (where you may have to merge ranges) and group by (exercise: figure out how to handle the semantics of group by with time).
For more on temporal data, read the relevant section of the Database Concepts book,
and then explore more on the net.
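As a minimal sketch of the rewriting target, assuming hypothetical tables emp and emp_dept that each carry a tsrange valid-time column, a join between the two temporal relations could be rewritten to add an overlap condition and to intersect the ranges:

```sql
-- Hypothetical schema: "valid" is the valid-time attribute the project adds.
CREATE TABLE emp      (id int, salary int, valid tsrange);
CREATE TABLE emp_dept (id int, dept text,  valid tsrange);

-- The rewritten join requires overlapping valid times (&&) and outputs the
-- intersection (*) of the two ranges as the valid time of the result:
SELECT e.id, e.salary, d.dept, e.valid * d.valid AS valid
FROM emp e
JOIN emp_dept d ON e.id = d.id AND e.valid && d.valid;
```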
- Index tuning wizard Part A
(new)
The goal is to find which indices will best benefit a given workload (set) of queries.
Your program will take as input a set of queries/updates, and a list of potential indices,
and for each of the indices, find out its benefit. To do so, you will create a fake
index, by modifying the index creation code to add the index to the catalog, but not
actually create it. Then find the optimal plan for each query by invoking the optimizer.
The modification to the database itself is to add support (starting from the syntax)
for creating and dropping fake indices. Then, external to the database, you invoke
EXPLAIN to find the estimated cost of each query before and after adding each fake index;
the difference gives the benefit of the index for the query (which can be negative for updates).
Add up the benefits across all queries to find the overall benefit of the index.
Choose the index that gives maximum benefit, and if the benefit is positive,
the index is created (in fake mode). This process can be repeated to choose
further indices. At the end drop all the fake indices, and output
the benefits found for each index when it was chosen, and output a script
to create the actual indices.
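The driver loop might look as follows; note that the FAKE keyword is hypothetical syntax that this project would add, not something PostgreSQL supports:

```sql
-- Hypothetical syntax added by this project: the index is entered into the
-- catalog but never actually built, so the optimizer will consider it.
CREATE FAKE INDEX r_a_idx ON r(a);

-- Estimated cost with the fake index present, from the ordinary optimizer;
-- compare against the cost obtained before creating the fake index.
EXPLAIN SELECT * FROM r WHERE a = 42;

-- Once the winning indices are chosen, drop the fakes and emit a script of
-- real CREATE INDEX statements for them.
DROP FAKE INDEX r_a_idx;
```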
- Index tuning wizard Part B
(new)
This part goes over the queries in a workload and finds candidate indices,
i.e., indices that have the potential to benefit one or more queries in the workload.
Candidate indices are on attributes with selection predicates, and on
attributes that participate in join conditions. If selections/joins involve
multiple attributes, candidate indices can correspondingly be on multiple attributes.
Your code should also work with subqueries, and correlation conditions in subqueries.
Add intelligence to eliminate indices where the statistics show that there are too
many duplicate values, so the index is not likely to be beneficial.
Also add intelligence to merge indices: e.g., if query 1 needs an index on A,
and query 2 needs one on {A,B} (i.e., order is not relevant), then a single index
on (A,B) will suffice (if query 2 needs an index on (B,A) in that order, then
we cannot merge the indices).
You will have to deal primarily with the parser and the statistics module.
If time permits, also add indices that can support index-only plans.
E.g. given an index on attributes (A,B) a query SELECT B FROM r WHERE A = 4
can run without accessing the relation, using only the index.
Thus you should consider the index (A,B) for this query even though B is not
used in a selection or join.
Index-only plans were not available earlier in PostgreSQL, but have been added in PostgreSQL version 9.2
(see http://wiki.postgresql.org/wiki/Index-only_scans).
The problem with using index-only scans is that certain tuples may not be visible to certain transactions
due to snapshot isolation, and the check for visibility of a tuple requires accessing the tuple.
PostgreSQL 9.2, which was released on Sep 10, 2012, includes some clever tricks to make
index-only scans work, by keeping track of which pages contain only tuples that are visible to all transactions.
PostgreSQL 9.2 can be downloaded from the PostgreSQL web site.
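To see the effect, a short sketch (assuming a hypothetical table r(a, b)):

```sql
CREATE INDEX r_ab_idx ON r(a, b);

-- VACUUM updates the visibility map, which is what lets PostgreSQL 9.2
-- skip the heap visibility check for all-visible pages:
VACUUM r;

-- Both attributes are covered by the index, so the plan may show an
-- "Index Only Scan" that never touches the relation itself:
EXPLAIN SELECT b FROM r WHERE a = 4;
```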
- Optimize or-is-null join
(updated from last semester's version)
For a keyword querying paper, we had to generate queries of the form
SELECT * from R, S WHERE R.A=S.A or R.A is null or S.A is null.
Such a query can be broken up and executed as
SELECT * from R, S WHERE R.A=S.A UNION
SELECT * from R, S WHERE R.A is null UNION
SELECT * from R, S WHERE S.A is null.
but this requires multiple scans of the relations. Instead, modify one of the join operators
(say, merge join) to directly handle such a join condition efficiently.
You may assume that the tuples with null values will fit in memory.
This project was done by two groups last year, for single attributes. Extend their code
to handle multiple attributes, where some subset may be null.
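For the multi-attribute extension, one plausible form of the target condition (assuming join attributes A and B) applies the or-is-null pattern per attribute, so that any subset of the attributes may be null:

```sql
-- Sketch of a two-attribute or-is-null join; the modified join operator
-- should evaluate this in a single pass over R and S:
SELECT *
FROM R, S
WHERE (R.A = S.A OR R.A IS NULL OR S.A IS NULL)
  AND (R.B = S.B OR R.B IS NULL OR S.B IS NULL);
```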
- Row level security
(changed from last semester)
PostgreSQL 9.2 has added new features called security barriers and leakproof functions; see
http://archives.postgresql.org/pgsql-hackers/2009-10/msg01346.php
(section on security barriers and leakproof).
This solves a problem reported earlier in the post:
http://archives.postgresql.org/pgsql-hackers/2009-10/msg01346.php
The goal of this project is to implement some subset of the syntactic constructs we have
defined in an ICDE 2007 paper
http://www.cse.iitb.ac.in/~sudarsha/Pubs-dir/FGA-model-ICDE07.pdf
for grants with predicates, implemented via query rewriting using views. Use the
new security_barrier feature when defining the views.
(Side note: We have a paper on this topic in SIGMOD 2006, which you can find here, along with a talk here, if you are interested.)
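As a minimal sketch of the rewriting target, assuming a hypothetical accounts table where each user may see only their own rows, a predicate grant could be implemented as a security-barrier view:

```sql
-- The security_barrier option (new in PostgreSQL 9.2) prevents the planner
-- from pushing possibly-leaky user functions below the view's predicate:
CREATE VIEW my_accounts WITH (security_barrier) AS
    SELECT * FROM accounts
    WHERE owner = current_user;

-- Grant access to the view rather than the base table:
REVOKE ALL ON accounts FROM PUBLIC;
GRANT SELECT ON my_accounts TO PUBLIC;
```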
- Division operation
(new)
Implement the relational algebra division operator using sorting.
Also add syntactic support to SQL to allow users to specify the division operation
(think of it as a variation of the outer join operation, and follow the code
for outer join to figure out how to modify the parser and planner so that the
operation is added to the execution plan).
You may benefit from reading the code for the extended join operator (or-is-null, above)
from last year, which had to do similar modifications.
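For reference, the result your operator should compute can be specified in plain SQL (assuming relations r(a, b) and s(b)); the project is to produce the same result with a dedicated sort-based operator rather than this nested formulation:

```sql
-- r(a, b) divided by s(b): the a values paired in r with *every* b in s.
SELECT DISTINCT r1.a
FROM r r1
WHERE NOT EXISTS (
    SELECT b FROM s
    EXCEPT
    SELECT r2.b FROM r r2 WHERE r2.a = r1.a
);
```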
- Table partitioning
(new)
Many databases support table partitioning, where the rows are partitioned based on some
condition, such as year, and stored separately. A query that specifies a year
need only access one partition, and not others, and can thus run much more efficiently.
Indices are created separately on each partition. Such partitioning is particularly
useful for enterprise information systems, where most queries access only relatively
recent data (typically the past year or two at most). (Queries that span partitions
of course have to access all partitions, and thus pay a price.)
Some databases support specific syntax for declaring a table as partitioned
(see Oracle or SQL Server documentation for syntax).
PostgreSQL does not have direct syntax for this, but provides a way to implement
it using table inheritance. See
http://www.postgresql.org/docs/9.2/static/ddl-partitioning.html
for details. Queries that need to access only one partition (e.g. on year) get
rewritten to run on only that subtable, not the entire table.
The goal of this project is to provide syntactic support for partitioning along the lines
of Oracle or SQL Server, and to automatically create the required subtables and constraints,
and triggers to retarget inserts to the correct subtable. See the above web page for
details. You should also create new "system" relations to track the metadata
for partitioned tables, so one can check whether a table is partitioned, on
what conditions, and with which subtables.
Also, when indices are created on the main table, automatically create the
same index on subtables (to do this, modify the index creation code to check
if the table is partitioned, and if so create indices on the subtables
instead).
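Following the inheritance scheme described on the page above, the subtables and trigger your syntax would generate might look like this (table and partition names are illustrative):

```sql
CREATE TABLE measurement (city_id int, logdate date, peaktemp int);

-- One subtable per partition; the CHECK constraint lets the planner skip
-- partitions that cannot match a query's predicate:
CREATE TABLE measurement_y2012 (
    CHECK (logdate >= DATE '2012-01-01' AND logdate < DATE '2013-01-01')
) INHERITS (measurement);

-- A trigger retargets inserts on the parent to the correct subtable:
CREATE FUNCTION measurement_insert() RETURNS trigger AS $$
BEGIN
    IF NEW.logdate >= DATE '2012-01-01' AND NEW.logdate < DATE '2013-01-01' THEN
        INSERT INTO measurement_y2012 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'no partition for date %', NEW.logdate;
    END IF;
    RETURN NULL;  -- suppress the insert into the parent table
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER measurement_insert_trigger
    BEFORE INSERT ON measurement
    FOR EACH ROW EXECUTE PROCEDURE measurement_insert();
```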
- Monitor row and cost estimates
(dropped, since this is already available using the EXPLAIN (ANALYZE, BUFFERS) command,
but do try it out for fun)