Project Ideas
- Project suggestions from CS 631 offering in past semester:
Autumn 2012 ,
Spring 2012 , and
Autumn 2010 .
- The PostgreSQL TODO list, available at
http://wiki.postgresql.org/wiki/Todo
is a great source of ideas for PostgreSQL extensions. It is a very detailed list however,
and it will take time to wade through it in detail, plus many of the ideas there are not suited
for a course project. So here are a few ideas gleaned from that list which I think
may be doable as a course project:
This list is being updated as of Monday Sep 16, 2013, visit again to find new suggestions
- Temporal data support:
(new)
PostgreSQL 9.2 supports a new range type, and allows
queries to check for range overlap (using &&) and evem allows primary key/foreign key constraints to
be extended to check for temporal (non)overlap.
Using this new feature, add support for syntax to declare relations as temporal, upon which
you will add a new valid-time attribute for storing time range.
Also perform query rewriting so that joins between
two temporal relations are turned into joins with the ranges taken into account.
If time permits extend to other operations such as projection (where you may have to
merge ranges) and group by (exercise: figure out how to handle semantics of group by with time).
For more on temporal data, read the relevant section of the Database Concepts book,
and then explore more on the net.
- Optimize or-is-null join
(updated from last semesters version)
For a keyword querying paper, we had to generate queries of the form
SELECT * from R, S WHERE R.A=S.A or R.A is null or S.A is null.
Such a query can be broken up and executed as
SELECT * from R, S WHERE R.A=S.A UNION
SELECT * from R, S WHERE R.A is null UNION
SELECT * from R, S WHERE S.A is null.
but this requires multiple scans of the relations. Instead, modify the hash-join operator
to directly handle such a join condition efficiently.
This project was partly implemented as part of an MTech project last year. Extend the existing
code to better handle multiple attributes, and remove some inefficiencies.
An external research scholar, Pankaj Vanwari (pankajvanwari@gmail.com) can help with some of the details.
This project was done earlier by grouping tuples into 2^N groups
(where N is the number of attributes).
A tuple that belongs to group 101 has attribute B as null whereas A and C are not null.
Joining tuples is as per its group, say tuples of group 101 joins
with tuples of group 110 on A (Since 101 “AND ” 110 = 100).
It is expected that the project builds only the required indices on all groups
(by possibly doing a sequential scans on both relations to first find all groups) and then join a tuple with all relevant groups of the other relation using
appropriate indices.
If number of tuples in a group are more then we may need to partition the
group into pieces that fit into memory; this extension can be done
if time permits, for now you can assume that each groups fit in memory.
- Row level security
(changed from last semester)
PostgresQL 9.2 has added a new feature called security barrier and leakproof; see
http://archives.postgresql.org/pgsql-hackers/2009-10/msg01346.php
section on Security barriers and Leakproof.
This solves a problem reported earlier in the post:
http://archives.postgresql.org/pgsql-hackers/2009-10/msg01346.php
The goal of this project is to implement some subset of the syntactic constructs we have
defined in an ICDE 2007 paper
http://www.cse.iitb.ac.in/~sudarsha/Pubs-dir/FGA-model-ICDE07.pdf
for grants with predicates, implemented by with query rewriting using views. Use the
new security_barrier feature when defining the views.
(Side note: We have a paper on this topic in SIGMOD 2006, which you can find
here, along with a talk
here
if you are interested.)
- Division operation
(new)
Implement the relational algebra division operator using sorting.
Also add syntactic support to SQL to allow users to specify the division operator
(think of it as a variation of the outer join operation, and follow the code
for outerjoin to figure out how to modify the parser and planner so that the
operation is added to the execution plan).
You may benefit from reading the code for the extended join operator (or-is-null, above)
from last year, which had to do similar modifications.
- Table partitioning A
(new)
Many database support table partitioning, where the rows are partitioned based on some
condition, such as year, and stored separately. A query that specifies a year
need only access one partition, and not others, and can thus run much more efficiently.
Indices are created separately on each partition. Such partitioning is particularly
useful for enterprise information systems, where most queries access only relatively
recent data (typically the past year or two at most). (Queries that span partitions
of course have to access all partitions, and thus pay a price.)
Some databases support specific syntax for declaring a table as partitioned
(see Oracle or SQL Server documentation for syntax).
PostgreSQL does not have direct syntax for this, but provides a way to implement
it using table inheritance. See
http://www.postgresql.org/docs/9.2/static/ddl-partitioning.html
for details. Queries that need to access only one partition (e.g. on year) get
rewritten to run on only that subtable, not the entire table.
The goal of this project is to provide syntactic support for partitioning along the lines
of Oracle or SQL Server, and to automatically create the required subtables and constraints,
and triggers to retarget inserts to the correct subtable. See the above web page for
details. You should also create new "system" relations to track the metadata
for partitioned tables, so one can check if a table is partitioned, and on
what conditions, and what are the subtables.
Also, when indices are created on the main table, automatically create the
same index on subtables (to do this, modify the index creation code to check
if the table is partitioned, and if so create indices on the subtables
instead).
- Table partitioning B: partitioning for parallelism
New for this year, the goal is to extend table partitioning to handle multiple tables which are related,
in the context of a parallel database.
Partitioning is specified for one of the tables, lets call it the master table. Each partition is
to be placed on a different server.
Other tables that are linked to this table by foreign keys referencing the master table
then get partitioned automatically, based on the partitioning of the master table.
E.g. if the master table is user, a table photos which is linked to the user by a foreign key
will get partitioned with the user.
Tables referenced by the master table get logically replicated to all partitions.
For example, a table user_category referenced by user can be assumed to be replicated at all nodes.
The goal of this project could be any one of the following:
- to perform query rewriting, so a query only accesses partitions that it needs to access. You should be able
to handle join queries.
- to implement the partitioning by sending insert/update/read requests to an appropriate copy of PostgreSQL.
No need to do any query rewriting, you can restrict attention to single table selection queries, but you should
have at least two copies of PostgreSQL working together, sending requests to each other.