Project Ideas
Project suggestions from CS 631 offerings in past years:
Autumn 2013,
Autumn 2012,
Spring 2012, and
Autumn 2010.
The PostgreSQL TODO list, available at
http://wiki.postgresql.org/wiki/Todo
is a great source of ideas for PostgreSQL extensions. It is a very detailed list, however,
and it will take time to wade through it; moreover, many of the ideas there are not suited
to a course project. So here are a few ideas gleaned from that list which I think
may be doable as a course project:
To learn more about PostgreSQL, visit
http://www.postgresql.org/docs/current/static/index.html;
this site has a tremendous amount of detail, so you cannot read it all, but you should read the parts relevant to your
project.
This list is being updated as of Sat Aug 16, 2014; visit again to find new suggestions.
- Cube/Rollup/Grouping Sets extensions to group by (new)
See
https://wiki.postgresql.org/wiki/Grouping_Sets for more information.
To understand cube/rollup operators, read Chapter 5 of the course textbook Database System Concepts 6th Ed.
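As a rough illustration (the table sales(item, color, qty) is made up for this example), the target syntax and one possible naive rewriting into plain group-by queries are:
SELECT item, color, SUM(qty) FROM sales GROUP BY ROLLUP (item, color);
-- ROLLUP (item, color) is shorthand for the grouping sets (item, color), (item), (),
-- so a simple first implementation could evaluate it as:
SELECT item, color, SUM(qty) FROM sales GROUP BY item, color
UNION ALL
SELECT item, NULL, SUM(qty) FROM sales GROUP BY item
UNION ALL
SELECT NULL, NULL, SUM(qty) FROM sales;
A single-pass implementation inside the executor that avoids rescanning the input would of course be better, but the rewriting above is a reasonable first milestone.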
- Query Progress Estimation (new)
When a query takes a long time, it is important to know how much progress it has made.
There has been some work on estimating query progress. Some of that work is quite non-trivial to implement, but you can
implement at least some very simple cases, e.g. when a plan is purely pipelined, and there are no blocking operations.
See
Estimating progress of execution for SQL queries and
http://wiki.postgresql.org/wiki/Query_progress_indication
for more information.
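As a very rough sketch of the simple pipelined case: if the plan has no blocking operators, a reasonable first-cut estimate is
progress ≈ (tuples already returned by the leaf scans) / (optimizer's estimated total number of leaf tuples),
which you could expose, for example, through a system view or a log message; the work cited above refines this considerably.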
- Operating Directly on Text Files (new)
PostgreSQL, like other databases, requires data to be loaded into the database before it can be queried.
There are many data sources with textual data such as log files which can be parsed and loaded into the database,
but this increases the storage cost greatly, in addition to requiring significant load time.
Your task here is to allow PostgreSQL to define a view/relation that does not store data, but instead
reads data from a file (say a comma-separated values (CSV) file) when required.
For more details, see
- NoDB in Action: Adaptive Query Processing on Raw Data. Ioannis Alagiannis, Renata Borovica, Miguel Branco, Stratos Idreos, and Anastasia Ailamaki. VLDB 2013; and
- I. Alagiannis, R. Borovica, M. Branco, S. Idreos, and A. Ailamaki. NoDB: Efficient Query Execution on Raw Data Files. In SIGMOD, pages 241-252, 2012.
These papers describe a lot of optimizations; you do not have to implement all or even most of them. A basic working
engine with a few simple optimizations, or even no optimizations, is what I expect out of a course project.
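For a feel of the target behaviour, note that PostgreSQL already ships a minimal version of this idea as the file_fdw contrib module (the file name and columns below are made up for the example):
CREATE EXTENSION file_fdw;
CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE weblog (ts timestamp, url text, status int)
    SERVER csv_files
    OPTIONS (filename '/tmp/weblog.csv', format 'csv');
SELECT url, count(*) FROM weblog GROUP BY url;  -- parses the file at query time
Your project can start from something of this flavour and then add the kinds of optimizations (positional maps, caching of parsed values) that the papers describe.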
- In-memory Column Store (new)
Columnar representations have become very popular for in-memory data due to the opportunities for compression that they
offer. An in-memory columnar representation has already been implemented as a patch to PostgreSQL, and you can find more information at
http://www.pgcon.org/2014/schedule/attachments/322_IMCS.pdf.
But you can still implement a subset of the features described in the above presentation on your own.
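As a toy illustration of the layout difference (the schema below is invented for this example and is not the API from the presentation above), a column of a row table can be mirrored into chunked arrays, so that a scan of one column touches only that column's data:
CREATE TABLE trades_row (trade_id int, price numeric, volume int);
CREATE TABLE trades_price_col (chunk_no int PRIMARY KEY, prices numeric[]);
-- an aggregate over price then reads only the price chunks:
SELECT avg(p) FROM (SELECT unnest(prices) AS p FROM trades_price_col) t;
A real implementation would keep such chunks in memory, compress them, and wire them into the executor rather than simulating them with arrays.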
- Temporal data support:
PostgreSQL 9.2 supports new range types, and allows
queries to check for range overlap (using &&); it even allows primary key/foreign key constraints to
be extended to check for temporal (non-)overlap.
Using this new feature, add support for syntax to declare relations as temporal, upon which
you will add a new valid-time attribute for storing the time range.
Also perform query rewriting so that joins between
two temporal relations are turned into joins with the ranges taken into account.
If time permits extend to other operations such as projection (where you may have to
merge ranges) and group by (exercise: figure out how to handle semantics of group by with time).
For more on temporal data, read the relevant section of the Database Concepts book,
and then explore more on the net.
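A minimal sketch of what the rewritten queries should look like, using the 9.2 range types (the schema is made up for this example; btree_gist is needed to mix = and && in one exclusion constraint):
CREATE EXTENSION btree_gist;
CREATE TABLE emp_dept (
    emp_id     int,
    dept_id    int,
    valid_time tsrange,
    EXCLUDE USING gist (emp_id WITH =, valid_time WITH &&)  -- temporal "primary key"
);
CREATE TABLE dept_budget (
    dept_id    int,
    budget     numeric,
    valid_time tsrange
);
-- a temporal join must keep only overlapping periods and intersect them:
SELECT e.emp_id, d.budget, e.valid_time * d.valid_time AS valid_time
FROM emp_dept e, dept_budget d
WHERE e.dept_id = d.dept_id
  AND e.valid_time && d.valid_time;
The project is to generate such rewritings automatically from an ordinary-looking join over relations declared as temporal.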
- Optimize or-is-null join
(old, done by several teams, but you can still take code from earlier teams and add new features)
For a keyword querying paper, we had to generate queries of the form
SELECT * from R, S WHERE R.A=S.A or R.A is null or S.A is null.
Such a query can be broken up and executed as
SELECT * from R, S WHERE R.A=S.A UNION
SELECT * from R, S WHERE R.A is null UNION
SELECT * from R, S WHERE S.A is null.
but this requires multiple scans of the relations. Instead, modify the hash-join operator
to directly handle such a join condition efficiently.
This project was partly implemented as part of an MTech project last year. Extend the existing
code to better handle multiple attributes, and remove some inefficiencies.
An external research scholar, Pankaj Vanwari (pankajvanwari@gmail.com), can help with some of the details.
This project was done earlier by grouping tuples into 2^N groups
(where N is the number of attributes).
A tuple that belongs to group 101 has attribute B null, whereas A and C are not null.
Tuples are joined according to their groups; for example, tuples of group 101 join
with tuples of group 110 on A (since 101 AND 110 = 100).
It is expected that the project builds only the required indices on all groups
(possibly by doing a sequential scan on both relations to first find all groups) and then joins each tuple with all relevant groups of the other relation using
appropriate indices.
If the number of tuples in a group is too large, we may need to partition the
group into pieces that fit in memory; this extension can be done
if time permits. For now you can assume that each group fits in memory.
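A rough SQL illustration of the grouping scheme, for N = 3 join attributes on a hypothetical relation R(A, B, C) (inside PostgreSQL you would compute this directly in the hash-join code rather than via SQL):
SELECT *,
       (CASE WHEN A IS NOT NULL THEN 4 ELSE 0 END)
     + (CASE WHEN B IS NOT NULL THEN 2 ELSE 0 END)
     + (CASE WHEN C IS NOT NULL THEN 1 ELSE 0 END) AS null_group
FROM R;
-- a tuple of R in group 101 (= 5) joins with a group g of S on exactly the
-- attributes whose bits are set in (101 AND g).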
- Row level security (was there last year, but no one did it so it counts as new)
PostgreSQL 9.2 has added new features called security barrier and leakproof; see the
section on security-barrier views and leakproof functions in the PostgreSQL documentation.
These solve a problem reported earlier in the post:
http://archives.postgresql.org/pgsql-hackers/2009-10/msg01346.php
The goal of this project is to implement some subset of the syntactic constructs we have
defined in an ICDE 2007 paper
http://www.cse.iitb.ac.in/~sudarsha/Pubs-dir/FGA-model-ICDE07.pdf
for grants with predicates, implemented via query rewriting using views. Use the
new security_barrier feature when defining the views.
(Side note: We have a paper on this topic in SIGMOD 2006, which you can find
here, along with a talk
here
if you are interested.)
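A minimal example of the kind of view your rewriting should produce (the table and policy are made up for the example):
CREATE TABLE orders (order_id int, cust_name text, amount numeric);
CREATE VIEW my_orders WITH (security_barrier = true) AS
    SELECT * FROM orders WHERE cust_name = current_user;
REVOKE ALL ON orders FROM PUBLIC;
GRANT SELECT ON my_orders TO PUBLIC;
The security_barrier option prevents possibly leaky user-defined functions from being pushed below the view's filter condition, which is exactly the problem discussed in the mailing-list post above.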
- Division operation: Deleted
This has been done by many groups over the past few years,
so I don't want yet another implementation.
- Table partitioning: Deleted
This has been done by many groups over the past few years,
so I don't want yet another implementation.
- Table partitioning for parallelism
Table partitioning allows a table to be split into multiple pieces, which can be
stored as separate relations on a single machine (last year's project, now deleted since
too many people did it), or on different nodes in a parallel system.
For details on table partitioning in PostgreSQL, see
https://wiki.postgresql.org/wiki/Table_partitioning.
One goal is to extend table partitioning to handle multiple tables which are related,
in the context of a parallel database (a sketch of the intended setup appears at the end of this section).
Partitioning is specified for one of the tables; let's call it the master table. Each partition is
to be placed on a different server.
Other tables that are linked to this table by foreign keys referencing the master table
then get partitioned automatically, based on the partitioning of the master table.
E.g. if the master table is user, a table photos which is linked to the user by a foreign key
will get partitioned with the user.
Tables referenced by the master table get logically replicated to all partitions.
For example, a table user_category referenced by user can be assumed to be replicated at all nodes.
The second goal of this project could be any one of the following:
- to perform query rewriting, so a query only accesses partitions that it needs to access. You should be able
to handle join queries.
- to implement the partitioning by sending insert/update/read requests to an appropriate copy of PostgreSQL.
No need to do any query rewriting; you can restrict attention to single-table selection queries, but you should
have at least two copies of PostgreSQL working together, sending requests to each other.
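Finally, here is a rough single-node sketch of the master/dependent partitioning idea, using the inheritance-based partitioning PostgreSQL already supports (the schema and the modulo-2 partitioning rule are made up; in the actual project the partitions would live on different PostgreSQL instances):
CREATE TABLE users  (user_id int, name text);
CREATE TABLE photos (photo_id int, user_id int, caption text);  -- user_id references users
-- master table 'users' is partitioned on user_id; 'photos' follows it because
-- it is linked to users through user_id:
CREATE TABLE users_p0  (CHECK (user_id % 2 = 0)) INHERITS (users);
CREATE TABLE users_p1  (CHECK (user_id % 2 = 1)) INHERITS (users);
CREATE TABLE photos_p0 (CHECK (user_id % 2 = 0)) INHERITS (photos);
CREATE TABLE photos_p1 (CHECK (user_id % 2 = 1)) INHERITS (photos);
-- rewriting goal: this query should touch only users_p0 and photos_p0, since 42 % 2 = 0:
SELECT u.name, p.caption
FROM users u JOIN photos p ON u.user_id = p.user_id
WHERE u.user_id = 42;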