Title: Storage and Query Processing Methods for Dataset Versioning
Mr. Amit Chavan, University of Maryland, College Park
Date & Time: January 5, 2018 11:00
Venue: Conference Room, 1st Floor, C Block, Department of Computer Science and Engineering, Kanwal Rekhi (KReSIT) Building
Data-driven methods and products are becoming increasingly common in a variety of communities, leading to a huge diversity of datasets being continuously generated, modified, and analyzed. An increasingly important consideration for the underlying data management systems is that all of these datasets, and their versions over time, need to be stored and queried for a variety of reasons, including auditing, provenance, transparency, accountability, introspective analysis, and backups. To this end, we present DEX, a novel stand-alone delta-oriented storage and query processing engine, whose goals are to store thousands of dataset versions (snapshots) efficiently while enabling rich query processing across them. DEX uses delta encoding to store the large number of datasets compactly by exploiting redundancies across them, while keeping the average cost of reconstructing any dataset low. This talk will introduce: (i) a framework to reason about the storage-recreation tradeoff under delta encoding, i.e., intuitively, the more storage we use, the faster it is to recreate or retrieve versions, and the less storage we use, the slower it is; and (ii) how DEX takes advantage of the already computed deltas between datasets to efficiently process a class of set-based queries across versions, namely intersection, union, and t-threshold.
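The storage-recreation tradeoff and the reuse of deltas for set queries can be illustrated with a small sketch. It is not DEX's implementation. It assumes that each version is a set of integer record ids, that deltas are plain (added, removed) pairs against a parent version, and that record counts stand in for storage and recreation cost. All names (Delta, VersionStore, chain_intersection, and so on) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Delta:
    """Records added to / removed from a parent version (illustrative format)."""
    added: FrozenSet[int]
    removed: FrozenSet[int]

    def size(self) -> int:
        return len(self.added) + len(self.removed)


class VersionStore:
    """Toy store: each version is either fully materialized or a delta on a parent."""

    def __init__(self) -> None:
        self.full: Dict[str, FrozenSet[int]] = {}
        self.deltas: Dict[str, Tuple[str, Delta]] = {}

    def add_full(self, vid: str, records: Iterable[int]) -> None:
        self.full[vid] = frozenset(records)

    def add_delta(self, vid: str, parent: str, records: Iterable[int]) -> None:
        new, old = frozenset(records), self.checkout(parent)
        self.deltas[vid] = (parent, Delta(new - old, old - new))

    def _recreate(self, vid: str) -> Tuple[FrozenSet[int], int]:
        # Cost proxy: number of records read along the chain back to a full copy.
        if vid in self.full:
            data = self.full[vid]
            return data, len(data)
        parent, delta = self.deltas[vid]
        base, cost = self._recreate(parent)
        return (base - delta.removed) | delta.added, cost + delta.size()

    def checkout(self, vid: str) -> FrozenSet[int]:
        return self._recreate(vid)[0]

    def recreation_cost(self, vid: str) -> int:
        return self._recreate(vid)[1]

    def storage_cost(self) -> int:
        return (sum(len(v) for v in self.full.values())
                + sum(d.size() for _, d in self.deltas.values()))

    def _chain_to_root(self, vid: str) -> Tuple[str, List[Delta]]:
        deltas: List[Delta] = []
        while vid in self.deltas:
            vid, delta = self.deltas[vid]
            deltas.append(delta)
        return vid, deltas

    def chain_intersection(self, vid: str) -> FrozenSet[int]:
        # Records present in every version from the root to vid: anything in the
        # root that no delta on the chain ever removed. No version is recreated.
        root, deltas = self._chain_to_root(vid)
        return self.full[root] - frozenset().union(*(d.removed for d in deltas))

    def chain_union(self, vid: str) -> FrozenSet[int]:
        # Records present in at least one version on the chain.
        root, deltas = self._chain_to_root(vid)
        return self.full[root] | frozenset().union(*(d.added for d in deltas))


versions = {"v0": {1, 2, 3, 4}, "v1": {2, 3, 4, 5}, "v2": {3, 4, 5, 6}}

chain = VersionStore()          # one full copy plus a chain of deltas
chain.add_full("v0", versions["v0"])
chain.add_delta("v1", "v0", versions["v1"])
chain.add_delta("v2", "v1", versions["v2"])

materialized = VersionStore()   # every version stored in full
for vid, records in versions.items():
    materialized.add_full(vid, records)

print(chain.storage_cost(), materialized.storage_cost())            # 8 12
print(chain.recreation_cost("v2"), materialized.recreation_cost("v2"))  # 8 4
print(sorted(chain.chain_intersection("v2")), sorted(chain.chain_union("v2")))
```

In this toy setting the delta chain uses less storage (8 versus 12 records) but costs more to recreate the latest version (8 versus 4 records read), which is the tradeoff described above. The intersection and union over the chain are answered from the root and the stored deltas alone.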
Amit Chavan is a PhD candidate in the Department of Computer Science at the University of Maryland, College Park, co-advised by Amol Deshpande and Aravind Srinivasan. He received his Bachelor's degree in Computer Engineering (2011) from Veermata Jijabai Technological Institute, India. He is interested in data management and data-intensive computing, with a particular focus on storage and query processing challenges in dataset version control systems. He has published papers on these topics at CIDR 2015, VLDB 2015, and SIGMOD 2017, in addition to several workshop papers.
