dc.contributor.advisor	Deshpande, Amol	en_US
dc.contributor.author	Chavan, Amit	en_US
dc.date.accessioned	2018-09-12T05:42:17Z
dc.date.available	2018-09-12T05:42:17Z
dc.date.issued	2018	en_US
dc.identifier	https://doi.org/10.13016/M2RR1PQ80
dc.identifier.uri	http://hdl.handle.net/1903/21240
dc.description.abstract	Data-driven methods and products are becoming increasingly common in a variety of communities, leading to a huge diversity of datasets being continuously generated, modified, and analyzed. An increasingly important consideration for the underlying data management systems is that all of these datasets and their versions over time need to be stored and queried for a variety of reasons including, but not limited to, reproducibility, collaboration, provenance, auditing, introspective analysis, and backups. However, most solutions today resort to highly ad hoc and manual version management and sharing techniques, which lead to friction when managing collaborative data science workflows while also introducing inefficiencies. In this dissertation, we introduce a framework for dataset version management, and address the systems building, operator design, and optimization challenges involved in building a dataset version control system. We describe the various challenges and solutions in the context of our system, called DEX, that we have developed to support increasingly complex version management tasks. We show how to use delta-encoding, a key component in managing redundancy, to provide efficient storage and retrieval for thousands of dataset versions, and develop a formalism to understand the various trade-offs in a principled manner. We study the storage-recreation trade-off in detail and provide a suite of inexpensive heuristics to obtain high-quality solutions under different settings. In order to provide a rich interface to specify version management tasks, we design a new query language, called VQUEL, with the ability to query dataset versions and provenance in a unified manner. We study how assumptions on the delta format can help in the design of a logical algebra, which we then use to execute increasingly complex queries efficiently.
A key characteristic of our query execution methods is that the computational cost depends primarily on the size and number of deltas in the expression (typically small), and not on the input dataset versions (which can be very large). Finally, we demonstrate the effectiveness of our techniques through an extensive evaluation of DEX on a mixture of real-world and synthetic datasets.	en_US
dc.language.iso	en	en_US
dc.title	Delta-based Storage and Querying for Versioned Datasets	en_US
dc.type	Dissertation	en_US
dc.contributor.publisher	Digital Repository at the University of Maryland	en_US
dc.contributor.publisher	University of Maryland (College Park, Md.)	en_US
dc.contributor.department	Computer Science	en_US
dc.subject.pqcontrolled	Computer science	en_US

