Dwarf: Shrinking the PetaCube
Dwarf: Shrinking the PetaCube
Files
Publication or External Link
Date
2002-04-04
Authors
Sismanis, Yannis
Deligiannakis, Antonios
Roussopoulos, Nick
Kotidis, Yannis
Advisor
Citation
DRUM DOI
Abstract
Dwarf is a highly compressed structure for computing, storing, and
querying data cubes. Dwarf identifies prefix and suffix structural
redundancies and factors them out by coalescing their store. Prefix
redundancy is high on dense areas of cubes but suffix redundancy
is significantly higher for sparse areas. Putting the two together
fuses the exponential sizes of high dimensional full cubes into a
dramatically condensed data structure. The elimination of suffix
redundancy has an equally dramatic reduction in the computation of the
cube because recomputation of the redundant suffixes is avoided. This
effect is multiplied in the presence of correlation amongst attributes
in the cube. A Petabyte 25-dimensional cube was shrunk this way to a
2.3GB Dwarf Cube, in less than 20 minutes, a 1:400000 storage
reduction ratio. Still, Dwarf provides 100% precision on cube
queries and is a self-sufficient structure which requires no access to
the fact table. What makes Dwarf practical is the automatic
discovery, in a single pass over the fact table, of the prefix and
suffix redundancies without user involvement or knowledge of the
value distributions.
This paper describes the Dwarf structure and the Dwarf cube
construction algorithm. Further optimizations are then introduced for
improving clustering and query performance. Experiments with the
current implementation include comparisons on detailed measurements
with real and synthetic datasets against previously published
techniques. The comparisons show that Dwarfs by far outperform these
techniques on all counts: storage space, creation time, query response
time, and updates of cubes.
Also UMIACS-TR-2002-24