Browsing by Author "Sismanis, Yannis"
Now showing 1 - 3 of 3
Results Per Page
Sort Options
Item The Dwarf Data Cube Eliminates the Highy Dimensionality Curse(2003-12-18) Sismanis, Yannis; Roussopoulos, NickThe data cube operator encapsulates all possible groupings of a data set and has proved to be an invaluable tool in analyzing vast amounts of data. However its apparent exponential complexity has significantly limited its applicability to low dimensional datasets. Recently the idea of the dwarf data cube model was introduced, and showed that high-dimensional ``dwarf data cubes'' are orders of magnitudes smaller in size than the original data cubes even when they calculate and store every possible aggregation with 100\% precision. In this paper we present a surprising analytical result proving that the size of dwarf cubes grows polynomially with the dimensionality of the data set and, therefore, a full data cube at 100% precision is not inherently cursed by high dimensionality. This striking result of polynomial complexity reformulates the context of cube management and redefines most of the problems associated with data-warehousing and On-Line Analytical Processing. We also develop an efficient algorithm for estimating the size of dwarf data cubes before actually computing them. Finally, we complement our analytical approach with an experimental evaluation using real and synthetic data sets, and demonstrate our results. UMIACS-TR-2003-120Item Dwarf: Shrinking the PetaCube(2002-04-04) Sismanis, Yannis; Deligiannakis, Antonios; Roussopoulos, Nick; Kotidis, YannisDwarf is a highly compressed structure for computing, storing, and querying data cubes. Dwarf identifies prefix and suffix structural redundancies and factors them out by coalescing their store. Prefix redundancy is high on dense areas of cubes but suffix redundancy is significantly higher for sparse areas. Putting the two together fuses the exponential sizes of high dimensional full cubes into a dramatically condensed data structure. The elimination of suffix redundancy has an equally dramatic reduction in the computation of the cube because recomputation of the redundant suffixes is avoided. This effect is multiplied in the presence of correlation amongst attributes in the cube. A Petabyte 25-dimensional cube was shrunk this way to a 2.3GB Dwarf Cube, in less than 20 minutes, a 1:400000 storage reduction ratio. Still, Dwarf provides 100% precision on cube queries and is a self-sufficient structure which requires no access to the fact table. What makes Dwarf practical is the automatic discovery, in a single pass over the fact table, of the prefix and suffix redundancies without user involvement or knowledge of the value distributions. This paper describes the Dwarf structure and the Dwarf cube construction algorithm. Further optimizations are then introduced for improving clustering and query performance. Experiments with the current implementation include comparisons on detailed measurements with real and synthetic datasets against previously published techniques. The comparisons show that Dwarfs by far outperform these techniques on all counts: storage space, creation time, query response time, and updates of cubes. Also UMIACS-TR-2002-24Item Shared Index Scans For Data Warehouses(2001-05-10) Kotidis, Yannis; Sismanis, Yannis; Roussopoulos, NickTree based indexing structures like B-trees, B+trees, Bitmap indexes and R-trees have become essential for getting good performance when accessing vast datasets. However, most database research seems to ignore the behavior that the disk hardware observes during index scans. In this paper we aim to refocus attention on efficiently utilizing the underlying hardware during concurrent index scans. We propose a new "transcurrent execution model" (TEM) for concurrent user queries against tree indexes. Our model exploits intra-parallelism of the index scan and dynamically decomposes each query into a set of disjoint "query patches". TEM integrates the ideas of prefetching and shared scans in a new framework, suitable for dynamic multi-user environments. It supports time constraints in the scheduling of these patches and introduces the notion of data flow for achieving a steady progress of all queries. Our experiments demonstrate that the transcurrent query execution results in high locality of I/O which in turn translates to substantial performance benefits in terms of query execution time, buffer hit ratio and disk throughput. These benefits increase as the workload in the warehouse increases and offer a highly scalable solution to the I/O problem of data warehouses. (Cross-referenced as UMIACS-TR-2001-22)