Institute for Systems Research Technical Reports

Permanent URI for this collectionhttp://hdl.handle.net/1903/4376

This archive contains a collection of reports generated by the faculty and students of the Institute for Systems Research (ISR), a permanent, interdisciplinary research unit in the A. James Clark School of Engineering at the University of Maryland. ISR-based projects are conducted through partnerships with industry and government, bringing together faculty and students from multiple academic departments and colleges across the university.

Browse

Search Results

Now showing 1 - 3 of 3
  • Thumbnail Image
    Item
    Recovering Information from Summary Data
    (1997) Faloutsos, Christos; Jagadish, H.V.; Sidiropoulos, N.D.; ISR
    Data is often stored in summarized form, as a histogram of aggregates (COUNTs,SUMs, or AVeraGes) over specified ranges. Queries regarding specific values, or ranges different from those stored, cannot be answered exactly from the summarized data. In this paper we study how to estimate the original detail data from the stored summary.

    We formulate this task as an inverse problem, specifying a well-defined cost function that has to be optimized under constraints.

    In particular, we propose the use of a Linear Regularization method, which ﲭaximizes the smoothness of the estimate. Our main theoretical contribution is a Theorem, which shows that, for smooth enough distributions, we can achieve full recovery from summary data.

    Our theorem is closely related to the well known Shannon-Nyquist sampling theorem.

    We describe how to apply this theory to a variety of database problems, that involve partial information, such as OLAP, data warehousing and histograms in query optimization. Our main practical contribution is that the Linear Regularization method is extremely effective, both on synthetic and on real data. Our experiments show that the proposed approach almost consistently outperforms the ﲵniformity assumption, achieving significant savings in root-mean-square error: up to 20% for stock price data, and up to 90% for smoother data sets.

  • Thumbnail Image
    Item
    Recovering Information from Summary Data
    (1997) Faloutsos, Christos; Jagadish, H.V.; Sidiropoulos, N.D.; ISR
    Data is often stored in summarized form, as a histogram of aggregates (COUNTs,SUMs, or AVeraGes) over specified ranges. Queries regarding specific values, or ranges different from those stored, cannot be answered exactly from the summarized data. In this paper we study how to estimate the original detail data from the stored summary.

    We formulate this task as an inverse problem, specifying a well-defined cost function that has to be optimized under constraints.

    In particular, we propose the use of a Linear Regularization method, which ﲭaximizes the smoothness of the estimate. Our main theoretical contribution is a Theorem, which shows that, for smooth enough distributions, we can achieve full recovery from summary data.

    Our theorem is closely related to the well known Shannon-Nyquist sampling theorem.

    We describe how to apply this theory to a variety of database problems, that involve partial information, such as OLAP, data warehousing and histograms in query optimization. Our main practical contribution is that the Linear Regularization method is extremely effective, both on synthetic and on real data. Our experiments show that the proposed approach almost consistently outperforms the ﲵniformity assumption, achieving significant savings in root-mean-square error: up to 20% for stock price data, and up to 90% for smoother data sets.

  • Thumbnail Image
    Item
    The TV-tree -- an Index Structure for High-Dimensional Data
    (1994) Lin, King-Ip D.; Jagadish, H.V.; Faloutsos, Christos; ISR
    We propose a file structure to index high-dimensionality data, typically, points in some feature space. The idea is to use only a few of the features, utilizing additional features whenever the additional discriminatory power is absolutely necessary. We present in detail the design of our tree structure and the associated algorithms that handle such 'varying length' feature vectors. Finally we report simulation results, comparing the proposed structure with the R*-tree, which is one of the most successful methods for low-dimensionality spaces. The results illustrate the superiority of our method, with up to 80% savings in disk accesses.