Institute for Systems Research Technical Reports
Permanent URI for this collection: http://hdl.handle.net/1903/4376
This archive contains a collection of reports generated by the faculty and students of the Institute for Systems Research (ISR), a permanent, interdisciplinary research unit in the A. James Clark School of Engineering at the University of Maryland. ISR-based projects are conducted through partnerships with industry and government, bringing together faculty and students from multiple academic departments and colleges across the university.
Browse
Search Results
Item Efficient Retrieval of Similar Time Sequences Under Time Warping (1997) Yi, B.; Jagadish, H.V.; Faloutsos, Christos; ISR

Fast similarity searching in large time-sequence databases has attracted a lot of research interest. Most existing methods use the Euclidean distance ($L_2$) or some other variation of the $L_p$ metric. $L_p$ metrics lead to efficient indexing, thanks to feature extraction (e.g., keeping the first few DFT coefficients) and the subsequent use of fast spatial access methods on the points in feature space. In this work we examine a popular, field-tested dissimilarity function, the "time warping" distance, which permits local accelerations and decelerations in the rate of the signals or sequences. This function is natural and suitable for several applications, such as matching voice, audio, and medical signals (e.g., electrocardiograms). From the indexing viewpoint, however, it presents two major challenges: (a) it does not lead to any natural "features", precluding the use of spatial access methods, and (b) it is quadratic ($O(len_1 * len_2)$) in the lengths of the sequences involved. Here we show how to overcome both problems. For the former, we propose a modification of the so-called "FastMap" to map sequences into points, trading off a tiny amount of recall (typically zero) for large gains in speed. For the latter, we provide a fast, linear test that quickly discards many of the false alarms that FastMap typically introduces. Using both ideas in cascade, our proposed method achieved up to a 7.8-fold speed-up over straightforward sequential scanning on both real and synthetic datasets.
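To make the two challenges concrete, here is a minimal sketch (the function names and `eps` threshold are illustrative; the lower bound shown is a classical linear-time envelope bound, not necessarily the paper's exact test, and the FastMap stage is omitted): the quadratic dynamic program for the time-warping distance, plus a cheap linear filter that can safely discard many candidates before the full computation.

```python
def time_warping_distance(x, y):
    """O(len(x) * len(y)) dynamic program; cost is |x_i - y_j| per step."""
    n, m = len(x), len(y)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # stutter in x, stutter in y, or advance both
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def lower_bound(x, y):
    """Linear-time lower bound: every x_i is matched to some y_j on any
    warping path, so any x_i outside [min(y), max(y)] pays at least its
    overshoot."""
    lo, hi = min(y), max(y)
    return sum(max(v - hi, lo - v, 0.0) for v in x)

def similar(x, y, eps):
    if lower_bound(x, y) > eps:   # cheap O(n) filter
        return False              # guaranteed not similar
    return time_warping_distance(x, y) <= eps  # expensive exact test

print(similar([1, 2, 3, 2, 1], [1, 2, 2, 3, 2, 1], eps=1.0))  # True
```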
Item Recovering Information from Summary Data (1997) Faloutsos, Christos; Jagadish, H.V.; Sidiropoulos, N.D.; ISR

Data is often stored in summarized form, as a histogram of aggregates (COUNTs, SUMs, or AVeraGes) over specified ranges. Queries regarding specific values, or ranges different from those stored, cannot be answered exactly from the summarized data. In this paper we study how to estimate the original detail data from the stored summary. We formulate this task as an inverse problem, specifying a well-defined cost function that has to be optimized under constraints. In particular, we propose the use of a Linear Regularization method, which maximizes the smoothness of the estimate. Our main theoretical contribution is a theorem which shows that, for smooth enough distributions, we can achieve full recovery from summary data.
Our theorem is closely related to the well-known Shannon-Nyquist sampling theorem.
We describe how to apply this theory to a variety of database problems that involve partial information, such as OLAP, data warehousing, and histograms in query optimization. Our main practical contribution is that the Linear Regularization method is extremely effective on both synthetic and real data. Our experiments show that the proposed approach almost consistently outperforms the "uniformity" assumption, achieving significant savings in root-mean-square error: up to 20% for stock price data, and up to 90% for smoother data sets.
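As a rough illustration of the recovery setup (a minimal sketch under assumed notation, not the paper's exact formulation: the range matrix `A`, the penalty weight `lam`, and the first-difference smoother are all illustrative choices), the detail vector can be estimated by least squares with a smoothness penalty:

```python
# Recover a detail vector x from stored range SUMs b = A x by minimizing
# ||A x - b||^2 + lam * ||D x||^2, where D penalizes non-smoothness.
import numpy as np

def recover_detail(ranges, sums, n, lam=1.0):
    """ranges: list of (start, end) half-open index ranges; sums: stored
    SUM per range; n: length of the detail vector to estimate."""
    A = np.zeros((len(ranges), n))
    for r, (s, e) in enumerate(ranges):
        A[r, s:e] = 1.0               # each row sums one stored range
    D = np.diff(np.eye(n), axis=0)    # first-difference (smoothness) operator
    # Normal equations of the regularized least-squares problem
    lhs = A.T @ A + lam * D.T @ D
    rhs = A.T @ np.asarray(sums, dtype=float)
    return np.linalg.solve(lhs, rhs)

# Example: two stored range sums over a length-6 detail vector
x_hat = recover_detail([(0, 3), (3, 6)], sums=[9.0, 18.0], n=6)
print(x_hat.round(2))  # a smooth estimate that approximately reproduces the sums
```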
Item Analysis of the n-dimensional quadtree decomposition for arbitrary hyper-rectangles (1994) Faloutsos, Christos; Jagadish, H.V.; Manolopoulos, Yannis; ISR

We give a closed-form expression for the average number of n-dimensional quadtree nodes ('pieces' or 'blocks') required by an n-dimensional hyper-rectangle aligned with the axes. Our formula includes as special cases the formulae of previous efforts for 2-dimensional spaces [8]. It also agrees with theoretical and empirical results that the number of blocks depends on the hyper-surface of the hyper-rectangle and not on its hyper-volume. The practical use of the derived formula is that it allows the estimation of the space requirements of the n-dimensional quadtree decomposition. Quadtrees are used extensively in 2-dimensional spaces (geographic information systems and spatial databases in general), as well as in higher-dimensional spaces (as oct-trees for 3-dimensional spaces, e.g., in graphics, robotics, and 3-dimensional medical images [2]). Our formula permits the estimation of the space requirements for data hyper-rectangles when stored in an index structure like an (n-dimensional) quadtree, as well as the estimation of the search time for query hyper-rectangles. A theoretical contribution of the paper is the observation that the number of blocks is a piecewise-linear function of the sides of the hyper-rectangle.
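The quantity the closed-form formula estimates can be computed exactly for any given rectangle by direct decomposition. Below is a minimal sketch (assuming half-open integer coordinates in a power-of-two space; `count_blocks` and the 2-D example are illustrative, not the paper's formula):

```python
def count_blocks(rect, cell, side):
    """Count quadtree blocks covering an axis-aligned hyper-rectangle.
    rect: (lo, hi) per dimension, half-open; cell: lower corner of the
    current block; side: current block side (a power of two)."""
    # empty intersection -> no blocks needed
    for d, (lo, hi) in enumerate(rect):
        if hi <= cell[d] or lo >= cell[d] + side:
            return 0
    # block fully inside the rectangle -> a single block suffices
    if all(lo <= cell[d] and cell[d] + side <= hi
           for d, (lo, hi) in enumerate(rect)):
        return 1
    # otherwise split into 2^n children and recurse
    half = side // 2
    n = len(rect)
    return sum(count_blocks(rect,
                            tuple(cell[d] + (half if (mask >> d) & 1 else 0)
                                  for d in range(n)),
                            half)
               for mask in range(2 ** n))

# 2-D example: a 3x5 rectangle at offset (1, 2) inside a 16x16 space
print(count_blocks(((1, 4), (2, 7)), (0, 0), 16))
```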
Item The TV-tree -- an Index Structure for High-Dimensional Data (1994) Lin, King-Ip D.; Jagadish, H.V.; Faloutsos, Christos; ISR

We propose a file structure to index high-dimensionality data, typically points in some feature space. The idea is to use only a few of the features, utilizing additional features whenever the additional discriminatory power is absolutely necessary. We present in detail the design of our tree structure and the associated algorithms that handle such 'varying-length' feature vectors. Finally, we report simulation results comparing the proposed structure with the R*-tree, which is one of the most successful methods for low-dimensionality spaces. The results illustrate the superiority of our method, with up to 80% savings in disk accesses.
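The pruning idea behind varying-length feature vectors can be sketched as follows (a minimal illustration, not the paper's actual node layout or algorithms; `TelescopingRegion`, `min_dist`, and `eps` are assumed names). Assuming all points in a region lie within `radius` of the stored center on its active dimensions, a distance computed on that prefix of dimensions already lower-bounds the full-dimensional distance, so most regions can be discarded using only a few features:

```python
import math

class TelescopingRegion:
    def __init__(self, center_prefix, radius):
        self.center = center_prefix   # only the first k feature values
        self.radius = radius

    def min_dist(self, query):
        """Bound using only the active dimensions: the L2 distance over a
        prefix of dimensions never exceeds the full L2 distance, and the
        triangle inequality accounts for the region's radius."""
        k = len(self.center)
        d = math.sqrt(sum((q - c) ** 2
                          for q, c in zip(query[:k], self.center)))
        return max(0.0, d - self.radius)

def prune(regions, query, eps):
    # keep only regions the active-dimension bound cannot rule out
    return [r for r in regions if r.min_dist(query) <= eps]

regions = [TelescopingRegion([0.1, 0.9], 0.2), TelescopingRegion([0.8], 0.1)]
survivors = prune(regions, query=[0.15, 0.85, 0.4, 0.7], eps=0.05)
print(len(survivors))  # 1: the second region is pruned on one feature alone
```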
Item Diamond-Tree: An Index Structure for High-Dimensionality Approximate Searching (1992) Faloutsos, Christos; Jagadish, H.V.; ISR

A selection query applied to a database often has its selection predicate imperfectly specified. We present a technique, called the Diamond-tree, for indexing fields to perform similarity-based retrieval, given some applicable measures of approximation. Typically, the number of features (or dimensions of similarity) is large, so the search space has high dimensionality and most traditional methods perform poorly. As a test case, we show how the Diamond-tree technique can be used to perform retrievals based on incorrectly or approximately specified values for string fields. Experimental results show that our method can respond to approximate-match queries by examining only a small portion (1%-5%) of the database.
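For the string-field test case, the general filter-and-verify pattern looks like this (a minimal sketch in the spirit of approximate string retrieval, not the Diamond-tree itself; `approx_search` and the threshold `k` are illustrative):

```python
def edit_distance(a, b):
    """Classic O(len(a) * len(b)) Levenshtein dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

def approx_search(strings, query, k):
    """Return all strings within edit distance k of the query."""
    hits = []
    for s in strings:
        if abs(len(s) - len(query)) > k:   # length difference lower-bounds
            continue                       # the edit distance: safe to skip
        if edit_distance(s, query) <= k:
            hits.append(s)
    return hits

print(approx_search(["diamond", "dynamo", "demand"], "dimond", k=2))
# ['diamond', 'demand']
```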