Institute for Systems Research Technical Reports

Permanent URI for this collection: http://hdl.handle.net/1903/4376

This archive contains a collection of reports generated by the faculty and students of the Institute for Systems Research (ISR), a permanent, interdisciplinary research unit in the A. James Clark School of Engineering at the University of Maryland. ISR-based projects are conducted through partnerships with industry and government, bringing together faculty and students from multiple academic departments and colleges across the university.

Search Results

Now showing 1 - 10 of 22
  • Efficient Retrieval of Similar Time Sequences Under Time Warping
    (1997) Yi, B.; Jagadish, H.V.; Faloutsos, Christos; ISR
    Fast similarity searching in large time-sequence databases has attracted a lot of research interest. Most existing methods use the Euclidean distance ($L_2$) or some variation of an $L_p$ metric. $L_p$ metrics lead to efficient indexing, thanks to feature extraction (e.g., by keeping the first few DFT coefficients) and subsequent use of fast spatial access methods for the points in feature space. In this work we examine a popular, field-tested dissimilarity function, the "time warping" distance function, which permits local accelerations and decelerations in the rate of the signals or sequences. This function is natural and suitable for several applications, such as matching of voice, audio and medical signals (e.g., electrocardiograms). However, from the indexing viewpoint it presents two major challenges: (a) it does not lead to any natural "features", precluding the use of spatial access methods, and (b) it is quadratic ($O(len_1 \cdot len_2)$) in the lengths of the sequences involved. Here we show how to overcome both problems: for the former, we propose using a modification of the so-called "FastMap", to map sequences into points, trading off a tiny amount of "recall" (typically zero) for large gains in speed. For the latter, we provide a fast, linear test to help us quickly discard many of the false alarms that FastMap will typically introduce. Using both ideas in cascade, our proposed method achieved a speed-up of up to 7.8 times over straightforward sequential scanning, on both real and synthetic datasets.
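
    As a point of reference, the following is a minimal Python sketch (not the paper's code; names are illustrative) of the quadratic dynamic-programming recurrence behind the time-warping distance, i.e., the $O(len_1 \cdot len_2)$ cost that the fast linear test is designed to sidestep:

    ```python
    def time_warping_distance(x, y):
        """Quadratic DP for the time-warping distance: each sequence may
        locally stretch or compress, so cell (i, j) extends the cheapest of
        the three neighboring alignments."""
        n, m = len(x), len(y)
        INF = float("inf")
        d = [[INF] * (m + 1) for _ in range(n + 1)]
        d[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(x[i - 1] - y[j - 1])
                d[i][j] = cost + min(d[i - 1][j],      # x stalls (deceleration)
                                     d[i][j - 1],      # y stalls
                                     d[i - 1][j - 1])  # both advance
        return d[n][m]

    print(time_warping_distance([1, 2, 3, 4], [1, 1, 2, 3, 3, 4]))  # 0.0
    ```
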
  • Quantifiable Data Mining Using Principal Component Analysis
    (1997) Faloutsos, Christos; Korn, Flip; Labrinidis, Alexandros; Kotidis, Yannis; Kaplunovich, Alex; Perkovic, Dejan; ISR
    Association Rule Mining algorithms operate on a data matrix (e.g., customers $\times$ products) to derive rules [2,23]. We propose a single-pass algorithm for mining linear rules in such a matrix, based on Principal Component Analysis (PCA). PCA detects correlated columns of the matrix, which correspond to, e.g., products that sell together.

    The first contribution of this work is that we propose to quantify the "goodness" of a set of discovered rules. We define the "guessing error": the root-mean-square error of the reconstructed values of the cells of the given matrix, when we pretend that they are unknown. The second contribution is a novel method to guess missing/hidden values from the linear rules that our method derives. For example, if somebody bought $10 of milk and $3 of bread, our rules can "guess" the amount spent on, say, butter. Thus, we can perform a variety of important tasks such as forecasting, "what-if" scenarios, outlier detection, and visualization. Moreover, we show that we can compute the principal components with a single pass over the dataset.

    Experiments on real datasets (e.g., NBA statistics) demonstrate that the proposed method consistently achieves a "guessing error" up to 5 times lower than that of the straightforward competitor.
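
    A hedged sketch of the core computation behind such a "guessing error", assuming a batch SVD in place of the paper's single-pass computation: reconstruct the matrix from its top-k principal components and measure the RMSE between actual and reconstructed cells.

    ```python
    import numpy as np

    def guessing_error(X, k):
        """RMSE between the matrix and its reconstruction from the top-k
        principal components (the k strongest "linear rules")."""
        mean = X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
        X_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :] + mean
        return np.sqrt(np.mean((X - X_hat) ** 2))

    rng = np.random.default_rng(0)
    # near rank-1 "customers x products" matrix: one rule explains the data
    basket = rng.random((100, 1)) @ rng.random((1, 5)) + 0.01 * rng.random((100, 5))
    print(guessing_error(basket, k=1))  # close to zero
    ```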

  • Recovering Information from Summary Data
    (1997) Faloutsos, Christos; Jagadish, H.V.; Sidiropoulos, N.D.; ISR
    Data is often stored in summarized form, as a histogram of aggregates (COUNTs, SUMs, or AVGs) over specified ranges. Queries regarding specific values, or ranges different from those stored, cannot be answered exactly from the summarized data. In this paper we study how to estimate the original detail data from the stored summary.

    We formulate this task as an inverse problem, specifying a well-defined cost function that has to be optimized under constraints.

    In particular, we propose the use of a Linear Regularization method, which maximizes the smoothness of the estimate. Our main theoretical contribution is a theorem which shows that, for smooth enough distributions, we can achieve full recovery from summary data.

    Our theorem is closely related to the well-known Shannon-Nyquist sampling theorem.

    We describe how to apply this theory to a variety of database problems that involve partial information, such as OLAP, data warehousing, and histograms in query optimization. Our main practical contribution is the observation that the Linear Regularization method is extremely effective, both on synthetic and on real data. Our experiments show that the proposed approach almost always outperforms the "uniformity" assumption, achieving significant savings in root-mean-square error: up to 20% for stock price data, and up to 90% for smoother data sets.
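
    The inverse-problem setup can be made concrete. Assuming bucket SUMs over contiguous ranges and a second-difference smoothness penalty (an illustrative choice, not necessarily the paper's exact cost function), recovering the detail vector x from the summary b = Ax reduces to solving the linear normal equations of a penalized least-squares problem:

    ```python
    import numpy as np

    def recover_detail(bucket_sums, bucket_sizes, lam=1.0):
        """Minimize ||A x - b||^2 + lam * ||D x||^2, where A aggregates
        detail values into bucket SUMs and D takes second differences
        (so the penalty favors smooth estimates)."""
        n = sum(bucket_sizes)
        A = np.zeros((len(bucket_sums), n))
        start = 0
        for row, size in enumerate(bucket_sizes):
            A[row, start:start + size] = 1.0   # each bucket sums its range
            start += size
        D = np.diff(np.eye(n), n=2, axis=0)    # second-difference operator
        b = np.asarray(bucket_sums, dtype=float)
        return np.linalg.solve(A.T @ A + lam * D.T @ D, A.T @ b)

    # three buckets of width 4 summarizing the smooth ramp 1..12
    print(recover_detail([10, 26, 42], [4, 4, 4]).round(2))
    ```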

  • Fast Nearest Neighbor Search in Medical Image Databases
    (1996) Korn, Flip; Sidiropoulos, N.; Faloutsos, Christos; ISR
    We examine the problem of finding similar tumor shapes. Starting from a natural similarity function (the so-called 'max morphological distance'), we showed how to lower-bound it and how to search for nearest neighbors in large collections of tumor-like shapes.

    Specifically, we used state-of-the-art concepts from morphology, namely the 'pattern spectrum' of a shape, to map each shape to a point in n-dimensional space. Following [19, 36], we organized the n-d points in an R-tree. We showed that the $L_\infty$ (= max) norm in the n-d space lower-bounds the actual distance. This guarantees no false dismissals for range queries. In addition, we developed a nearest neighbor algorithm that also guarantees no false dismissals.

    Finally, we implemented the method and tested it against a testbed of realistic tumor shapes, using an established tumor-growth model of Murray Eden [15]. The experiments showed that our method is up to 27 times faster than straightforward sequential scanning.
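
    The filter-and-refine step that this lower bound enables can be sketched as follows (function and variable names are illustrative, not the paper's):

    ```python
    def range_query(query_feat, catalog, epsilon, true_distance):
        """catalog: (feature_vector, shape) pairs; true_distance(shape)
        returns the expensive morphological distance to the query shape."""
        hits = []
        for feat, shape in catalog:
            # L-infinity distance in feature space: a cheap lower bound on
            # the true distance, so anything beyond epsilon is safely dropped
            lower = max(abs(q - f) for q, f in zip(query_feat, feat))
            if lower <= epsilon and true_distance(shape) <= epsilon:
                hits.append(shape)
        return hits
    ```

    Because the feature-space distance never exceeds the true distance, the filter can produce false alarms but never false dismissals.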

  • Estimating the Selectivity of Spatial Queries Using the 'Correlation' Fractal Dimension
    (1995) Belussi, Alberto; Faloutsos, Christos; ISR
    We examine the estimation of selectivities for range and spatial join queries in real spatial databases. As we have shown earlier [FK94a], real point sets: (a) consistently violate the 'uniformity' and 'independence' assumptions, and (b) can often be described as 'fractals', with non-integer (fractal) dimension. In this paper we show that, among the infinite family of fractal dimensions, the so-called 'Correlation Dimension' $D_2$ is the one that we need to predict the selectivity of spatial joins.

    The main contribution is that, for all the real and synthetic point-sets we tried, the average number of neighbors of a given point of the point-set follows a power law, with $D_2$ as the exponent. This immediately solves the selectivity estimation for spatial joins, as well as for 'biased' range queries (i.e., queries whose centers prefer areas of high point density).

    We present the formulas to estimate the selectivity for the biased queries, including an integration constant ($K_{shape}$) for each query shape. Finally, we show results on real and synthetic point sets, where our formulas achieve very low relative errors (typically about 10%, versus 40%-100% for the uniformity assumption).
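
    The power law can be sketched directly: since the average number of neighbors within radius r grows as $r^{D_2}$, the slope of log(pair count) versus log(r) estimates $D_2$. This illustrative estimator is not the paper's exact procedure:

    ```python
    import numpy as np

    def correlation_dimension(points, radii):
        points = np.asarray(points, dtype=float)
        diffs = points[:, None, :] - points[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=-1))
        # ordered pairs within radius r, excluding self-pairs
        counts = [(dists < r).sum() - len(points) for r in radii]
        slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
        return slope

    rng = np.random.default_rng(1)
    pts = rng.random((500, 2))  # uniform 2-d point set: expect D2 near 2
    print(correlation_dimension(pts, radii=[0.05, 0.1, 0.2]))
    ```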

  • Similarity Searching in Large Image DataBases
    (1994) Petrakis, E.G.M.; Faloutsos, Christos; ISR
    We propose a method to handle approximate searching by image content in large image databases. Image content is represented by attributed relational graphs holding features of objects and relationships between objects. The method relies on the assumption that a fixed number of 'labeled' or 'expected' objects (e.g., 'heart', 'lungs', etc.) are common in all images of a given application domain, in addition to a variable number of 'unexpected' or 'unlabeled' objects (e.g., 'tumor', 'hematoma', etc.). The method can answer queries by example, such as 'find all X-rays that are similar to Smith's X-ray'. The stored images are mapped to points in a multidimensional space and are indexed using state-of-the-art database methods (R-trees). The proposed method has several desirable properties: (a) database search is approximate, so that all images up to a pre-specified degree of similarity (tolerance) are retrieved; (b) it has no 'false dismissals' (i.e., all images qualifying the query selection criteria are retrieved); and (c) it scales up well as the database grows. We implemented the method and ran experiments on a database of synthetic (but realistic) medical images. The experiments showed that our method significantly outperforms sequential scanning, by up to an order of magnitude.
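
    The mapping step can be sketched as follows: if every image in the domain contains the same labeled objects, concatenating a fixed set of attributes per labeled object turns each image into a point that an R-tree can index (the labels and attribute names below are illustrative assumptions):

    ```python
    EXPECTED = ["heart", "left_lung", "right_lung"]  # labeled objects
    ATTRS = ["area", "perimeter", "x", "y"]          # per-object attributes

    def image_to_point(objects):
        """objects: dict mapping object label -> dict of attribute values.
        The result is a fixed-length vector, suitable for an R-tree."""
        return [objects[label][attr] for label in EXPECTED for attr in ATTRS]
    ```
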
  • Analysis of the n-dimensional quadtree decomposition for arbitrary hyper-rectangles
    (1994) Faloutsos, Christos; Jagadish, H.V.; Manolopoulos, Yannis; ISR
    We give a closed-form expression for the average number of n-dimensional quadtree nodes ('pieces' or 'blocks') required by an n-dimensional hyper-rectangle aligned with the axes. Our formula includes as special cases the formulae of previous efforts for 2-dimensional spaces [8]. It also agrees with theoretical and empirical results that the number of blocks depends on the hyper-surface of the hyper-rectangle and not on its hyper-volume. The practical use of the derived formula is that it allows the estimation of the space requirements of the n-dimensional quadtree decomposition. Quadtrees are used extensively in 2-dimensional spaces (geographic information systems and spatial databases in general), as well as in higher-dimensionality spaces (as oct-trees for 3-dimensional spaces, e.g., in graphics, robotics and 3-dimensional medical images [2]). Our formula permits the estimation of the space requirements for data hyper-rectangles when stored in an index structure like an (n-dimensional) quadtree, as well as the estimation of the search time for query hyper-rectangles. A theoretical contribution of the paper is the observation that the number of blocks is a piecewise-linear function of the sides of the hyper-rectangle.
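
    For intuition, here is a small sketch of the 2-dimensional decomposition being counted; it enumerates the blocks directly rather than applying the closed-form formula:

    ```python
    def quadtree_blocks(rect, size):
        """Count quadtree blocks covering an axis-aligned rectangle
        rect = (x1, y1, x2, y2) with integer corners inside [0, size]^2;
        size must be a power of two. A quadrant fully covered by the
        rectangle is one block; a partial overlap splits into four."""
        def count(qx, qy, qsize):
            x1, y1, x2, y2 = rect
            if x2 <= qx or x1 >= qx + qsize or y2 <= qy or y1 >= qy + qsize:
                return 0  # disjoint from this quadrant
            if x1 <= qx and y1 <= qy and x2 >= qx + qsize and y2 >= qy + qsize:
                return 1  # quadrant fully covered: one block
            half = qsize // 2
            return (count(qx, qy, half) + count(qx + half, qy, half) +
                    count(qx, qy + half, half) + count(qx + half, qy + half, half))
        return count(0, 0, size)

    print(quadtree_blocks((1, 1, 6, 6), 8))  # a 5x5 rectangle in an 8x8 space
    ```
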
  • FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets
    (1994) Faloutsos, Christos; Lin, King-Ip D.; ISR
    A very promising idea for fast searching in traditional and multimedia databases is to map objects into points in k-d space, using k feature-extraction functions provided by a domain expert [Jag91]. Thus, we can subsequently use highly fine-tuned spatial access methods (SAMs) to answer several types of queries, including the 'Query By Example' type (which translates to a range query); the 'all pairs' query (which translates to a spatial join [BKSS94]); the nearest-neighbor or best-match query, etc.

    However, designing feature extraction functions can be hard. It is relatively easier for a domain expert to assess the similarity/distance of two objects. Given only the distance information, though, it is not obvious how to map objects into points.

    This is exactly the topic of this paper. We describe a fast algorithm to map objects into points in some k-dimensional space (k is user-defined), such that the dissimilarities are preserved. There are two benefits from this mapping: (a) efficient retrieval, in conjunction with a SAM, as discussed before, and (b) visualization and data-mining: the objects can now be plotted as points in 2-d or 3-d space, revealing potential clusters, correlations among attributes, and other regularities that data-mining is looking for.

    We introduce an older method from pattern recognition, namely Multi-Dimensional Scaling (MDS) [Tor52]; although unsuitable for indexing, we use it as a yardstick for our method. Then, we propose a much faster algorithm to solve the problem at hand, which in addition allows for indexing. Experiments on real and synthetic data indeed show that the proposed algorithm is significantly faster than MDS (being linear, as opposed to quadratic, in the database size N), while it manages to preserve distances and the overall structure of the data-set.
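
    A compact sketch of the projection idea: pick two distant "pivot" objects, place every object on the line through them via the cosine law, then recurse on the residual distance for the next coordinate (the pivot-selection heuristic below is simplified relative to the paper's):

    ```python
    import math

    def fastmap(objects, dist, k):
        """Map each object to k coordinates using only the distance function.
        Each coordinate costs O(N) distance evaluations, hence linear in N."""
        coords = {o: [] for o in objects}

        def residual(a, b, depth):
            # distance with the contribution of earlier coordinates removed
            d2 = dist(a, b) ** 2 - sum((coords[a][t] - coords[b][t]) ** 2
                                       for t in range(depth))
            return math.sqrt(max(d2, 0.0))

        for depth in range(k):
            a = objects[0]  # simplified pivot choice
            b = max(objects, key=lambda o: residual(a, o, depth))
            a = max(objects, key=lambda o: residual(b, o, depth))
            dab = residual(a, b, depth)
            for o in objects:
                x = 0.0 if dab == 0 else (
                    residual(a, o, depth) ** 2 + dab ** 2
                    - residual(b, o, depth) ** 2) / (2 * dab)
                coords[o].append(x)
        return coords

    corners = [(0, 0), (10, 0), (0, 10), (10, 10)]
    print(fastmap(corners, math.dist, k=2))
    ```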

  • The TV-tree -- an Index Structure for High-Dimensional Data
    (1994) Lin, King-Ip D.; Jagadish, H.V.; Faloutsos, Christos; ISR
    We propose a file structure to index high-dimensionality data, typically points in some feature space. The idea is to use only a few of the features, utilizing additional features only when the additional discriminatory power is absolutely necessary. We present in detail the design of our tree structure and the associated algorithms that handle such 'varying-length' feature vectors. Finally, we report simulation results comparing the proposed structure with the R*-tree, which is one of the most successful methods for low-dimensionality spaces. The results illustrate the superiority of our method, with up to 80% savings in disk accesses.
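
    A loose sketch of the 'telescoping' intuition: an entry stores only as many leading feature dimensions as are needed to discriminate it from its siblings, and a query is matched against whatever prefix an entry stores (the actual tree structure, split policies, and region shapes follow the paper; everything below is illustrative):

    ```python
    def matches_prefix(query, entry_prefix, tol):
        """entry_prefix may be shorter than query: only the entry's active
        (leading) dimensions participate in the comparison."""
        return all(abs(q - e) <= tol for q, e in zip(query, entry_prefix))

    shallow = [0.1, 0.2]             # 2 active dimensions near the root
    deep = [0.1, 0.2, 0.33, 0.47]    # extended where siblings share the prefix
    query = [0.1, 0.2, 0.3, 0.6]
    print(matches_prefix(query, shallow, tol=0.05))  # True: stored prefix agrees
    print(matches_prefix(query, deep, tol=0.05))     # False: 4th dimension differs
    ```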