Browsing by Author "Faloutsos, Christos"
Results Per Page
Sort Options
Item Analysis of Object Oriented Spatial Access Methods.(1987) Faloutsos, Christos; Sellis, T.; Roussopoulos, N.; ISRThis paper provides an analysis of R-trees and a variation (R+ - trees) that avoids overlapping rectangles in intermediate nodes of the tree. The main contributions of the paper are the following. First, we show how the transformation of objects to higher dimensions (first proposed by Hinrichs and Nievergelt [HINR83]) can be effectively used as a tool for the analysis of R- and R+ -trees. Second, in this paper we provide the first known analysis of R-trees. Although formulas are given for objects in one dimension (line segments), they can be easily generalized for objects of higher dimensions as well. Finally, we derive formulas for R+ -trees and compare the two methods analytically. The results we obtained show that R+ -trees require less than half the disk accesses required by a corresponding R- tree when searching files of real life sizes. R+ -trees are clearly superior in cases where there are few long segments and a lot of small ones.Item Analysis of the Clustering Properties of Hilbert Space-filling Curve(1998-10-15) Moon, Bongki; Jagadish, H.V.; Faloutsos, Christos; Saltz, JoelSeveral schemes for linear mapping of multidimensional space have been proposed for many applications such as access methods for spatio-temporal databases, image compression and so on. In all these applications, one of the most desired properties from such linear mappings is clustering, which means the locality between objects in the multidimensional space is preserved in the linear space. It is widely believed that Hilbert space-filling curve achieves the best clustering. In this paper we provide closed-form formulas of the number of clusters required by a given query region of an arbitrary shape (e.g., polygons and polyhedra) for Hilbert space-filling curve. Both the asymptotic solution for a general case and the exact solution for a special case generalize the previous work, and they agree with the empirical results that the number of clusters depends on the hyper-surface area of the query region and not on its hyper-volume. We have also shown that Hilbert curve achieves better clustering than z-curve. From the practical point of view, the formulas given in this paper provide a simple measure which can be used to predict the required disk access behaviors and hence the total access time. (Also cross-referenced as UMIACS-TR-96-20)Item Analysis of the n-dimensional quadtree decomposition for arbitrary hyper-rectangles(1998-10-15) Faloutsos, Christos; Jagadish, H.V.; Manolopoulos, YannisWe give a closed-form expression for the average number of $n$-dimensional quadtree nodes (`pieces' or `blocks') required by an $n$-dimensional hyper-rectangle aligned with the axes. Our formula includes as special cases the formulae of previous efforts for 2-dimensional spaces \cite{Faloutsos92Analytical}. It also agrees with theoretical and empirical results that the number of blocks depends on the hyper-surface of the hyper-rectangle and not on its hyper-volume. The practical use of the derived formula is that it allows the estimation of the space requirements of the $n$-dimensional quadtree decomposition. Quadtrees are used extensively in 2-dimensional spaces (geographic information systems and spatial databases in general), as well in higher dimensionality spaces (as oct-trees for 3-dimensional spaces, e.g. in graphics, robotics and 3-dimensional medical images [Arya et al., 1994]. Our formula permits the estimation of the space requirements for data hyper-rectangles when stored in an index structure like a ($n$-dimensional) quadtree, as well as the estimation of the search time for query hyper-rectangles. A theoretical contribution of the paper is the observation that the number of blocks is a piece-wise linear function of the sides of the hyper-rectangle. (Also cross-referenced as UMIACS-TR-94-130)Item Analysis of the n-dimensional quadtree decomposition for arbitrary hyper-rectangles(1994) Faloutsos, Christos; Jagadish, H.V.; Manolopoulos, Yannis; ISRWe give a closed-form expression for the average number of n- dimensional quadtree nodes (ieces' or locks') required by an n-dimensional hyper-rectangle aligned with the axes. Our formula includes as special cases the formulae of previous efforts for 2- dimensional spaces [8]. It also agrees with theoretical and empirical results that the number of blocks depends on the hyper- surface of the hyper-rectangle and not on its hyper-volume. The practical use of the derived formula is that it allows the estimation of the space requirements of the n-dimensional quadtree decomposition. Quadtrees are used extensively in 2- dimensional spaces (geographic information systems and spatial databases in general), as well in higher dimensionality spaces (as oct-trees for 3-dimensional spaces, e.g. in graphics, robotics and 3-dimensional medical images [2]). Our formula permits the estimation of the space requirements for data hyper- rectangles when stored in an index structure like a (n- dimensional) quadtree, as well as the estimation of the search time for query hyper-rectangles. A theoretical contribution of the paper is the observation that the number of blocks is a piece-wise linear function of the sides of the hyper-rectangle.Item Beyond Uniformity and Independence: Analysis of R-Trees Using the Concept of Fractal Dimension",(1998-10-15) Faloutsos, Christos; Kamel, IbrahimWe propose the concept of fractal dimension of a set of points, in order to quantify the deviation from the uniformity distribution. Using measurements on real data sets (road intersections of U.S. counties, star coordinates from NASA's Infrared-Ultraviolet Explorer etc.) we provide evidence that real data indeed are skewed, and, moreover, we show that they behave as mathematical fractals, with a measurable, non-integer fractal dimension. Armed with this tool, we then show its practical use in predicting the performance of spatial access methods, and specifically of the R-trees. We provide the {\em first} analysis of R-trees for skewed distributions of points: We develop a formula that estimates the number of disk accesses for range queries, given only the fractal dimension of the point set, and its count. Experiments on real data sets show that the formula is very accurate: the relative error is usually below 5\%, and it rarely exceeds 10\%. We believe that the fractal dimension will help replace the uniformity and independence assumptions, allowing more accurate analysis for {\em any} spatial access method, as well as better estimates for query optimization on multi-attribute queries. NOTE - Appeared in PODS 1994. Christos Faloutsos and Ibrahim Kamel. "Beyond Uniformity and Independence: Analysis of R-Trees Using the Concept of Fractal Dimension", Proc. ACM SIGACT-SIGMOD-SIGART PODS. Minneapolis, MN (May 1994), pp. 4-13. (Also cross-referenced as UMIACS-TR-93-130)Item Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension(1993) Faloutsos, Christos; Kamel, Ibrahim; ISRWe propose the concept of fractal dimension of a set of points, in order to quantify the deviation from the uniformity distribution. Using measurements on real data sets (road intersections of U.S. counties, star coordinates from NASA's Infrared-Ultraviolet Explorer etc.) we provide evidence that real data indeed are skewed, and, moreover, we show that they behave as mathematical fractals, with a measurable, non-integer fractal dimension.Armed with this tool, we then show its practical use in predicting the performance of spatial access methods, and specifically of the R-trees. We provide the first analysis of R- trees for skewed distributions of points: We develop a formula that estimates the number of disk accesses for range queries, given only the fractal dimension of the point set, and its count. Experiments on real data sets show that the formula is very accurate: the relative error is usually below 5%, and it rarely exceeds 10%.
We believe that the fractal dimension will help replace the uniformity and independence assumptions, allowing more accurate analysis for any spatial access method, as well as better estimates for query optimization on multi-attribute queries.
Item Declustering R-Tree on Multi-Computer Architectures(1994) Koudas, N.; Faloutsos, Christos; Kamel, Ibrahim; ISRWe study a method to decluster a spatial access method (and specifically an R-tree) on a shared-nothing multi-computer architecture [9]. Our first step is to propose a software architecture, with the top levels of the R-tree on the 'master'server' and the leaf nodes distributed across the servers. Nest, we study the optimal capacity of leaf nodes, or 'chunk size'. We express the response time on range queries as a function of the 'chunk size', and we show how to optimize it. This formula assumes that the 'chunks' are perfectly declustered. We propose to use the Hilbert curve to achieve such a good declustering.Finally, we implemented our method on a network of workstations and we compared the experimental and the theoretical results. The conclusions are that (a) our formula for the response time is accurate (the maximum relative error was 30%; the typical error was in the vicinity of 10-15%) (b) the Hilbert-based declustering consistently outperforms a random declustering (c) most importantly, although the optimal chunk size depends on several factors (database size, size of the query, speed of the network), a safe choice for it is 1 page (whichever is the page size of the operating system). We show analytically and experimentally that a chunk size of 1 page gives either optimal or close to optimal results, for a wide range of the parameters.
Item Designing Access Methods for Bitemporal Databases(1998-10-15) Kumar, Anil; Tsotras, Vassilis J.; Faloutsos, ChristosBy supporting the valid and transaction time dimensions, bitemporal databases represent reality more accurately than conventional databases. In this paper we examine the issues involved in designing efficient access methods for bitemporal databases and propose the partial-persistence and the double-tree methodologies. The partial- persistence methodology reduces bitemporal queries to partial persistence problems for which an efficient access method is then designed. The double-tree methodology "sees" each bitemporal data object as consisting of two intervals (a valid-time and a transaction- time interval), and divides objects into two categories according to whether the right endpoint of the transaction time interval is already known. A common characteristic of both methodologies is that they take into account the properties of each time dimension. Their performance is compared with a straightforward approach that "sees" the intervals associated with a bitemporal object as composing one rectangle which is stored in a single multidimensional access method. Given that some limited additional space is available, our experimental results show that the partial- persistence methodology provides the best overall performance, especially for transaction timeslice queries. For those applications that require ready, off-the-shelf, access methods the double-tree methodology is a good alternative. (Also cross-referenced as UMIACS-TR-97-24)Item Diamond-Tree: An Index Structure for High-Dimensionality Approximate Searching(1992) Faloutsos, Christos; Jagadish, H.V.; ISRA selection query applied to a database often has the selection predicate imperfectly specified. We present a technique, called the Diamond-tree, for indexing fields to perform similarity-based retrieval, given some applicable measures of approximation. Typically, the number of features (or dimensions of similarity) is large, so that the search space has a high-dimensionality, and most traditional methods perform poorly. As a test case, we show how the Diamond-tree technique can be used to perform retrievals based on incorrectly or approximately specified values for string fields. Experimental results show that our method can respond to approximately match queries by examining a small portion (1% - 5%) of the database.Item Efficient Retrieval of Similar Time Sequences Under Time Warping(1997) Yi, B.; Jagadish, H.V.; Faloutsos, Christos; Faloutsos, Christos; ISRFast similarity searching in large time-sequence databases has attracted a lot of research interest. All of them use the Euclidean distance ($L_2$), or some variation of $L_p$ metric. $L_p$ metrics lead to efficient indexing, thanks to feature extraction (e.g., by keeping the first few DFT coefficients) and subsequent use of fast spatial access methods for the points in feature space. In this work we examine a popular, field-tested dissimilarity function, the "time warping" distance function which permits local accelerations and decelerations in the rate of the signals or sequences. This function is natural and suitable for several applications, like matching of voice, audio and medical signals (e.g., electrocardiograms). However, from the indexing viewpoint it presents two major challenges: (a) it does not lead to any natural "features", precluding the use of spatial access methods (b) it is quadratic ($O(len_1 * len_2)$) on the length of the sequences involved. Here we show how to overcome both problems: for the former, we propose using a modification of the so-called "FastMap", to map sequences into points, trading off a tiny amount of "recall" (typically zero) for large gains in speed. For the latter, we provide a fast, linear test, to help us discard quickly many of the false alarms that FastMap will typically introduce. Using both ideas in cascade, our proposed method achieved up to 7.8-time speed-up over the straightforward sequential scanning, on both read and synthetic datasets.Item Estimating the Selectivity of Spatial Queries Using the `Correlation' Fractal Dimension(1998-10-15) Belussi, Alberto; Faloutsos, ChristosWe examine the estimation of selectivities for range and spatial join queries in real spatial databases. As we have shown earlier, real point sets: (a) violate consistently the "uniformity" and "independence" assumptions, (b) can often be described as "fractals", with non-integer (fractal) dimension. In this paper we show that, among the infinite family of fractal dimensions, the so called "Correlation Dimension" D2 is the one that we need to predict the selectivity of spatial join. The main contribution is that, for all the real and synthetic point-sets we tried, the average number of neighbors for a given point of the point-set follows a power law, with D2 as the exponent. This immediately solves the selectivity estimation for spatial joins, as well as for "biased" range queries (i.e., queries whose centers prefer areas of high point density). We present the formulas to estimate the selectivity for the biased queries, including an integration constant (Kshape) for each query shape. Finally, we show results on real and synthetic point sets, where our formulas achieve very low relative errors (typically about 10%, versus 40%-100% of the uniform assumption).Item Estimating the Selectivity of Spatial Queries Using the orrelation' Fractal Dimension(1995) Belussi, Alberto; Faloutsos, Christos; ISRWe examine the estimation of selectivities for range and spatial join queries in real spatial databases. As we have shown earlier [FK94a], real point sets: (a) violate consistently the ﲵniformity' and ndependence' assumptions, (b) can often be described as ﲦractals , with non-integer (fractal) dimension. In this paper we show that, among the infinite family of fractal dimensions, the so called ﲃorrelation Dimensions D2 is the one that we need to predict the selectivity of spatial join.The main contribution is that, for all the real and synthetic point- sets we tried, the average number of neighbors for a given point of the point-set follows a power law, with D2 as the exponent. This immediately solves the selectivity estimation for spatial joins, as well as for ﲢiased range queries (i.e., queries whose centers prefer areas of high point density).
We present the formulas to estimate the selectivity for the biased queries, including an integration constant (K hape' ) for each query shape. Finally, we show results on real and synthetic points sets, where our formulas achieve very low relative errors (typically about 10%, versus 40% - 100% of the uniform assumption).
Item Experimenting with Pattern Matching Algorithms(1994) Manolopoulos, Yannis; Faloutsos, Christos; ISRTwo new pattern matching algorithms based on the Boyer-Moore algorithm are presented. Their performance is compared to that of earlier relevant variants in terms of the number of character comparisons and the required running time by exhaustive simulation. Experimental results show the efficiency of both these two new algorithms.Item Expert Database Systems: Efficient Support for Engineering Environments.(1987) Sellis, T.; Roussopoulos, N.; Mark, Leo; Faloutsos, Christos; ISRManufacturing and Engineering processes use both large scale data and knowledge bases, and the use of expert systems in such environments has become a necessity. Expert Database Systems have evolved from conventional database systems to meet the requirements of current Artificial Intelligence applications. However, future Expert Database Systems will contain knowledge bases of significant size which makes main memory insufficient and the use of a database system a necessity. We propose an effective way of building High Performance Expert Database Systems to support manufacturing and engineering environments. These systems are based on Incremental Computation Models; such models utilize results of previous computations by merging them with newly derived results of computations on small increments representing changes in the database. Our system will be able to support very large knowledge bases by utilizing novel structures and acceas methods and by using a very sophisticated inference engine based on incremental computation models.Item Fast Map: A Fast Algorithms for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets(1994) Faloutsos, Christos; Lin, King-Ip D.; ISRA very promising idea for fast searching in traditional and multimedia databases is to map objects into points in k-d space, using k feature-extraction functions, provided by a domain expert [Jag91]. Thus, we can subsequently use highly fine-tuned spatial access methods (SAMs), to answer several types of queries, including the uery By Example' type (which translates to a range query); the ll pairs' query (which translates to a spatial join [BKSS94]); the nearest-neighbor or best match query, etc.However, designing feature extraction functions can be hard. It is relatively easier for a domain expert to assess the similarity/distance of two objects. Given only the distance information though, it is not obvious how to map objects into points.
This is exactly the topic of this paper. We describe a fast algorithm to map objects into points in some k- dimensional space ( k is user-defined), such that the dis-similarities are preserved. There are two benefits from this mapping: (a) efficient retrieval, in conjunction with a SAM, ad discussed before and (b) visualization and data-mining: the objects can now be plotted as points in 2-d or 3-d space, revealing potential clusters, correlations among attributes and other regularities that data-mining is looking for.
We introduce an older method from pattern recognition, namely, multi-Dimentional Scaling (MIDS) [Tor52]; although unsuitable for indexing, we use it as yardstick for our method. Then, we propose a much faster algorithm to solve the problem in hand, while in addition it allows for indexing. Experiments on real and synthetic data indeed show that the proposed algorithm is significantly faster than MIDS, (being linear, as opposed to quadratic, on the database size N), while it manages to preserve distances and the overall structure of the data-set.
Item Fast Nearest Neighbor Search in Medical Image Databases(1998-10-15) Korn, Flip; Sidiropoulos, Nikolaos; Faloutsos, Christos; Siegel, Eliot; Protopapas, ZenonWe examine the problem of finding similar tumor shapes. Starting from a natural similarity function (the so-called `max morpholog- ical distance'), we showed how to lower-bound it and how to search for nearest neighbors in large collections of tumor-like shapes. Specifically, we used state-of-the-art concepts from morphology, namely the `pattern spectrum' of a shape, to map each shape to a point in $n$-dimensional space. Following \cite{Faloutsos94Fast,Jagadish91Retrieval}, we organized the $n$-d points in an R-tree. We showed that the $L_infty$ (= max) norm in the $n$-d space lower-bounds the actual distance. This guarantees no false dismissals for range queries. In addition, we developed a nearest-neighbor algorithm that also guarantees no false dismissals. Finally, we implemented the method, and we tested it against a testbed of realistic tumor shapes, using an established tumor- growth model of Murray Eden \cite{Eden:61}. The experiments showed that our method is up to 27 times faster than straightfor- ward sequential scanning. (Also cross-referenced as UMIACS-TR-96-17)Item Fast Nearest Neighbor Search in Medical Image Databases(1996) Korn, Flip; Sidiropoulos, N.; Faloutsos, Christos; ISRWe examine the problem of finding similar tumor shapes. Starting from a natural similarity function (the so-called ax morphological distance'), we showed how to lower-bound it and how to search for nearest neighbors in large collections of tumor- like shapes.Specifically, we used state-of-the-art concepts from morphology, namely the attern spectrum' of a shape, to map each shape to a point in n-dimensional space. Following [19, 36], we organized the n-d points in an R-tree. We showed that the L (= max) norm in the n-d space lower-bounds the actual distance. This guarantees no false dismissals for range queries. In addition, we developed a nearest neighbor algorithm that also guarantees no false dismissals.
Finally, we implemented the method, and we tested it against a testbed of realistic tumor shapes, using an established tumor-growth model of Murray Eden [15]. The experiments showed that our method is up to 27 times faster than straightforward sequential scanning.
Item Fast Subsequence Matching in Time-Series Data bases(1993) Faloutsos, Christos; Ranganathan, M.; Manolopoulos, Yannis; ISRWe present an efficient indexing method to locate subsequences within a collection of sequences, such that the subsequences match a given (query) pattern within a specified tolerance. The idea is to map each data sequence into a small set of boxes in feature space. Then, these rectangles can be readily indexed using traditional spatial access methods, like the R*-tree [9]. More detailed, we use a sliding window over the data sequence and extract its features; the results is a trail in feature space. We propose an efficient and effective algorithm to divide such trail in sub-trails, which are subsequently represented by their Minimum Bounding Rectangles (MBRs). We also examine queries of varying lengths, and we show how to handle each case efficiently. We implemented our method and carried out experiments on synthetic and real data (stock price movements). We compared the method to sequential scanning, which is the only obvious competitor. The results were excellent: our method accelerated the search time from 3 times up to 100 times.Item Fast Subsequence Matching in Time-Series Databases(1998-10-15) Faloutsos, Christos; Ranganathan, M.; Manolopoulos, YannisWe present an efficient indexing method to locate 1-dimensional subsequences within a collection of sequences, such that the subsequences match a given (query) pattern within a specified tolerance. The idea is to map each data sequence into a small set of multidimensional rectangles in feature space. Then, these rectangles can be readily indexed using traditional spatial access methods, like the R*-tree \cite{Beckmann90R}. In more detail, we use a sliding window over the data sequence and extract its features; the result is a trail in feature space. We propose an efficient and effective algorithm to divide such trails into sub-trails, which are subsequently represented by their Minimum Bounding Rectangles (MBRs). We also examine queries of varying lengths, and we show how to handle each case efficiently. We implemented our method and carried out experiments on synthetic and real data (stock price movements). We compared the method to sequential scanning, which is the only obvious competitor. The results were excellent: our method accelerated the search time from 3 times up to 100 times. Appeared in ACM SIGMOD 1994, pp 419-429. Given "Best Paper award" (Also cross-referenced as UMIACS-TR-93-131)Item FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets(1998-10-15) Faloutsos, Christos; Lin, King-Ip (David)A very promising idea for fast searching in traditional and multimedia databases is to map objects into points in k-d space, using k feature-extraction functions, provided by a domain expert rJag91]. Thus. we can subsequently use highly fine-tuned spatia l access methods (SAMs), to answer several types of queries, including the 'Query By Example' type (which translates to a range query); the 'all pairs' query (which translates to a spatial join [BKSS94]); the nearest-neighbor or best-match query, etc. However, designing feature extraction functions can be hard. It is relatively easier for a domain expert to assess the similarity/distance of two objects. Given only the distance information though, it is not obvious how to map objects into points. This is exactly the topic of this paper. We describe a fast algorithm to map objects into points in some k-dimensional space (k is user-defined), such that the dissimilarities are preserved. There are two benefits from this mapping: (a) efficient retriev al, in conjunction with a SAM, as discussed before and (b) visualization and data-mining: the objects can now be plotted as points in 2-d or Sd space, revealing potential clusters, correlations among attributes and other regularities that data-mining is l ooking for. We introduce an older method from pattern recognition, namely, Multi-Dimcnsional Scaling (MDS) [Tor52]; although unsuitable for indexing, we use it as yardstick for our method. Then, we propose a much faster algorithm to solve the problem in hand, while in addition it allows for indexing. Experiments on real and synthetic data indeed show that the proposed algorithm is significantly faster than MDS, (being linear, as opposed to quadratic, on the database size N), while it manages to preserve distances an d the overall structure of the data-set. (Also cross-referenced as UMIACS-TR-94-132)