Institute for Systems Research

Permanent URI for this communityhttp://hdl.handle.net/1903/4375

Browse

Search Results

Now showing 1 - 10 of 10
  • Thumbnail Image
    Item
    Fast Nearest Neighbor Search in Medical Image Databases
    (1996) Korn, Flip; Sidiropoulos, N.; Faloutsos, Christos; ISR
    We examine the problem of finding similar tumor shapes. Starting from a natural similarity function (the so-called ax morphological distance'), we showed how to lower-bound it and how to search for nearest neighbors in large collections of tumor- like shapes.

    Specifically, we used state-of-the-art concepts from morphology, namely the attern spectrum' of a shape, to map each shape to a point in n-dimensional space. Following [19, 36], we organized the n-d points in an R-tree. We showed that the L (= max) norm in the n-d space lower-bounds the actual distance. This guarantees no false dismissals for range queries. In addition, we developed a nearest neighbor algorithm that also guarantees no false dismissals.

    Finally, we implemented the method, and we tested it against a testbed of realistic tumor shapes, using an established tumor-growth model of Murray Eden [15]. The experiments showed that our method is up to 27 times faster than straightforward sequential scanning.

  • Thumbnail Image
    Item
    Similarity Searching in Large Image DataBases
    (1994) Petrakis, E.G.M.; Faloutsos, Christos; ISR
    We propose a method to handle approximate searching by image content in large image databases. Image content is represented by attributed relational graphs holding features of objects and relationships between objects. The method relies on the assumption that a fixed number of ﲬabeled or ﲥxpected objects (e.g. ﲨeart lungs etc.) are common in all images of a given application domain in addition to a variable number of ﲵnexpected or ﲵnlabeled objects (e.g. ﲴumor , hematoma etc.). The method can answer queries by example such as ﲦind all X-rays that are similar to Smith's X-ray . The stored images are mapped to points in a multidimentional space ad are indexed using state- of-the-art database methods (R-trees). The proposed method has several desirable desirable properties: (a) Database search is approximate so that all images up to a pre-specified degree of similarity (tolerance) are retrieved, (b) it has no ﲦalse dismissals (i.e., all images qualifying query selection criteria are retrieved) and (c) it scales-up well as the database grows. We implemented the method and ran experiments on a database of synthetic (but realistic) medical images. The experiments showed that our method significantly outperforms sequential scanning by up to an order of magnitude.
  • Thumbnail Image
    Item
    Fast Map: A Fast Algorithms for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets
    (1994) Faloutsos, Christos; Lin, King-Ip D.; ISR
    A very promising idea for fast searching in traditional and multimedia databases is to map objects into points in k-d space, using k feature-extraction functions, provided by a domain expert [Jag91]. Thus, we can subsequently use highly fine-tuned spatial access methods (SAMs), to answer several types of queries, including the uery By Example' type (which translates to a range query); the ll pairs' query (which translates to a spatial join [BKSS94]); the nearest-neighbor or best match query, etc.

    However, designing feature extraction functions can be hard. It is relatively easier for a domain expert to assess the similarity/distance of two objects. Given only the distance information though, it is not obvious how to map objects into points.

    This is exactly the topic of this paper. We describe a fast algorithm to map objects into points in some k- dimensional space ( k is user-defined), such that the dis-similarities are preserved. There are two benefits from this mapping: (a) efficient retrieval, in conjunction with a SAM, ad discussed before and (b) visualization and data-mining: the objects can now be plotted as points in 2-d or 3-d space, revealing potential clusters, correlations among attributes and other regularities that data-mining is looking for.

    We introduce an older method from pattern recognition, namely, multi-Dimentional Scaling (MIDS) [Tor52]; although unsuitable for indexing, we use it as yardstick for our method. Then, we propose a much faster algorithm to solve the problem in hand, while in addition it allows for indexing. Experiments on real and synthetic data indeed show that the proposed algorithm is significantly faster than MIDS, (being linear, as opposed to quadratic, on the database size N), while it manages to preserve distances and the overall structure of the data-set.

  • Thumbnail Image
    Item
    The TV-tree -- an Index Structure for High-Dimensional Data
    (1994) Lin, King-Ip D.; Jagadish, H.V.; Faloutsos, Christos; ISR
    We propose a file structure to index high-dimensionality data, typically, points in some feature space. The idea is to use only a few of the features, utilizing additional features whenever the additional discriminatory power is absolutely necessary. We present in detail the design of our tree structure and the associated algorithms that handle such 'varying length' feature vectors. Finally we report simulation results, comparing the proposed structure with the R*-tree, which is one of the most successful methods for low-dimensionality spaces. The results illustrate the superiority of our method, with up to 80% savings in disk accesses.
  • Thumbnail Image
    Item
    Declustering R-Tree on Multi-Computer Architectures
    (1994) Koudas, N.; Faloutsos, Christos; Kamel, Ibrahim; ISR
    We study a method to decluster a spatial access method (and specifically an R-tree) on a shared-nothing multi-computer architecture [9]. Our first step is to propose a software architecture, with the top levels of the R-tree on the 'master'server' and the leaf nodes distributed across the servers. Nest, we study the optimal capacity of leaf nodes, or 'chunk size'. We express the response time on range queries as a function of the 'chunk size', and we show how to optimize it. This formula assumes that the 'chunks' are perfectly declustered. We propose to use the Hilbert curve to achieve such a good declustering.

    Finally, we implemented our method on a network of workstations and we compared the experimental and the theoretical results. The conclusions are that (a) our formula for the response time is accurate (the maximum relative error was 30%; the typical error was in the vicinity of 10-15%) (b) the Hilbert-based declustering consistently outperforms a random declustering (c) most importantly, although the optimal chunk size depends on several factors (database size, size of the query, speed of the network), a safe choice for it is 1 page (whichever is the page size of the operating system). We show analytically and experimentally that a chunk size of 1 page gives either optimal or close to optimal results, for a wide range of the parameters.

  • Thumbnail Image
    Item
    Fast Subsequence Matching in Time-Series Data bases
    (1993) Faloutsos, Christos; Ranganathan, M.; Manolopoulos, Yannis; ISR
    We present an efficient indexing method to locate subsequences within a collection of sequences, such that the subsequences match a given (query) pattern within a specified tolerance. The idea is to map each data sequence into a small set of boxes in feature space. Then, these rectangles can be readily indexed using traditional spatial access methods, like the R*-tree [9]. More detailed, we use a sliding window over the data sequence and extract its features; the results is a trail in feature space. We propose an efficient and effective algorithm to divide such trail in sub-trails, which are subsequently represented by their Minimum Bounding Rectangles (MBRs). We also examine queries of varying lengths, and we show how to handle each case efficiently. We implemented our method and carried out experiments on synthetic and real data (stock price movements). We compared the method to sequential scanning, which is the only obvious competitor. The results were excellent: our method accelerated the search time from 3 times up to 100 times.
  • Thumbnail Image
    Item
    Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension
    (1993) Faloutsos, Christos; Kamel, Ibrahim; ISR
    We propose the concept of fractal dimension of a set of points, in order to quantify the deviation from the uniformity distribution. Using measurements on real data sets (road intersections of U.S. counties, star coordinates from NASA's Infrared-Ultraviolet Explorer etc.) we provide evidence that real data indeed are skewed, and, moreover, we show that they behave as mathematical fractals, with a measurable, non-integer fractal dimension.

    Armed with this tool, we then show its practical use in predicting the performance of spatial access methods, and specifically of the R-trees. We provide the first analysis of R- trees for skewed distributions of points: We develop a formula that estimates the number of disk accesses for range queries, given only the fractal dimension of the point set, and its count. Experiments on real data sets show that the formula is very accurate: the relative error is usually below 5%, and it rarely exceeds 10%.

    We believe that the fractal dimension will help replace the uniformity and independence assumptions, allowing more accurate analysis for any spatial access method, as well as better estimates for query optimization on multi-attribute queries.

  • Thumbnail Image
    Item
    Hilbert R-Tree: An Improved R-Tree Using Fractals
    (1993) Kamel, Ibrahim; Faloutsos, Christos; ISR
    We propose a new R-tree structure that outperforms all the older ones. The heart of the idea is to impose a linear ordering on the data rectangles. This ordering has to be 'good', in the sense that it should cluster 'similar' data rectangles together, to minimize the area and perimeter of the resulting minimum bounding rectangles (MBRs). Among the orderings we tried, the '2D-c' method, the one that uses the (2d) hilbert value of the center of the rectangles, gives the best results.

    For a static database, the proposed ordering achieves superior packing, outperforming older packing methods [25], and the best dynamic method (R*-trees [3]). The savings are as high as 36% on real data.

    more importantly, we introduce a dynamic variation, the Hilbert R- tree: : Given the ordering, every node has a well- defined set of sibling nodes; thus, we can deploy the deferred splitting algorithms of the B* -trees. By adjusting the split policy, the Hilbert R-tree can achieve as high utilization as desired. We show that a '3-to-4' split policy achieves good results, consistently outperforming the R* -trees, with up to 28% savings on real data.

  • Thumbnail Image
    Item
    Packed R-trees Using Fractals
    (1993) Faloutsos, Christos; Kamel, Ibrahim; ISR
    We propose a new packing technique for R-trees for static databases. Given a collection of rectangles, we sort them and we build the R-tree bottom-up. There are several ways to sort the rectangles; the innovation of this work is the use of fractals, and specifically the hilbert curve, to achieve better ordering of the rectangles and eventually better packing. We proposed and implemented several variations and performed experiments on synthetic, as well as real data ( a TIGER file from the U.S. Bureau of Census). The winning variation ('2D-c') was the one that sorts the rectangles according to the hilbert value of the center. This variation consistently outperforms the R-trees [3], Guttman's R-trees [13], as well as the packing method of Roussoupoulos and Leifker [24]. The performance gain of the our method seems to increase with the skeweness of the data distribution; specifically, on the (highly skewed) TIGER dataset, it achieves up to 36% improvement in response time over the best know R-tree variant and up to 58% over the older packing algorithm.
  • Thumbnail Image
    Item
    Diamond-Tree: An Index Structure for High-Dimensionality Approximate Searching
    (1992) Faloutsos, Christos; Jagadish, H.V.; ISR
    A selection query applied to a database often has the selection predicate imperfectly specified. We present a technique, called the Diamond-tree, for indexing fields to perform similarity-based retrieval, given some applicable measures of approximation. Typically, the number of features (or dimensions of similarity) is large, so that the search space has a high-dimensionality, and most traditional methods perform poorly. As a test case, we show how the Diamond-tree technique can be used to perform retrievals based on incorrectly or approximately specified values for string fields. Experimental results show that our method can respond to approximately match queries by examining a small portion (1% - 5%) of the database.