Browsing by Author "JaJa, Joseph"
Now showing 1 - 20 of 37
Item: ACE: A Novel Software Platform to Ensure the Integrity of Long Term Archives (2007-01-31)
Song, Sangchul; JaJa, Joseph

We develop a new methodology to address the integrity of long term archives using rigorous cryptographic techniques. A prototype system called ACE (Auditing Control Environment) was designed and developed based on this methodology. ACE creates a small-size integrity token for each digital object, together with some cryptographic summary information based on all the objects handled within a dynamic time period. ACE continuously audits the contents of the various objects according to the policy set by the archive, and provides mechanisms for an independent third-party auditor to certify the integrity of any object. In fact, our approach allows an independent auditor to verify the integrity of every version of an archived digital object, as well as to link the current version to the original form of the object when it was ingested into the archive. We show that ACE is very cost effective and scalable while making no assumptions about the archive architecture. We include in this paper some preliminary results on the validation and performance of ACE on a large image collection.

Item: Archiving Temporal Web Information: Organization of Web Contents for Fast Access and Compact Storage (2008-04-07)
Song, Sangchul; JaJa, Joseph

We address the problem of archiving dynamic web contents over significant time spans. Current schemes crawl the web contents at regular time intervals and archive the contents after each crawl, regardless of whether or not the contents have changed between consecutive crawls. Our goal is to store newly crawled web contents only when they differ from the previous crawl, while ensuring accurate and quick retrieval of archived contents based on arbitrary temporal queries over the archived time period. In this paper, we develop a scheme that stores unique temporal web contents in containers following the widely used ARC/WARC format, and that provides quick access to the archived contents for arbitrary temporal queries. A novel component of our scheme is the use of a new indexing structure based on the concept of persistent or multi-version data structures. Our scheme can be shown to be asymptotically optimal in both storage utilization and insert/retrieval time. We illustrate the performance of our method on two very different data sets from the Stanford WebBase project, the first reflecting very dynamic web contents and the second relatively static web contents. The experimental results clearly illustrate the substantial storage savings achieved by eliminating duplicate contents detected between consecutive crawls, as well as the speed at which our method can find the archived contents specified through arbitrary temporal queries.
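The storage-and-lookup idea behind this archiving scheme can be sketched compactly. The toy Python below (hypothetical names and layout, not the paper's implementation) keeps one version list per URL, stores a new entry only when the content digest changes between crawls, and resolves a temporal query by binary search over the stored crawl times:

```python
from bisect import bisect_right

class TemporalArchiveIndex:
    """Toy multi-version index: one chronological version list per URL,
    extended only when the crawled content actually changed."""

    def __init__(self):
        self.versions = {}   # url -> list of (timestamp, digest, location)

    def ingest(self, url, timestamp, digest, location):
        # Skip the new crawl if its digest matches the previous version.
        history = self.versions.setdefault(url, [])
        if history and history[-1][1] == digest:
            return False                     # duplicate content, not stored
        history.append((timestamp, digest, location))
        return True

    def lookup(self, url, query_time):
        # Version live at query_time: latest entry with timestamp <= query_time.
        history = self.versions.get(url, [])
        i = bisect_right([t for t, _, _ in history], query_time)
        return history[i - 1] if i else None
```

Here `location` would stand for a (container, offset) pair in an ARC/WARC container; the paper's persistent multi-version structure supports the same queries with the asymptotically optimal space and time bounds claimed above.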
Item: The Block Distributed Memory Model (1998-10-15)
JaJa, Joseph; Ryu, Kwan Woo

We introduce a computation model for developing and analyzing parallel algorithms on distributed memory machines. The model allows the design of algorithms using a single address space and does not assume any particular interconnection topology. We capture performance by incorporating a cost measure for interprocessor communication induced by remote memory accesses. The cost measure includes parameters reflecting memory latency, communication bandwidth, and spatial locality. Our model allows the initial placement of the input data and pipelined prefetching.

We use our model to develop parallel algorithms for various data rearrangement problems, load balancing, sorting, FFT, and matrix multiplication. We show that most of these algorithms achieve optimal or near optimal communication complexity while simultaneously guaranteeing an optimal speed-up in computational complexity. (Also cross-referenced as UMIACS-TR-94-5.)

Item: Constructing Inverted Files on a Cluster of Multicore Processors Near Peak I/O Throughput (2011-03-03)
Wei, Zheng; JaJa, Joseph

We develop a new strategy for processing a collection of documents on a cluster of multicore processors to build the inverted files at almost the peak I/O throughput of the underlying system. Our algorithm is based on a number of novel techniques, including: (i) a high-throughput pipelined strategy that produces parallel parsed streams that are consumed at the same rate by parallel indexers; (ii) a hybrid trie and B-tree dictionary data structure that enables efficient parallel construction of the global dictionary; and (iii) a partitioning strategy for the work of the indexers, using random sampling, which achieves extremely good load balancing with minimal communication overhead. We have performed extensive tests of our algorithm on a cluster of 32 nodes, each consisting of two quad-core Intel Xeon X5560 processors, and were able to achieve a throughput close to the peak throughput of the I/O system. In particular, we achieve a throughput of 280 MB/s on a single node and a throughput of 6.12 GB/s on a cluster with 32 nodes for processing the ClueWeb09 dataset. Similar results were obtained for widely different datasets. The throughput of our algorithm is superior to that of the best known algorithms reported in the literature, even when compared to those running on much larger clusters.

Item: Constructing Inverted Files: To MapReduce or Not Revisited (2012-01-26)
Wei, Zheng; JaJa, Joseph

Current high-throughput algorithms for constructing inverted files all follow the MapReduce framework, which presents a high-level programming model that hides the complexities of parallel programming. In this paper, we take an alternative approach and develop a novel strategy that exploits the current and emerging architectures of multicore processors. Our algorithm is based on a high-throughput pipelined strategy that produces parallel parsed streams, which are immediately consumed at the same rate by parallel indexers. We have performed extensive tests of our algorithm on a cluster of 32 nodes, and were able to achieve a throughput close to the peak throughput of the I/O system: a throughput of 280 MB/s on a single node and a throughput that ranges between 5.15 GB/s (1 Gb/s Ethernet interconnect) and 6.12 GB/s (10 Gb/s InfiniBand interconnect) on a cluster with 32 nodes for processing the ClueWeb09 dataset. Such performance represents a substantial gain over the best known MapReduce algorithms, even when comparing the single node performance of our algorithm to MapReduce algorithms running on large clusters. Our results shed light on the extent of the performance cost that may be incurred by using the simpler, higher-level MapReduce programming model for large scale applications.
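The pipelined strategy shared by the two inverted-file papers above can be illustrated with a minimal single-node producer/consumer sketch (hypothetical structure, not the authors' code): a parser thread emits parsed document streams into a bounded queue, and an indexer thread consumes them at the same rate to build a partial inverted file.

```python
import threading, queue
from collections import defaultdict

def parser(docs, out_q):
    # Producer: tokenize documents and emit (doc_id, terms) records.
    for doc_id, text in docs:
        out_q.put((doc_id, text.lower().split()))
    out_q.put(None)                          # end-of-stream marker

def indexer(in_q, partial_index):
    # Consumer: extend the partial inverted file as parsed records arrive.
    while (rec := in_q.get()) is not None:
        doc_id, terms = rec
        for pos, term in enumerate(terms):
            partial_index[term].append((doc_id, pos))

docs = [(0, "parallel parsed streams"), (1, "parallel indexers consume streams")]
q = queue.Queue(maxsize=1024)                # bounded queue keeps stages in step
index = defaultdict(list)
t1 = threading.Thread(target=parser, args=(docs, q))
t2 = threading.Thread(target=indexer, args=(q, index))
t1.start(); t2.start(); t1.join(); t2.join()
print(index["streams"])                      # [(0, 2), (1, 3)]
```

In the papers' setting there are many parser and indexer threads per node, the global dictionary is the hybrid trie/B-tree structure, and indexers are assigned term ranges chosen by random sampling; the bounded queue is what keeps the producing and consuming stages running at the same rate.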
Item: Designing Practical Efficient Algorithms for Symmetric Multiprocessors (1998-10-15)
Helman, David R.; JaJa, Joseph

Symmetric multiprocessors (SMPs) dominate the high-end server market and are currently the primary candidate for constructing large scale multiprocessor systems. Yet, the design of efficient parallel algorithms for this platform currently poses several challenges. In this paper, we present a computational model for designing efficient algorithms for symmetric multiprocessors. We then use this model to create efficient solutions to two widely different types of problems: linked list prefix computations and generalized sorting. Our novel algorithm for prefix computations builds upon the sparse ruling set approach of Reid-Miller and Blelloch. Besides being somewhat simpler and requiring nearly half the number of memory accesses, we can bound our complexity with high probability instead of merely on average. Our algorithm for generalized sorting is a modification of our algorithm for sorting by regular sampling on distributed memory architectures. The algorithm is a stable sort which appears to be asymptotically faster than any of the published algorithms for SMPs. Both of our algorithms were implemented in C using POSIX threads and run on three symmetric multiprocessors: the DEC AlphaServer, the Silicon Graphics Power Challenge, and the HP-Convex Exemplar. We ran our code for each algorithm using a variety of benchmarks which we identified to examine the dependence of our algorithms on memory access patterns. In spite of the fact that the processors must compete for access to main memory, both algorithms still yielded scalable performance up to 16 processors, which was the largest platform available to us. For some problems, our prefix computation algorithm actually matched or exceeded the performance of the best sequential solution using only a single thread. Similarly, our generalized sorting algorithm always beat the performance of sequential merge sort by at least an order of magnitude, even with a single thread. (Also cross-referenced as UMIACS-TR-98-44)
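Sorting by regular sampling, which the generalized sorting algorithm above adapts to SMPs, proceeds in a few simple phases. The sketch below is a sequential Python simulation under simplifying assumptions (one block per would-be processor, n > p^2); it illustrates the technique, not the paper's implementation:

```python
import bisect, heapq

def sample_sort(a, p=4):
    """Sequential sketch of sorting by regular sampling: each block,
    and later each bucket, would be handled by its own thread."""
    n = len(a)
    if n <= p * p:
        return sorted(a)
    # Phase 1: split into p blocks and sort each one locally.
    size = (n + p - 1) // p
    blocks = [sorted(a[i * size:(i + 1) * size]) for i in range(p)]
    # Phase 2: take p regular samples per block; sort the p*p samples.
    samples = sorted(b[(len(b) * j) // p] for b in blocks for j in range(p))
    # Phase 3: pick p-1 splitters at regular positions among the samples.
    splitters = [samples[(j + 1) * len(samples) // p - 1] for j in range(p - 1)]
    # Phase 4: partition every block by the splitters (the all-to-all
    # exchange) and merge the sorted runs landing in each bucket.
    buckets = [[] for _ in range(p)]
    for b in blocks:
        cuts = [0] + [bisect.bisect_right(b, s) for s in splitters] + [len(b)]
        for j in range(p):
            buckets[j].append(b[cuts[j]:cuts[j + 1]])
    return [x for j in range(p) for x in heapq.merge(*buckets[j])]

data = [5, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3, 2, 3, 8, 4]
assert sample_sort(data) == sorted(data)
```

Regular sampling guarantees that no bucket receives much more than n/p elements, which is the source of the load-balancing behavior cited both in this item and in the deterministic sorting paper listed further below.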
Item: An Effective Approach to Temporally Anchored Information Retrieval (2012-08-17)
Wei, Zheng; JaJa, Joseph

We consider in this paper the information retrieval problem over a collection of time-evolving documents, where the search has to be carried out based on a query text and a temporal specification. A solution to this problem is critical for a number of emerging large scale applications involving archived collections of web contents, social network interactions, blog traffic, and information feeds. Given a collection of time-evolving documents, we develop an effective strategy to create inverted files and indexing structures such that a temporally anchored query can be processed fast using strategies similar to those of the non-temporal case. The inverted files generated have exactly the same structure as those generated for the classical (non-temporal) case, and the size of the additional indexing structures is shown to be small. Well-known previous algorithms for constructing inverted files or for computing relevance can be extended to handle the temporal case. Moreover, we present high throughput, scalable parallel algorithms to build the inverted files with the additional indexing structures on multicore processors and clusters of multicore processors.

We illustrate the effectiveness of our approach through experimental tests on a number of web archives, comparing our approach with the traditional approach that ignores temporal information in terms of the space used by the indexing structures and postings lists, as well as search time.

Item: Effective Strategies for Temporally Anchored Information Retrieval (2010-05-28)
Song, Sangchul; JaJa, Joseph

A number of emerging large scale applications, such as web archiving and time-stamped web objects generated through information feeds, involve time-evolving objects that can be most effectively explored through search within a temporal context. We develop in this paper a new approach to handle the temporal text search of a time-evolving collection of documents. Specifically, given a temporally anchored query, our method returns a ranked set of documents that were live during the query time span, with relevance scores computed relative to the state of the collection as it existed during the query time span. Our approach introduces both a new indexing organization that substantially limits the search space and an effective methodology for computing the temporally anchored relevance scores. Moreover, we develop an analytical model that can be used to determine the temporal granularity of the indexing organization which minimizes the total number of postings examined during query evaluation. Our approach is validated through extensive empirical results generated using two very different and significant datasets.
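The notion of a temporally anchored query used in the two items above can be made concrete with a small sketch (hypothetical data layout, not the papers' index): scoring is restricted to documents whose live interval overlaps the query span, and idf is computed against the collection as it existed then.

```python
import math
from collections import defaultdict

postings = defaultdict(list)   # term -> [(doc_id, term_frequency)]
live = {}                      # doc_id -> (start_time, end_time)

def temporally_anchored_search(terms, t0, t1):
    # Documents live at some point during [t0, t1].
    alive = {d for d, (s, e) in live.items() if s <= t1 and e >= t0}
    scores = defaultdict(float)
    for term in terms:
        plist = [(d, tf) for d, tf in postings[term] if d in alive]
        if not plist:
            continue
        # idf relative to the state of the collection during the span.
        idf = math.log(len(alive) / len(plist))
        for d, tf in plist:
            scores[d] += tf * idf
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

This naive version scans whole postings lists; the indexing organizations described above limit the search to postings from the relevant temporal partitions, which is where the reported space and query-time savings come from.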
Item: Efficient Algorithms for Atmospheric Correction of Remotely Sensed Data (1998-10-15)
Fallah-Adl, Hassan; JaJa, Joseph; Liang, Shunlin; Kaufman, Yoram J.; Townshend, John

Remotely sensed imagery has been used for developing and validating various studies regarding land cover dynamics, such as global carbon modeling, biogeochemical cycling, hydrological modeling, and ecosystem response modeling. However, the large amounts of imagery collected by the satellites are largely contaminated by the effects of atmospheric particles through absorption and scattering of the radiation from the earth surface. The objective of atmospheric correction is to retrieve the surface reflectance (which characterizes the surface properties) from remotely sensed imagery by removing the atmospheric effects. Atmospheric correction has been shown to significantly improve the accuracy of image classification. (Also cross-referenced as UMIACS-TR-95-53)

Item: Efficient Isosurface Extraction for Large Scale Time-Varying Data Using the Persistent Hyperoctree (PHOT) (2006-01-13)
Shi, Qingmin; JaJa, Joseph

We introduce the Persistent HyperOcTree (PHOT) to handle the 4D isocontouring problem for large scale time-varying data sets. This novel data structure is provably space efficient and optimal in retrieving active cells. More importantly, the set of active cells for any possible isovalue is already organized in a Compact Hyperoctree, which enables very efficient slicing of the isocontour along spatial and temporal dimensions. Experimental results based on the very large Richtmyer-Meshkov instability data set demonstrate the effectiveness of our approach.

This technique can also be used for other isosurfacing schemes, such as view-dependent isosurfacing and ray-tracing, which will benefit from the inherent hierarchical structure associated with the active cells.

Item: Fast Algorithms for 3-D Dominance Reporting and Counting (2003-02-05)
Shi, Qingmin; JaJa, Joseph

We present in this paper fast algorithms for the 3-D dominance reporting and counting problems, and generalize the results to the d-dimensional case. Our 3-D dominance reporting algorithm achieves $O(\log n/\log\log n + f)$ query time using $O(n\log^{\epsilon}n)$ space, where $f$ is the number of points satisfying the query and $\epsilon>0$ is an arbitrarily small constant. For the 3-D dominance counting problem (which is equivalent to the 3-D range counting problem), our algorithm runs in $O((\log n/\log\log n)^2)$ time using $O(n\log^{1+\epsilon}n/\log\log n)$ space. (Also UMIACS-TR-2003-06)

Item: Fast Algorithms for Estimating Aerosol Optical Depth and Correcting Thematic Mapper Imagery (1998-10-15)
Fallah-Adl, Hassan; JaJa, Joseph; Liang, Shunlin

Remotely sensed images collected by satellites are usually contaminated by the effects of atmospheric particles through absorption and scattering of the radiation from the earth surface. The objective of atmospheric correction is to retrieve the surface reflectance from remotely sensed imagery by removing the atmospheric effects, which is usually performed in two steps. First, the optical characteristics of the atmosphere are estimated, and then the remotely sensed imagery is corrected by inversion procedures that derive the surface reflectance. In this paper we introduce an efficient algorithm to estimate the optical characteristics of Thematic Mapper (TM) imagery and to remove the atmospheric effects from it. Our algorithm introduces a set of techniques to significantly improve the quality of the retrieved images. We pay particular attention to the computational efficiency of the algorithm, thereby allowing us to correct large TM images quite fast. We also provide a parallel implementation of our algorithm and show its portability and scalability on several parallel machines. (Also cross-referenced as UMIACS-TR-95-113)

Item: Fast Fractional Cascading and Its Applications (2003-08-01)
Shi, Qingmin; JaJa, Joseph

Using the notions of Q-heaps and fusion trees developed by Fredman and Willard, we develop a faster version of the fractional cascading technique while maintaining the linear space structure. The new version enables sublogarithmic iterative search in the case when we have a search tree and the degree of each node is bounded by $O(\log^{\epsilon}n)$, for some constant $\epsilon >0$, where $n$ is the total size of all the lists stored in the tree. The fast fractional cascading technique is used in combination with other techniques to derive sublogarithmic time algorithms for the geometric retrieval problems of orthogonal segment intersection and rectangular point enclosure. The new algorithms use $O(n)$ space and achieve a query time of $O(\log n/\log\log n + f)$, where $f$ is the number of objects satisfying the query. All our algorithms assume the version of the RAM model used by Fredman and Willard. (UMIACS-TR-2003-71)
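The paper above accelerates fractional cascading using Q-heaps and fusion trees; the classical technique it builds on is easy to sketch. Given k sorted lists, each augmented list merges its own elements with every other element of the next augmented list, so that after one binary search in the first list each remaining list is searched in O(1) amortized time. The plain-Python sketch below shows the textbook structure (not the paper's fast variant):

```python
import bisect

INF = float("inf")

def build(lists):
    """aug[i] merges lists[i] with every other element of aug[i+1].
    native_succ[j]: smallest element of the original lists[i] at
    position >= j.  bridge[j]: position of a same-or-larger value
    inside aug[i+1].  Input lists must be sorted."""
    k = len(lists)
    aug, prev_vals = [None] * k, []
    for i in range(k - 1, -1, -1):
        entries = sorted([(v, True) for v in lists[i]] +
                         [(v, False) for v in prev_vals[::2]])
        entries.append((INF, False))                     # sentinel
        vals = [v for v, _ in entries]
        native_succ, last = [None] * len(entries), None
        for j in range(len(entries) - 1, -1, -1):
            if entries[j][1]:
                last = entries[j][0]
            native_succ[j] = last
        bridge = ([bisect.bisect_left(prev_vals, v) for v in vals]
                  if i < k - 1 else None)
        aug[i] = (vals, native_succ, bridge)
        prev_vals = vals
    return aug

def successors(aug, x):
    """Successor of x in every list: one binary search in aug[0],
    then O(1) amortized work per later list via bridge pointers."""
    out, j = [], bisect.bisect_left(aug[0][0], x)
    for i, (vals, native_succ, bridge) in enumerate(aug):
        out.append(native_succ[j])            # None if x exceeds the list
        if bridge is not None:
            j = bridge[j]
            nxt = aug[i + 1][0]
            while j > 0 and nxt[j - 1] >= x:  # at most one step back
                j -= 1
    return out

aug = build([[1, 5, 9], [2, 5, 8], [0, 3, 6, 9]])
print(successors(aug, 4))                     # [5, 5, 6]
```

Because every other element of each list is copied one level up, a bridge pointer can overshoot the true successor by at most one position, which is what keeps the per-list work constant after the initial search.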
Item: High Performance Algorithms for Global BRDF Retrieval (1998-10-15)
Zhang, Zengyan; Kalluri, Satya; JaJa, Joseph; Liang, Shunlin; Townshend, John

Most land cover types are "anisotropic"; that is, the solar radiation reflected by the surface is not uniform in all directions. Characterizing the Bidirectional Reflectance Distribution Function (BRDF) of the earth's surface is critical in understanding surface anisotropy. Though there are several methods to retrieve the BRDF of various land cover types, most of them have been applied to small data sets collected either on the ground or from aircraft, at limited spatial and temporal scales. In this paper, we use the multi-angular, multi-temporal and multi-band Pathfinder AVHRR Land (PAL) data set to retrieve the global BRDF in the red and near infrared wavelengths. The PAL data set used in our study has a spatial resolution of 8 km and consists of 10-day composite data for four years (1983 to 1986). In particular, we develop high performance algorithms to retrieve the global BRDF using three widely different models. Given the volume of data involved (about 27 GBytes), we attempt to optimize the I/O time as well as minimize the overall computational complexity. Our algorithms access the global data once, followed by a redistribution of land pixel data to balance the computational loads among the different nodes of a multiprocessor system. This strategy results in an optimized I/O access time with efficiently balanced computations across the nodes. Experimental data on a 16-node IBM SP2 is used to support these claims and to illustrate the scalability of our algorithms. (Also cross-referenced as UMIACS-TR-97-55)

Item: High Performance Computing Algorithms for Land Cover Dynamics Using Remote Sensing Data (1999-02-10)
Kalluri, Satya; JaJa, Joseph; Bader, David A.; Zhang, Zengyan; Townshend, John; Fallah-Adl, Hassan

Global and regional land cover studies require the ability to apply complex models to selected subsets of large amounts of multi-sensor and multi-temporal data sets that have been derived from raw instrument measurements using widely accepted pre-processing algorithms. The computational and storage requirements of most such studies far exceed what is possible in a single workstation environment. We have been pursuing a new approach that couples scalable and open distributed heterogeneous hardware with the development of high performance software for processing, indexing, and organizing remotely sensed data. Hierarchical data management tools are used to ingest raw data, create metadata, and organize the archived data so as to automatically achieve computational load balancing among the available nodes and minimize I/O overheads. We illustrate our approach with four specific examples. The first is the development of the first fast operational scheme for the atmospheric correction of Landsat TM scenes, while the second example focuses on image segmentation using a novel hierarchical connected components algorithm. Retrieval of the global BRDF (Bidirectional Reflectance Distribution Function) in the red and near infrared wavelengths using four years (1983 to 1986) of the Pathfinder AVHRR Land (PAL) data set is the focus of our third example. The fourth example is the development of a hierarchical data organization scheme that allows on-demand processing and retrieval of regional and global AVHRR data sets. Our results show that substantial improvements in computational times can be achieved by using high performance computing technology. (Also cross-referenced as UMIACS-TR-98-18)
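Per pixel, BRDF retrieval of the kind described in the two items above reduces to a small independent model-fitting problem, which is why the computation balances so well once land pixels are redistributed across nodes. Purely as an illustration, the sketch below fits a generic linear kernel-driven BRDF model by least squares; the three models actually used in the paper are not specified here, and all names and numbers are hypothetical:

```python
import numpy as np

def fit_kernel_brdf(reflectance, k_vol, k_geo):
    """Fit r = f_iso + f_vol * K_vol + f_geo * K_geo by least squares,
    where the kernel values K_vol and K_geo are derived from the
    sun/view geometry of each multi-angular observation."""
    X = np.column_stack([np.ones_like(k_vol), k_vol, k_geo])
    coef, *_ = np.linalg.lstsq(X, reflectance, rcond=None)
    return coef                              # (f_iso, f_vol, f_geo)

# Hypothetical multi-angular observations of a single 8-km pixel:
rho   = np.array([0.21, 0.25, 0.19, 0.23, 0.27])
k_vol = np.array([0.10, 0.35, -0.05, 0.22, 0.41])
k_geo = np.array([-1.2, -0.8, -1.5, -1.0, -0.6])
f_iso, f_vol, f_geo = fit_kernel_brdf(rho, k_vol, k_geo)
```

Since each pixel's fit is independent, distributing land pixels evenly across nodes, as the papers' redistribution step does, balances the computation after a single pass over the input data.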
Item: A New Deterministic Parallel Sorting Algorithm With an Experimental Evaluation (1998-10-15)
Helman, David R.; JaJa, Joseph; Bader, David A.

We introduce a new deterministic parallel sorting algorithm based on the regular sampling approach. The algorithm uses only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead. Moreover, unlike previous variations, our algorithm efficiently handles the presence of duplicate values without the overhead of tagging each element with a unique identifier. The algorithm was implemented in Split-C and run on the IBM SP-2-WN and the Cray Research T3D. We ran our code using widely different benchmarks to examine the dependence of our algorithm on the input distribution. Our experimental results illustrate the efficiency and scalability of our algorithm across different platforms. In fact, its performance compares closely to that of our random sample sort algorithm, which seems to outperform all similar algorithms known to the authors on these platforms. Together, their performance is nearly invariant over the set of input distributions, unlike previous efficient algorithms. However, unlike our randomized sorting algorithm, the performance and memory requirements of our regular sorting algorithm can be deterministically guaranteed. (Also cross-referenced as UMIACS-TR-96-54)

Item: A New Framework for Addressing Temporal Range Queries and Some Preliminary Results (2003-02-05)
Shi, Qingmin; JaJa, Joseph

Given a set of $n$ objects, each characterized by $d$ attributes specified at $m$ fixed time instances, we are interested in the problem of designing space efficient indexing structures such that arbitrary temporal range search queries can be handled efficiently. When $m=1$, our problem reduces to the $d$-dimensional orthogonal search problem. We establish efficient data structures to handle several classes of the general problem. Our results include a linear size data structure that enables a query time of $O(\log n\log m/\log\log n + f)$ for one-sided queries when $d=1$, where $f$ is the number of objects satisfying the query. A similar result is shown for counting queries. We also show that the most general problem can be solved with a polylogarithmic query time using nonlinear space data structures. (Also UMIACS-TR-2003-08)
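The temporal range search problem above is easy to state concretely. For $d=1$, a naive baseline keeps one sorted array per time instance; a one-sided query ("report the objects whose attribute is at least $x$ at time $t$") then takes $O(\log n + f)$ time but $O(nm)$ space, and the paper's contribution is achieving comparable query time from linear-size structures. A sketch of the baseline (hypothetical layout, for problem illustration only):

```python
from bisect import bisect_left

class OneSidedTemporalIndex:
    """Naive baseline: one sorted column per time instance.
    Space O(n*m); query time O(log n + f)."""

    def __init__(self, values):
        # values[obj][t] gives object obj's attribute at time instance t.
        self.by_time = [
            sorted((vals[t], obj) for obj, vals in enumerate(values))
            for t in range(len(values[0]))
        ]

    def report(self, t, x):
        # All objects whose attribute at time instance t is >= x.
        col = self.by_time[t]
        i = bisect_left(col, (x, -1))
        return [obj for _, obj in col[i:]]

idx = OneSidedTemporalIndex([[3, 7, 2], [5, 1, 9], [4, 4, 4]])
print(idx.report(1, 4))                      # [2, 0]
```

The $m$-fold space blowup of this baseline is exactly what the persistence-style structures in the paper are designed to avoid.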
Item: A Novel Information-Aware Octree for the Visualization of Large Scale Time-Varying Data (2006-04-20)
Kim, Jusub; JaJa, Joseph

Large scale scientific simulations are increasingly generating very large data sets that present substantial challenges to current visualization systems. In this paper, we develop a new scalable and efficient scheme for the visual exploration of 4-D isosurfaces of time-varying data by rendering the 3-D isosurfaces obtained through an arbitrary axis-parallel hyperplane cut. The new scheme is based on: (i) a new 4-D hierarchical indexing structure, called the Information-Aware Octree; (ii) a controllable delayed fetching technique; and (iii) an optimized data layout. Together, these techniques enable efficient and scalable out-of-core visualization of large scale time-varying data sets. We introduce an entropy-based dimension integration technique by which the relative resolutions of the spatial and temporal dimensions are established, and use this information to design a compact 4-D hierarchical indexing structure.

We also present scalable and efficient techniques for out-of-core rendering. Compared with previous algorithms for constructing 4-D isosurfaces, our scheme is substantially faster and requires much less memory. Compared to the Temporal Branch-On-Need octree (T-BON), which can only handle a subset of our queries, our indexing structure is an order of magnitude smaller and is at least as effective in dealing with the queries that the T-BON can handle. We have tested our scheme on two large time-varying data sets and obtained very good performance for a wide range of isosurface extraction queries, using indexing structures an order of magnitude smaller than those of previous techniques. In particular, we can generate isosurfaces at intermediate time steps very quickly.

Item: An O(n)-Space O(log n/log log n + f)-Query Time Algorithm for 3-D Dominance Reporting (2003-08-01)
Shi, Qingmin; JaJa, Joseph

We present a linear-space algorithm for handling the {\em three-dimensional dominance reporting problem}: given a set $S$ of $n$ three-dimensional points, design a data structure for $S$ so that the points in $S$ which dominate a given query point can be reported quickly. Under the variation of the RAM model introduced by Fredman and Willard~\cite{Fredman94}, our algorithm achieves $O(\log n/\log\log n+f)$ query time, where $f$ is the number of points reported. Extensions to higher dimensions are also reported. (UMIACS-TR-2003-77)

Item: An On-line Variable Length Binary Encoding (1998-10-15)
Acharya, Tinku; JaJa, Joseph

We present a methodology for on-line variable-length binary encoding of a set of integers. The basic principle of this methodology is to maintain the prefix property among the codes assigned on-line to a set of integers growing dynamically. The prefix property enables unique decoding of a string of elements from this set. To show the utility of this on-line variable-length binary encoding, we apply the methodology to encode the LZW codes. Application of this encoding scheme significantly improves the compression achieved by the standard LZW scheme. The encoding can also be applied in other compression schemes to encode pointers using variable-length binary codes. (Also cross-referenced as UMIACS-TR-95-39)
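The LZW application mentioned in the last item can be illustrated with a related, simpler device: letting the emitted code width grow on-line as the dictionary grows, so each code uses only as many bits as the current dictionary size requires. The sketch below uses this uniform growing width; the paper's scheme instead maintains the prefix property, so that codes of different lengths can coexist and still be decoded uniquely. It illustrates on-line growth in code length, not the authors' exact encoding:

```python
def lzw_variable_width(data: bytes) -> str:
    """LZW compression where each code is written with
    ceil(log2(dictionary size)) bits at the moment it is emitted.
    A decoder can mirror the widths, since dictionary growth is
    deterministic.  Returns the output as a bit string."""
    dictionary = {bytes([i]): i for i in range(256)}
    bits, w = [], b""
    for byte in data:
        wc = w + bytes([byte])
        if wc in dictionary:
            w = wc
            continue
        width = (len(dictionary) - 1).bit_length()
        bits.append(format(dictionary[w], f"0{width}b"))
        dictionary[wc] = len(dictionary)     # assign the next free code
        w = bytes([byte])
    if w:
        width = (len(dictionary) - 1).bit_length()
        bits.append(format(dictionary[w], f"0{width}b"))
    return "".join(bits)

encoded = lzw_variable_width(b"abababab")
print(len(encoded))                          # 44 bits vs. 60 with fixed 12-bit codes
```

Starting at 8 bits for the initial 256-entry dictionary and widening only when the dictionary outgrows the current width yields shorter output than fixed-width codes, which is the effect the paper quantifies for its prefix-preserving scheme.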