Tech Reports in Computer Science and Engineering
Permanent URI for this community: http://hdl.handle.net/1903/5
The technical reports collections in this community are deposited by the library of the Computer Science Department. If you have questions about these collections, please contact the library staff at library@cs.umd.edu.
Item: An Effective Approach to Temporally Anchored Information Retrieval (2012-08-17)
Authors: Wei, Zheng; JaJa, Joseph

We consider the information retrieval problem over a collection of time-evolving documents, where the search must be carried out based on both a query text and a temporal specification. A solution to this problem is critical for a number of emerging large-scale applications involving archived collections of web contents, social network interactions, blog traffic, and information feeds. Given a collection of time-evolving documents, we develop an effective strategy for creating inverted files and indexing structures such that a temporally anchored query can be processed quickly, using strategies similar to those of the non-temporal case. The inverted files generated have exactly the same structure as those generated for the classical (non-temporal) case, and the size of the additional indexing structures is shown to be small. Well-known previous algorithms for constructing inverted files or for computing relevance can be extended to handle the temporal case. Moreover, we present high-throughput, scalable parallel algorithms to build the inverted files with the additional indexing structures on multicore processors and clusters of multicore processors. We illustrate the effectiveness of our approach through experimental tests on a number of web archives, including a comparison, against the traditional approach that ignores temporal information, of the space used by the indexing structures and postings lists and of the search time.

Item: Constructing Inverted Files: To MapReduce or Not Revisited (2012-01-26)
Authors: Wei, Zheng; JaJa, Joseph

Current high-throughput algorithms for constructing inverted files all follow the MapReduce framework, which presents a high-level programming model that hides the complexities of parallel programming. In this paper, we take an alternative approach and develop a novel strategy that exploits current and emerging architectures of multicore processors. Our algorithm is based on a high-throughput pipelined strategy that produces parallel parsed streams, which are immediately consumed at the same rate by parallel indexers. We have performed extensive tests of our algorithm on a cluster of 32 nodes and were able to achieve a throughput close to the peak throughput of the I/O system: 280 MB/s on a single node, and between 5.15 GB/s (1 Gb/s Ethernet interconnect) and 6.12 GB/s (10 Gb/s InfiniBand interconnect) on the 32-node cluster, for processing the ClueWeb09 dataset. Such performance represents a substantial gain over the best known MapReduce algorithms, even when comparing the single-node performance of our algorithm to MapReduce algorithms running on large clusters. Our results shed light on the extent of the performance cost that may be incurred by using the simpler, higher-level MapReduce programming model for large-scale applications.
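The pipelined parse/index strategy described in the item above can be conveyed with a small sketch. The Python toy below is not the report's implementation (which targets multicore clusters at near-peak I/O throughput); it only shows the shape of the idea: a parser emits a stream of (term, doc_id) pairs into bounded queues, and indexer threads drain the queues at the producers' rate to build partitioned postings lists. All names and parameters here (queue sizes, hash-based routing) are illustrative assumptions.

import threading, queue
from collections import defaultdict

NUM_INDEXERS = 2
queues = [queue.Queue(maxsize=1024) for _ in range(NUM_INDEXERS)]
postings = [defaultdict(list) for _ in range(NUM_INDEXERS)]

def parse(docs):
    # Producer: turn documents into a (term, doc_id) stream and route each
    # term to the indexer that owns its partition.
    for doc_id, text in docs:
        for term in text.split():
            queues[hash(term) % NUM_INDEXERS].put((term, doc_id))

def indexer(k):
    # Consumer: drain the stream into this partition's postings lists.
    while True:
        item = queues[k].get()
        if item is None:            # sentinel marks end of stream
            return
        term, doc_id = item
        postings[k][term].append(doc_id)

docs = [(1, "web archive index"), (2, "inverted index files")]
threads = [threading.Thread(target=indexer, args=(k,)) for k in range(NUM_INDEXERS)]
for t in threads:
    t.start()
parse(docs)                         # a real pipeline runs many parsers in parallel
for q in queues:
    q.put(None)
for t in threads:
    t.join()
print({k: dict(p) for k, p in enumerate(postings)})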
Item: Constructing Inverted Files on a Cluster of Multicore Processors Near Peak I/O Throughput (2011-03-03)
Authors: Wei, Zheng; JaJa, Joseph

We develop a new strategy for processing a collection of documents on a cluster of multicore processors to build the inverted files at almost the peak I/O throughput of the underlying system. Our algorithm is based on a number of novel techniques, including: (i) a high-throughput pipelined strategy that produces parallel parsed streams, which are consumed at the same rate by parallel indexers; (ii) a hybrid trie and B-tree dictionary data structure that enables efficient parallel construction of the global dictionary; and (iii) a partitioning of the indexers' work based on random sampling, which achieves extremely good load balancing with minimal communication overhead. We have performed extensive tests of our algorithm on a cluster of 32 nodes, each with two quad-core Intel Xeon X5560 processors, and were able to achieve a throughput close to the peak throughput of the I/O system. In particular, we achieve a throughput of 280 MB/s on a single node and of 6.12 GB/s on the 32-node cluster for processing the ClueWeb09 dataset. Similar results were obtained for widely different datasets. The throughput of our algorithm is superior to that of the best known algorithms reported in the literature, even when compared to those running on much larger clusters.

Item: Optimization of Linked List Prefix Computations on Multithreaded GPUs Using CUDA (2010-07-13)
Authors: Wei, Zheng; JaJa, Joseph

We present a number of optimization techniques for computing prefix sums on linked lists, and implement them on multithreaded GPUs using CUDA. Prefix computations on linked structures generally involve highly irregular fine-grain memory accesses, typical of many computations on linked lists, trees, and graphs. While current GPUs provide substantial computational power and extremely high-bandwidth memory access, they may appear at first to be geared primarily toward streamed, highly data-parallel computations. In this paper, we introduce an optimized multithreaded GPU algorithm for prefix computations based on a randomization process that reduces the problem to a large number of fine-grain computations. We map these fine-grain computations onto multithreaded GPUs in such a way that the processing cost per element is shown to be close to the best possible. Our experimental results show scalability for list sizes ranging from 1M to 256M nodes, and significantly improve on recently published parallel implementations of list ranking, including implementations on the Cell Processor, the MTA-8, and the NVIDIA GeForce 200 series. They also compare favorably to the performance of the best known CUDA algorithm for the scan operation on the Tesla C1060.
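For readers unfamiliar with list ranking, the toy below sequentially emulates parallel pointer jumping (Wyllie's algorithm), the textbook primitive behind prefix computations on linked lists: in each round, every node adds its successor's rank to its own and then jumps over that successor. This is a sketch for intuition only; the report's GPU algorithm instead uses a randomization step to split the list into fine-grain sublists, which this toy does not attempt.

def list_rank(next_node):
    """next_node[i] is the index of i's successor, or -1 at the tail.
    Returns rank[i] = number of links from node i to the tail."""
    n = len(next_node)
    rank = [0 if next_node[i] == -1 else 1 for i in range(n)]
    nxt = list(next_node)
    # Rank doubling: O(log n) rounds; each round is one parallel step.
    while any(j != -1 for j in nxt):
        new_rank, new_nxt = list(rank), list(nxt)
        for i in range(n):          # conceptually, all nodes at once
            if nxt[i] != -1:
                new_rank[i] = rank[i] + rank[nxt[i]]
                new_nxt[i] = nxt[nxt[i]]
        rank, nxt = new_rank, new_nxt
    return rank

# Example: the list 3 -> 0 -> 2 -> 1 yields ranks [2, 0, 1, 3].
print(list_rank([2, -1, 1, 0]))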
Item: Effective Strategies for Temporally Anchored Information Retrieval (2010-05-28)
Authors: Song, Sangchul; JaJa, Joseph

A number of emerging large-scale applications, such as web archiving and information feeds that generate time-stamped web objects, involve time-evolving objects that are most effectively explored through search within a temporal context. In this paper, we develop a new approach to the temporal text search of a time-evolving collection of documents. Specifically, given a temporally anchored query, our method returns a ranked set of documents that were live during the query time span, with relevance scores computed relative to the state of the collection as it existed during that time span. Our approach introduces both a new indexing organization that substantially limits the search space and an effective methodology for computing the temporally anchored relevance scores. Moreover, we develop an analytical model that can be used to determine the temporal granularity of the indexing organization that minimizes the total number of postings examined during query evaluation. Our approach is validated through extensive empirical results generated using two very different and significant datasets.

Item: Archiving Temporal Web Information: Organization of Web Contents for Fast Access and Compact Storage (2008-04-07)
Authors: Song, Sangchul; JaJa, Joseph

We address the problem of archiving dynamic web contents over significant time spans. Current schemes crawl the web contents at regular time intervals and archive them after each crawl, regardless of whether or not the contents have changed between consecutive crawls. Our goal is to store newly crawled web contents only when they differ from those of the previous crawl, while ensuring accurate and quick retrieval of archived contents based on arbitrary temporal queries over the archived time period. In this paper, we develop a scheme that stores unique temporal web contents in containers following the widely used ARC/WARC format, and that provides quick access to the archived contents for arbitrary temporal queries. A novel component of our scheme is a new indexing structure based on the concept of persistent (or multi-version) data structures. Our scheme can be shown to be asymptotically optimal in both storage utilization and insert/retrieval time. We illustrate the performance of our method on two very different datasets from the Stanford WebBase project, the first reflecting very dynamic web contents and the second relatively static web contents. The experimental results clearly illustrate the substantial storage savings achieved by eliminating duplicate contents detected between consecutive crawls, as well as the speed at which our method can find archived contents specified through arbitrary temporal queries.

Item: Web Archiving: Organizing Web Objects into Web Containers to Optimize Access (2007-10-09)
Authors: Song, Sangchul; JaJa, Joseph

The web is becoming the preferred medium for communicating and storing information pertaining to almost any human activity. However, it is an ephemeral medium whose contents are constantly changing, resulting in the permanent loss of part of our cultural and scientific heritage on a regular basis. Archiving important web contents is a very challenging technical problem due to the web's tremendous scale, complex structure, extremely dynamic nature, and rich, heterogeneous, deep contents. In this paper, we consider the problem of archiving a linked set of web objects into web containers in such a way as to minimize the number of containers accessed during a typical browsing session. We develop a method that makes use of the notion of PageRank and optimized graph partitioning to enable faster browsing of archived web contents. We include simulation results that illustrate the performance of our scheme and compare it to the scheme commonly used to organize web objects into web containers.
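As a rough illustration of the idea in the item above, the sketch below computes PageRank by power iteration and then greedily grows containers outward from high-rank seed pages, so that pages likely to be visited together in a browsing session land in the same container. The greedy BFS packing is a stand-in assumption for exposition, not the optimized graph partitioning used in the paper.

def pagerank(adj, iters=50, d=0.85):
    # Power iteration over a dict: node -> list of out-links.
    n = len(adj)
    pr = {u: 1.0 / n for u in adj}
    for _ in range(iters):
        new = {u: (1.0 - d) / n for u in adj}
        for u, outs in adj.items():
            if outs:
                share = d * pr[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:                   # dangling page: spread rank uniformly
                for v in adj:
                    new[v] += d * pr[u] / n
        pr = new
    return pr

def pack_containers(adj, capacity):
    # Seed containers at high-PageRank pages and grow them along links, so
    # pages reached together during browsing tend to share a container.
    pr = pagerank(adj)
    unassigned = set(adj)
    containers = []
    for seed in sorted(adj, key=pr.get, reverse=True):
        if seed not in unassigned:
            continue
        box, frontier = [], [seed]
        while frontier and len(box) < capacity:
            u = frontier.pop(0)
            if u in unassigned:
                unassigned.remove(u)
                box.append(u)
                frontier.extend(v for v in adj[u] if v in unassigned)
        containers.append(box)
    return containers

web = {"home": ["a", "b"], "a": ["b"], "b": ["home", "c"], "c": []}
print(pack_containers(web, capacity=2))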
Item: Techniques to Audit and Certify the Long Term Integrity of Digital Archives (2007-08-14)
Authors: Song, Sangchul; JaJa, Joseph

A large portion of the government, business, cultural, and scientific digital data being created today needs to be archived and preserved for future use over periods ranging from a few years to decades, and sometimes centuries. A fundamental requirement for a long-term archive is to set up mechanisms that will ensure the authenticity of its holdings. In this paper, we develop a new methodology to address the integrity of long-term archives using rigorous cryptographic techniques. Our approach involves the generation of a small integrity token for each digital object to be archived, together with cryptographic summary information based on all the objects handled within a dynamic time period. We present a framework that enables continuous auditing of the holdings of the archive, as well as auditing upon access, depending on the policy set by the archive. Moreover, an independent auditor will be able to verify the integrity of every version of an archived digital object, as well as link the current version to the original form of the object when it was ingested into the archive. Using this approach, a prototype system called ACE (Auditing Control Environment) has been built and tested. ACE is scalable and cost-effective, and is completely independent of the archive's underlying architecture.

Item: ACE: A Novel Software Platform to Ensure the Integrity of Long Term Archives (2007-01-31)
Authors: Song, Sangchul; JaJa, Joseph

We develop a new methodology to address the integrity of long-term archives using rigorous cryptographic techniques. A prototype system called ACE (Auditing Control Environment) was designed and developed based on this methodology. ACE creates a small integrity token for each digital object, along with cryptographic summary information based on all the objects handled within a dynamic time period. ACE continuously audits the contents of the various objects according to the policy set by the archive, and provides mechanisms for an independent third-party auditor to certify the integrity of any object. In fact, our approach allows an independent auditor to verify the integrity of every version of an archived digital object, as well as link the current version to the original form of the object when it was ingested into the archive. We show that ACE is very cost-effective and scalable, while making no assumptions about the archive architecture. We include in this paper some preliminary results on the validation and performance of ACE on a large image collection.
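The token-plus-summary design described above can be sketched as follows: each archived object receives a hash token, and the tokens issued during a time window are folded into a single cryptographic summary, here a Merkle root. This is a minimal sketch under assumed details (SHA-256, last-node duplication); it is not ACE's actual code or token format.

import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    # Fold a list of hashes into one summary; duplicate the last node at
    # odd-sized levels, as many Merkle-tree variants do.
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# One audit round: a token per archived object, one summary per round.
objects = [b"object-1 bytes", b"object-2 bytes", b"object-3 bytes"]
tokens = [h(obj) for obj in objects]
round_summary = merkle_root(tokens)

# An auditor later re-reads an object and checks it against its token; the
# published round summary ties all tokens together, so a token cannot be
# silently rewritten without changing the summary.
assert h(objects[0]) == tokens[0]
print(round_summary.hex())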
Item: A Novel Information-Aware Octree for the Visualization of Large Scale Time-Varying Data (2006-04-20)
Authors: Kim, Jusub; JaJa, Joseph

Large-scale scientific simulations increasingly generate very large datasets that present substantial challenges to current visualization systems. In this paper, we develop a new scalable and efficient scheme for the visual exploration of 4-D isosurfaces of time-varying data by rendering the 3-D isosurfaces obtained through an arbitrary axis-parallel hyperplane cut. The new scheme is based on: (i) a new 4-D hierarchical indexing structure, called the Information-Aware Octree; (ii) a controllable delayed-fetching technique; and (iii) an optimized data layout. Together, these techniques enable efficient and scalable out-of-core visualization of large-scale time-varying datasets. We introduce an entropy-based dimension-integration technique by which the relative resolutions of the spatial and temporal dimensions are established, and use this information to design a compact 4-D hierarchical indexing structure. We also present scalable and efficient techniques for out-of-core rendering. Compared with previous algorithms for constructing 4-D isosurfaces, our scheme is substantially faster and requires much less memory. Compared to the Temporal Branch-On-Need octree (T-BON), which can only handle a subset of our queries, our indexing structure is an order of magnitude smaller and at least as effective in dealing with the queries that the T-BON can handle. We have tested our scheme on two large time-varying datasets and obtained very good performance for a wide range of isosurface extraction queries, using indexing structures an order of magnitude smaller than those of previous techniques. In particular, we can generate isosurfaces at intermediate time steps very quickly.
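The entropy-based dimension integration mentioned above can be illustrated with a toy computation: estimate the Shannon entropy of value differences along the temporal and spatial axes, and give the higher-entropy dimension finer resolution in the 4-D index. Everything below (the synthetic field, the bin count) is an assumption for exposition, not the paper's procedure.

import math
from collections import Counter

def entropy(samples, bins=16):
    # Shannon entropy of a histogram of the samples, in bits.
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / bins or 1.0
    hist = Counter(min(int((s - lo) / width), bins - 1) for s in samples)
    n = len(samples)
    return sum((c / n) * math.log2(n / c) for c in hist.values())

# Synthetic field that varies sharply in time and smoothly in space, so the
# temporal axis carries more information and deserves finer resolution.
field = [[math.sin(0.5 * t) + 0.01 * x for t in range(64)] for x in range(64)]
dt = [abs(field[x][t + 1] - field[x][t]) for x in range(64) for t in range(63)]
dx = [abs(field[x + 1][t] - field[x][t]) for x in range(63) for t in range(64)]
print("temporal entropy:", entropy(dt), "spatial entropy:", entropy(dx))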