Technical Reports from UMIACS

Permanent URI for this collectionhttp://hdl.handle.net/1903/7

Browse

Search Results

Now showing 1 - 5 of 5

An Effective Approach to Temporally Anchored Information Retrieval
(2012-08-17) Wei, Zheng; JaJa, Joseph
We consider in this paper the information retrieval problem over a collection of time-evolving documents such that the search has to be carried out based on a query text and a temporal specification. A solution to this problem is critical for a number of emerging large scale applications involving archived collections of web contents, social network interactions, blog traffic, and information feeds. Given a collection of time-evolving documents, we develop an effective strategy to create inverted files and indexing structures such that a temporally anchored query can be processed fast using similar strategies as in the non-temporal case. The inverted files generated have exactly the same structure as those generated for the classical (non-temporal) case, and the size of the additional indexing structures is shown to be small. Well-known previous algorithms for constructing inverted files or for computing relevance can be extended to handle the temporal case. Moreover, we present high throughput, scalable parallel algorithms to build the inverted files with the additional indexing structures on multicore processors and clusters of multicore processors. We illustrate the effectiveness of our approach through experimental tests on a number of web archives, and include a comparison of space used by the indexing structures and postings lists and search time between our approach and the traditional approach that ignores the temporal information.
Constructing Inverted Files: To MapReduce or Not Revisited
(2012-01-26) Wei, Zheng; JaJa, Joseph
Current high-throughput algorithms for constructing inverted files all follow the MapReduce framework, which presents a high-level programming model that hides the complexities of parallel programming. In this paper, we take an alternative approach and develop a novel strategy that exploits the current and emerging architectures of multicore processors. Our algorithm is based on a high-throughput pipelined strategy that produces parallel parsed streams, which are immediately consumed at the same rate by parallel indexers. We have performed extensive tests of our algorithm on a cluster of 32 nodes, and were able to achieve a throughput close to the peak throughput of the I/O system: a throughput of 280 MB/s on a single node and a throughput that ranges between 5.15 GB/s (1 Gb/s Ethernet interconnect) and 6.12GB/s (10Gb/s InfiniBand interconnect) on a cluster with 32 nodes for processing the ClueWeb09 dataset. Such a performance represents a substantial gain over the best known MapReduce algorithms even when comparing the single node performance of our algorithm to MapReduce algorithms running on large clusters. Our results shed a light on the extent of the performance cost that may be incurred by using the simpler, higher-level MapReduce programming model for large scale applications.
Constructing Inverted Files on a Cluster of Multicore Processors Near Peak I/O Throughput
(2011-03-03) Wei, Zheng; JaJa, Joseph
We develop a new strategy for processing a collection of documents on a cluster of multicore processors to build the inverted files at almost the peak I/O throughput of the underlying system. Our algorithm is based on a number of novel techniques including: (i) a high-throughput pipelined strategy that produces parallel parsed streams that are consumed at the same rate by parallel indexers; (ii) a hybrid trie and B-tree dictionary data structure that enables efficient parallel construction of the global dictionary; and (iii) a partitioning strategy of the work of the indexers using random sampling, which achieve extremely good load balancing with minimal communication overhead. We have performed extensive tests of our algorithm on a cluster of 32 nodes, each consisting of two Intel Xeon X5560 Quad-core, and were able to achieve a throughput close to the peak throughput of the I/O system. In particular, we achieve a throughput of 280 MB/s on a single node and a throughput of 6.12GB/s on a cluster with 32 nodes for processing the ClueWeb09 dataset. Similar results were obtained for widely different datasets. The throughput of our algorithm is superior to the best known algorithms reported in the literature even when compared to those running on much larger clusters.
Optimization of Linked List Prefix Computations on Multithreaded GPUs Using CUDA
(2010-07-13) Wei, Zheng; JaJa, Joseph
We present a number of optimization techniques to compute prefix sums on linked lists and implement them on multithreaded GPUs using CUDA. Prefix computations on linked structures involve in general highly irregular fine grain memory accesses that are typical of many computations on linked lists, trees, and graphs. While the current generation of GPUs provides substantial computational power and extremely high bandwidth memory accesses, they may appear at first to be primarily geared toward streamed, highly data parallel computations. In this paper, we introduce an optimized multithreaded GPU algorithm for prefix computations through a randomization process that reduces the problem to a large number of fine-grain computations. We map these fine-grain computations onto multithreaded GPUs in such a way that the processing cost per element is shown to be close to the best possible. Our experimental results show scalability for list sizes ranging from 1M nodes to 256M nodes, and significantly improve on the recently published parallel implementations of list ranking, including implementations on the Cell Processor, the MTA-8, and the NVIDIA GeForce 200 series. They also compare favorably to the performance of the best known CUDA algorithm for the scan operation on the Tesla C1060.
Effective Strategies for Temporally Anchored Information Retrieval
(2010-05-28) Song, Sangchul; JaJa, Joseph
A number of emerging large scale applications such as web archiving and time-stamped web objects generated through information feeds involve time-evolving objects that can be most effectively explored through search within a temporal context. We develop in this paper a new approach to handle the temporal text search of a time evolving collection of documents. Specifically, given a temporally anchored query, our method will return a ranked set of documents that were live during the query time span and the relevance scores are computed relative to the state of the collection as it existed during the query time span. Our approach introduces both a new indexing organization that substantially limits the search space and an effective methodology for computing the temporally anchored relevance scores. Moreover, we develop an analytical model that can be used to determine the temporal granularity of the indexing organization which minimizes the total number of postings examined during query evaluation. Our approach is validated through extensive empirical results generated using two very different and significant datasets.

Technical Reports from UMIACS

Browse

Filters

Settings

Sort By

Results per page

Search Results