Constructing Inverted Files on a Cluster of Multicore Processors Near Peak I/O Throughput

dc.contributor.author: Wei, Zheng
dc.contributor.author: JaJa, Joseph
dc.date.accessioned: 2011-03-07T04:39:46Z
dc.date.available: 2011-03-07T04:39:46Z
dc.date.issued: 2011-03-03
dc.description.abstract: We develop a new strategy for processing a collection of documents on a cluster of multicore processors to build the inverted files at almost the peak I/O throughput of the underlying system. Our algorithm is based on a number of novel techniques, including: (i) a high-throughput pipelined strategy that produces parallel parsed streams that are consumed at the same rate by parallel indexers; (ii) a hybrid trie and B-tree dictionary data structure that enables efficient parallel construction of the global dictionary; and (iii) a strategy for partitioning the work of the indexers using random sampling, which achieves extremely good load balancing with minimal communication overhead. We have performed extensive tests of our algorithm on a cluster of 32 nodes, each with two quad-core Intel Xeon X5560 processors, and were able to achieve a throughput close to the peak throughput of the I/O system. In particular, we achieve a throughput of 280 MB/s on a single node and a throughput of 6.12 GB/s on a cluster with 32 nodes for processing the ClueWeb09 dataset. Similar results were obtained for widely different datasets. The throughput of our algorithm is superior to that of the best known algorithms reported in the literature, even when compared to those running on much larger clusters.
dc.identifier.uri: http://hdl.handle.net/1903/11311
dc.language.iso: en_US
dc.relation.ispartofseries: UMIACS;UMIACS-TR-2011-03
dc.title: Constructing Inverted Files on a Cluster of Multicore Processors Near Peak I/O Throughput
dc.type: Technical Report
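
The abstract mentions partitioning the indexers' work by random sampling. The record does not describe the report's actual scheme, so the following is only a minimal sketch of the generic idea, assuming terms are assigned to indexers by lexicographic range and that range boundaries (splitters) are taken from the quantiles of a small random sample. The function names `choose_splitters` and `assign_indexer`, the sample size, and the synthetic terms are all hypothetical and chosen for illustration.

```python
import bisect
import random
from collections import Counter


def choose_splitters(terms, num_indexers, sample_size, seed=0):
    """Pick up to num_indexers - 1 splitter terms from a random sample.

    Hypothetical helper: evenly spaced quantiles of a sorted random sample
    approximate the quantiles of the full term stream.
    """
    rng = random.Random(seed)
    sample = sorted(rng.sample(terms, min(sample_size, len(terms))))
    if not sample:
        return []
    return [sample[(i * len(sample)) // num_indexers]
            for i in range(1, num_indexers)]


def assign_indexer(term, splitters):
    """Return the index of the indexer whose range contains the term."""
    return bisect.bisect_right(splitters, term)


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic term occurrences; a real collection would be far more skewed.
    terms = [f"term{rng.randrange(50_000):05d}" for _ in range(200_000)]
    splitters = choose_splitters(terms, num_indexers=8, sample_size=2_000)
    load = Counter(assign_indexer(t, splitters) for t in terms)
    for indexer in range(8):
        print(indexer, load[indexer])
```

Only the small sample has to be gathered and sorted in one place; each parser can then route a term with a single binary search over the splitters, which is one way such a scheme could keep communication overhead low.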

Files

Original bundle
Name: UMIACS-TR-2011-03.pdf
Size: 795.81 KB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.8 KB
Format: Item-specific license agreed upon to submission