Constructing Inverted Files: To MapReduce or Not Revisited

dc.contributor.authorWei, Zheng
dc.contributor.authorJaJa, Joseph
dc.date.accessioned2012-01-30T04:09:42Z
dc.date.available2012-01-30T04:09:42Z
dc.date.issued2012-01-26
dc.description.abstractCurrent high-throughput algorithms for constructing inverted files all follow the MapReduce framework, which presents a high-level programming model that hides the complexities of parallel programming. In this paper, we take an alternative approach and develop a novel strategy that exploits the current and emerging architectures of multicore processors. Our algorithm is based on a high-throughput pipelined strategy that produces parallel parsed streams, which are immediately consumed at the same rate by parallel indexers. We have performed extensive tests of our algorithm on a cluster of 32 nodes, and were able to achieve a throughput close to the peak throughput of the I/O system: a throughput of 280 MB/s on a single node and a throughput that ranges between 5.15 GB/s (1 Gb/s Ethernet interconnect) and 6.12GB/s (10Gb/s InfiniBand interconnect) on a cluster with 32 nodes for processing the ClueWeb09 dataset. Such a performance represents a substantial gain over the best known MapReduce algorithms even when comparing the single node performance of our algorithm to MapReduce algorithms running on large clusters. Our results shed a light on the extent of the performance cost that may be incurred by using the simpler, higher-level MapReduce programming model for large scale applications.en_US
dc.identifier.urihttp://hdl.handle.net/1903/12171
dc.language.isoen_USen_US
dc.relation.ispartofseriesUMIACS;UMIACS-TR-2012-03
dc.titleConstructing Inverted Files: To MapReduce or Not Revisiteden_US
dc.typeTechnical Reporten_US

Files

Original bundle

Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
UMIACS-TR-2012-03.pdf
Size:
1.07 MB
Format:
Adobe Portable Document Format