Multi-Stage Search Architectures for Streaming Documents

Asadi, Nima

Multi-Stage Search Architectures for Streaming Documents

dc.contributor.advisor	Lin, Jimmy	en_US
dc.contributor.author	Asadi, Nima	en_US
dc.contributor.department	Computer Science	en_US
dc.contributor.publisher	Digital Repository at the University of Maryland	en_US
dc.contributor.publisher	University of Maryland (College Park, Md.)	en_US
dc.date.accessioned	2013-10-02T05:31:06Z
dc.date.available	2013-10-02T05:31:06Z
dc.date.issued	2013	en_US
dc.description.abstract	The web is becoming more dynamic due to the increasing engagement and contribution of Internet users in the age of social media. A more dynamic web presents new challenges for web search--an important application of Information Retrieval (IR). A stream of new documents constantly flows into the web at a high rate, adding to the old content. In many cases, documents quickly lose their relevance. In these time-sensitive environments, finding relevant content in response to user queries requires a real-time search service; immediate availability of content for search and a fast ranking, which requires an optimized search architecture. These aspects of today's web are at odds with how academic IR researchers have traditionally viewed the web, as a collection of static documents. Moreover, search architectures have received little attention in the IR literature. Therefore, academic IR research, for the most part, does not provide a mechanism to efficiently handle a high-velocity stream of documents, nor does it facilitate real-time ranking. This dissertation addresses the aforementioned shortcomings. We present an efficient mech- anism to index a stream of documents, thereby enabling immediate availability of content. Our indexer works entirely in main memory and provides a mechanism to control inverted list con- tiguity, thereby enabling faster retrieval. Additionally, we consider document ranking with a machine-learned model, dubbed "Learning to Rank" (LTR), and introduce a novel multi-stage search architecture that enables fast retrieval and allows for more design flexibility. The stages of our architecture include candidate generation (top k retrieval), feature extraction, and docu- ment re-ranking. We compare this architecture with a traditional monolithic architecture where candidate generation and feature extraction occur together. As we lay out our architecture, we present optimizations to each stage to facilitate low-latency ranking. These optimizations include a fast approximate top k retrieval algorithm, document vectors for feature extraction, architecture- conscious implementations of tree ensembles for LTR using predication and vectorization, and algorithms to train tree-based LTR models that are fast to evaluate. We also study the efficiency- effectiveness tradeoffs of these techniques, and empirically evaluate our end-to-end architecture on microblog document collections. We show that our techniques improve efficiency without degrading quality.	en_US
dc.identifier.uri	http://hdl.handle.net/1903/14443
dc.subject.pqcontrolled	Computer science	en_US
dc.subject.pquncontrolled	Information Retrieval	en_US
dc.subject.pquncontrolled	Learning-to-Rank	en_US
dc.subject.pquncontrolled	Scalability and Efficiency	en_US
dc.subject.pquncontrolled	Search Architectures	en_US
dc.subject.pquncontrolled	Streaming Documents	en_US
dc.title	Multi-Stage Search Architectures for Streaming Documents	en_US
dc.type	Dissertation	en_US

Files

Original bundle

Now showing 1 - 1 of 1

Name:: Asadi_umd_0117E_14340.pdf
Size:: 1.37 MB
Format:: Adobe Portable Document Format

Download

Collections

UMD Theses and Dissertations
Computer Science Theses and Dissertations