Multi-Stage Search Architectures for Streaming Documents

dc.contributor.advisorLin, Jimmyen_US
dc.contributor.authorAsadi, Nimaen_US
dc.contributor.departmentComputer Scienceen_US
dc.contributor.publisherDigital Repository at the University of Marylanden_US
dc.contributor.publisherUniversity of Maryland (College Park, Md.)en_US
dc.description.abstractThe web is becoming more dynamic due to the increasing engagement and contribution of Internet users in the age of social media. A more dynamic web presents new challenges for web search--an important application of Information Retrieval (IR). A stream of new documents constantly flows into the web at a high rate, adding to the old content. In many cases, documents quickly lose their relevance. In these time-sensitive environments, finding relevant content in response to user queries requires a real-time search service; immediate availability of content for search and a fast ranking, which requires an optimized search architecture. These aspects of today's web are at odds with how academic IR researchers have traditionally viewed the web, as a collection of static documents. Moreover, search architectures have received little attention in the IR literature. Therefore, academic IR research, for the most part, does not provide a mechanism to efficiently handle a high-velocity stream of documents, nor does it facilitate real-time ranking. This dissertation addresses the aforementioned shortcomings. We present an efficient mech- anism to index a stream of documents, thereby enabling immediate availability of content. Our indexer works entirely in main memory and provides a mechanism to control inverted list con- tiguity, thereby enabling faster retrieval. Additionally, we consider document ranking with a machine-learned model, dubbed "Learning to Rank" (LTR), and introduce a novel multi-stage search architecture that enables fast retrieval and allows for more design flexibility. The stages of our architecture include candidate generation (top k retrieval), feature extraction, and docu- ment re-ranking. We compare this architecture with a traditional monolithic architecture where candidate generation and feature extraction occur together. As we lay out our architecture, we present optimizations to each stage to facilitate low-latency ranking. These optimizations include a fast approximate top k retrieval algorithm, document vectors for feature extraction, architecture- conscious implementations of tree ensembles for LTR using predication and vectorization, and algorithms to train tree-based LTR models that are fast to evaluate. We also study the efficiency- effectiveness tradeoffs of these techniques, and empirically evaluate our end-to-end architecture on microblog document collections. We show that our techniques improve efficiency without degrading quality.en_US
dc.subject.pqcontrolledComputer scienceen_US
dc.subject.pquncontrolledInformation Retrievalen_US
dc.subject.pquncontrolledScalability and Efficiencyen_US
dc.subject.pquncontrolledSearch Architecturesen_US
dc.subject.pquncontrolledStreaming Documentsen_US
dc.titleMulti-Stage Search Architectures for Streaming Documentsen_US


Original bundle
Now showing 1 - 1 of 1
Thumbnail Image
1.37 MB
Adobe Portable Document Format