Multi-Stage Search Architectures for Streaming Documents
dc.contributor.advisor | Lin, Jimmy | en_US |
dc.contributor.author | Asadi, Nima | en_US |
dc.contributor.department | Computer Science | en_US |
dc.contributor.publisher | Digital Repository at the University of Maryland | en_US |
dc.contributor.publisher | University of Maryland (College Park, Md.) | en_US |
dc.date.accessioned | 2013-10-02T05:31:06Z | |
dc.date.available | 2013-10-02T05:31:06Z | |
dc.date.issued | 2013 | en_US |
dc.description.abstract | The web is becoming more dynamic due to the increasing engagement and contribution of Internet users in the age of social media. A more dynamic web presents new challenges for web search--an important application of Information Retrieval (IR). A stream of new documents constantly flows into the web at a high rate, adding to the old content. In many cases, documents quickly lose their relevance. In these time-sensitive environments, finding relevant content in response to user queries requires a real-time search service; immediate availability of content for search and a fast ranking, which requires an optimized search architecture. These aspects of today's web are at odds with how academic IR researchers have traditionally viewed the web, as a collection of static documents. Moreover, search architectures have received little attention in the IR literature. Therefore, academic IR research, for the most part, does not provide a mechanism to efficiently handle a high-velocity stream of documents, nor does it facilitate real-time ranking. This dissertation addresses the aforementioned shortcomings. We present an efficient mech- anism to index a stream of documents, thereby enabling immediate availability of content. Our indexer works entirely in main memory and provides a mechanism to control inverted list con- tiguity, thereby enabling faster retrieval. Additionally, we consider document ranking with a machine-learned model, dubbed "Learning to Rank" (LTR), and introduce a novel multi-stage search architecture that enables fast retrieval and allows for more design flexibility. The stages of our architecture include candidate generation (top k retrieval), feature extraction, and docu- ment re-ranking. We compare this architecture with a traditional monolithic architecture where candidate generation and feature extraction occur together. As we lay out our architecture, we present optimizations to each stage to facilitate low-latency ranking. These optimizations include a fast approximate top k retrieval algorithm, document vectors for feature extraction, architecture- conscious implementations of tree ensembles for LTR using predication and vectorization, and algorithms to train tree-based LTR models that are fast to evaluate. We also study the efficiency- effectiveness tradeoffs of these techniques, and empirically evaluate our end-to-end architecture on microblog document collections. We show that our techniques improve efficiency without degrading quality. | en_US |
dc.identifier.uri | http://hdl.handle.net/1903/14443 | |
dc.subject.pqcontrolled | Computer science | en_US |
dc.subject.pquncontrolled | Information Retrieval | en_US |
dc.subject.pquncontrolled | Learning-to-Rank | en_US |
dc.subject.pquncontrolled | Scalability and Efficiency | en_US |
dc.subject.pquncontrolled | Search Architectures | en_US |
dc.subject.pquncontrolled | Streaming Documents | en_US |
dc.title | Multi-Stage Search Architectures for Streaming Documents | en_US |
dc.type | Dissertation | en_US |
Files
Original bundle
1 - 1 of 1