Selective Search Architectures and Brute Force Scan Techniques for Summarizing Social Media Posts

Wang, Yulu

Selective Search Architectures and Brute Force Scan Techniques for Summarizing Social Media Posts

Files

Wang_umd_0117E_18754.pdf (2.05 MB)

No. of downloads: 154

Date

2017

Authors

Wang, Yulu

Advisor

Lin, Jimmy

DRUM DOI

https://doi.org/10.13016/M2GF0N05R

Abstract

There is an increasing trend of social media usage in recent years and users desire a search system that can provide relevant content related to their information needs. However, the nature of social posts presents challenges for existing search functions. Firstly, as a stream of documents is constantly arriving at a high speed, it is required that the search system provides a real-time service that can guarantee new content be available for search immediately after it arrives. Secondly, given that the data volume contains substantive redundant information, the search engine should be able to detect redundant content and eliminate it before returning a summary of relevant documents to users. In reality, it is unlikely that users desire a list of documents containing redundant information.

Previous work that supports real-time search uses an inverted index as the core of the search architecture. However, building/updating indexes in real-time is complex. Additionally, an inverted index is a non-regular data structure, and query evaluation results in data access patterns that are at odds with modern processor architectures. Where the summarization of social media posts is concerned, although plenty of work has been done in this field, there lacks a validated evaluation methodology to measure such systems’ effectiveness.

To achieve real-time search for relevant documents, in this work, we first exploit selective search architecture. Selective search divides collections based on document similarity into multiple partitions (shards), builds inverted index on each of them, and selects the partitions that are estimated to contain relevant documents for retrieval. For real-time service, our selective search architecture divides document stream into temporal segments at different granularities and performs either batch or online clustering algorithms on them. We also take advantage of word embeddings to reduce the dimensionality of document vectors for clustering to reduce computational cost. Results show that by applying this selective search technique, we are able to achieve a level of precision statistically indistinguishable from an exhaustive search while providing substantially higher search efficiency.

For query evaluation on the unpartitioned documents, we abandon inverted index and present a simple and fast indexing process based on brute force scans technique that is to search over array representations of the documents. A final result list is then generated by merging the relevant documents with those from selected partitions. We show that our brute force scan techniques outperform the traditional search engines built on inverted indexes in terms of both query latency and throughput by exploiting parallelism when the collection size is under a few million.

To summarize relevant documents, we first develop an evaluation framework, then demonstrate that the evaluation metrics proposed in this framework are capable of stably capturing user preferences in a social posts summarization system. With the evaluation metrics as a guideline, we build a system that is based on convolutional neural networks to compute document similarity and exploits single-pass clustering to perform the summarization task.

Selective search architecture, brute force scan techniques, evaluation framework and summarization implementation comprise our real-time search system, that can retrieve relevant documents and summarize the results in response to a query. Although our system is designed to specially focus on Twitter data, we believe the techniques exploited could generalize to real-time summarization for other social media posts.

URI (handle)

http://hdl.handle.net/1903/20792

Collections

UMD Theses and Dissertations
Computer Science Theses and Dissertations

Full item page