Streaming and Sketch Algorithms for Large Data NLP

dc.contributor.advisor: Daume III, Hal
dc.contributor.author: Goyal, Amit
dc.contributor.department: Computer Science
dc.contributor.publisher: Digital Repository at the University of Maryland
dc.contributor.publisher: University of Maryland (College Park, Md.)
dc.date.accessioned: 2013-10-02T05:31:35Z
dc.date.available: 2013-10-02T05:31:35Z
dc.date.issued: 2013
dc.description.abstract: The emergence of the World Wide Web, social media, and mobile devices has made large and rich quantities of text data available. Such vast data sets have led to leaps in performance on many statistically based NLP tasks. Given the magnitude of text data available, it is computationally prohibitive to train many complex Natural Language Processing (NLP) models on it. This motivates the hypothesis that simple models trained on big data can outperform more complex models trained on small data. My dissertation provides a solution for effectively and efficiently exploiting large data in many NLP applications. Data sets are growing at an exponential rate, much faster than memory capacity. To provide a memory-efficient solution for handling large data sets, this dissertation shows the limitations of existing streaming and sketch algorithms when applied to canonical NLP problems and proposes several new variants to overcome those shortcomings. Streaming and sketch algorithms process a large data set in one pass and represent it with a compact summary much smaller than the full input. These algorithms can easily be implemented in a distributed setting and provide a solution that is both memory- and time-efficient. However, the memory and time savings come at the expense of approximate solutions. In this dissertation, I demonstrate that approximate solutions achieved on large data are comparable to exact solutions on large data and outperform exact solutions on smaller data. I focus on NLP problems that boil down to tracking many statistics: storing approximate counts, computing approximate association scores such as pointwise mutual information (PMI), finding frequent items (such as n-grams), building streaming language models, and measuring distributional similarity. First, I introduce the concept of approximate streaming large-scale language models in NLP. Second, I present a novel variant of the Count-Min sketch that maintains approximate counts of all items. Third, I conduct a systematic study comparing many sketch algorithms that approximate counts of items, with a focus on large-scale NLP tasks. Last, I develop Fast Large-Scale Approximate Graph (FLAG), a system that quickly constructs a large-scale approximate nearest-neighbor graph from a large corpus.
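For orientation, the standard Count-Min sketch the abstract builds on can be illustrated with a short Python sketch. This is a minimal generic version, not the dissertation's variant; the class name, salted-hash scheme, and default width/depth are assumptions chosen for the example.

import hashlib

class CountMinSketch:
    """Minimal Count-Min sketch: approximate counts of stream items
    in space independent of the number of distinct items."""

    def __init__(self, width=2048, depth=4):
        self.width = width   # counters per row; wider -> smaller overestimates
        self.depth = depth   # independent hash rows; deeper -> higher confidence
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Per-row hash derived from a salted digest (an illustrative choice).
        h = hashlib.blake2b(item.encode(), salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(h[:8], "little") % self.width

    def update(self, item, count=1):
        # One pass over the stream: bump one counter in each row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def query(self, item):
        # The minimum over rows never underestimates the true count;
        # hash collisions can only inflate it.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

# Example: count tokens from a text stream in one pass.
cms = CountMinSketch()
for token in "the cat sat on the mat".split():
    cms.update(token)
print(cms.query("the"))  # 2, barring collisions (always an upper bound)

Given sketched word counts c(x) and, from a second sketch over word pairs, c(x, y), the approximate PMI the abstract mentions follows as log(N * c(x, y) / (c(x) * c(y))), where N is the corpus size.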
dc.identifier.uri: http://hdl.handle.net/1903/14451
dc.subject.pqcontrolled: Computer science
dc.subject.pquncontrolled: Count-Min sketch
dc.subject.pquncontrolled: Fast Large Scale Approximate Graph (FLAG)
dc.subject.pquncontrolled: Large data NLP
dc.subject.pquncontrolled: Nearest Neighbor Search
dc.subject.pquncontrolled: Sketch algorithms
dc.subject.pquncontrolled: Streaming
dc.title: Streaming and Sketch Algorithms for Large Data NLP
dc.type: Dissertation

Files

Original bundle
Name: Goyal_umd_0117E_14369.pdf
Size: 3.05 MB
Format: Adobe Portable Document Format