Rich and Scalable Models for Text

dc.contributor.advisor: Boyd-Graber, Jordan
dc.contributor.advisor: Resnik, Philip
dc.contributor.author: Nguyen, Thang Dai
dc.contributor.department: Computer Science
dc.contributor.publisher: Digital Repository at the University of Maryland
dc.contributor.publisher: University of Maryland (College Park, Md.)
dc.date.accessioned: 2019-09-27T05:42:45Z
dc.date.available: 2019-09-27T05:42:45Z
dc.date.issued: 2019
dc.description.abstract:
Topic models have become essential tools for uncovering hidden structures in big data. However, the most popular topic model, Latent Dirichlet Allocation (LDA), and its extensions suffer from slow performance on large datasets. Recently, the machine learning community has attacked this problem with spectral learning approaches such as the method of moments with tensor decomposition or matrix factorization. The anchor word algorithm of Arora et al. [2013] has emerged as a more efficient approach to solving a large class of topic modeling problems. The anchor word algorithm is fast and comes with a provable theoretical guarantee: it converges to a global solution given a sufficient number of documents. In this thesis, we present a series of spectral models based on the anchor word algorithm that serve a broader class of datasets and provide richer, more flexible modeling capacity. First, we improve the anchor word algorithm by incorporating rich priors in the form of appropriate regularization terms. Our regularized anchor word algorithms produce higher topic quality and offer the flexibility to incorporate informed priors, making it possible to discover topics better aligned with external knowledge. Second, we enrich the anchor word algorithm with metadata-based word representations for labeled datasets. Our supervised anchor word algorithm runs very fast and predicts better than supervised topic models such as Supervised LDA on three sentiment datasets. In addition, sentiment anchor words, which play a vital role in generating sentiment topics, provide cues for understanding sentiment datasets better than unsupervised topic models do. Lastly, we examine ALTO, an active learning framework with a static topic overview, and investigate the usability of supervised topic models for active learning. We develop a new dynamic active learning framework that combines the informativeness and representativeness of documents, using dynamically updated topics from our fast supervised anchor word algorithm. Experiments on three multi-class datasets show that our new framework consistently improves classification accuracy over ALTO.
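
To make the anchor word idea concrete, the following is a minimal NumPy sketch of anchor-based topic recovery in the spirit of Arora et al. [2013]: build a row-normalized word co-occurrence matrix, greedily pick approximate anchor rows, and express every other word's row as a convex combination of the anchors. The function names, the toy data, and the simple projected-gradient recovery step are illustrative assumptions for this sketch, not the dissertation's implementation.

    # Minimal sketch of anchor-based topic recovery (after Arora et al. [2013]).
    # Illustrative only; the recovery step below is a simple stand-in for RecoverL2.
    import numpy as np

    def cooccurrence(doc_term, smoothing=1e-8):
        """Row-normalized word-word co-occurrence matrix Q_bar (V x V)."""
        Q = doc_term.T @ doc_term            # doc_term: (D, V) word counts
        np.fill_diagonal(Q, 0)               # drop self co-occurrence
        Q = Q + smoothing                    # avoid empty rows
        return Q / Q.sum(axis=1, keepdims=True)

    def find_anchors(Q_bar, K):
        """Greedy, Gram-Schmidt-style selection of K approximate anchor rows."""
        anchors, residual = [], Q_bar.copy()
        for _ in range(K):
            norms = np.linalg.norm(residual, axis=1)
            idx = int(np.argmax(norms))      # row farthest from current span
            anchors.append(idx)
            b = residual[idx] / (norms[idx] + 1e-12)
            residual = residual - np.outer(residual @ b, b)  # project out
        return anchors

    def recover_topics(Q_bar, anchors, n_iter=200, lr=0.1):
        """Approximate p(topic | word) via projected gradient onto the anchors."""
        A = Q_bar[anchors]                   # (K, V) anchor rows
        V, K = Q_bar.shape[0], len(anchors)
        C = np.full((V, K), 1.0 / K)         # coefficients, kept on the simplex
        for _ in range(n_iter):
            grad = (C @ A - Q_bar) @ A.T
            C = np.clip(C - lr * grad, 0, None)
            C = C / (C.sum(axis=1, keepdims=True) + 1e-12)
        return C

    # Toy usage: 100 documents over a 50-word vocabulary, 5 topics.
    rng = np.random.default_rng(0)
    doc_term = rng.poisson(0.5, size=(100, 50)).astype(float)
    Q_bar = cooccurrence(doc_term)
    anchors = find_anchors(Q_bar, K=5)
    topics = recover_topics(Q_bar, anchors)
    print("anchor word ids:", anchors)

The reason such methods scale, as the abstract emphasizes, is that all the work happens on a V x V co-occurrence matrix rather than by iterating over every document, so the cost of anchor selection and topic recovery does not grow with corpus size once the matrix is built.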
dc.identifier: https://doi.org/10.13016/qngo-ycdn
dc.identifier.uri: http://hdl.handle.net/1903/25057
dc.language.iso: en
dc.subject.pqcontrolled: Computer science
dc.subject.pquncontrolled: active learning
dc.subject.pquncontrolled: anchor word
dc.subject.pquncontrolled: classification
dc.subject.pquncontrolled: machine learning
dc.subject.pquncontrolled: matrix factorization
dc.subject.pquncontrolled: natural language processing
dc.title: Rich and Scalable Models for Text
dc.type: Dissertation

Files

Original bundle

Name: nguyen_umd_0117E_20313.pdf
Size: 1.91 MB
Format: Adobe Portable Document Format