Theses and Dissertations from UMD

Permanent URI for this communityhttp://hdl.handle.net/1903/2

New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a 4 month delay in the appearance of a give thesis/dissertation in DRUM

More information is available at Theses and Dissertations at University of Maryland Libraries.

Browse

Search Results

Now showing 1 - 2 of 2
  • Thumbnail Image
    Item
    Models, Inference, and Implementation for Scalable Probabilistic Models of Text
    (2014) Zhai, Ke; Boyd-Graber, Jordan; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Unsupervised probabilistic Bayesian models are powerful tools for statistical analysis, especially in the area of information retrieval, document analysis and text processing. Despite their success, unsupervised probabilistic Bayesian models are often slow in inference due to inter-entangled mutually dependent latent variables. In addition, the parameter space of these models is usually very large. As the data from various different media sources--for example, internet, electronic books, digital films, etc--become widely accessible, lack of scalability for these unsupervised probabilistic Bayesian models becomes a critical bottleneck. The primary focus of this dissertation is to speed up the inference process in unsupervised probabilistic Bayesian models. There are two common solutions to scale the algorithm up to large data: parallelization or streaming. The former achieves scalability by distributing the data and the computation to multiple machines. The latter assumes data come in a stream and updates the model gradually after seeing each data observation. It is able to scale to larger datasets because it usually takes only one pass over the entire data. In this dissertation, we examine both approaches. We first demonstrate the effectiveness of the parallelization approach on a class of unsupervised Bayesian models--topic models, which are exemplified by latent Dirichlet allocation (LDA). We propose a fast parallel implementation using variational inference on the MapRe- duce framework, referred to as Mr. LDA. We show that parallelization enables topic models to handle significantly larger datasets. We further show that our implementation--unlike highly tuned and specialized implementations--is easily extensible. We demonstrate two extensions possible with this scalable framework: 1) informed priors to guide topic discovery and 2) extracting topics from a multilingual corpus. We propose polylingual tree-based topic models to infer topics in multilingual corpora. We then propose three different inference methods to infer the latent variables. We examine the effectiveness of different inference methods on the task of machine translation in which we use the proposed model to extract domain knowledge that considers both source and target languages. We apply it on a large collection of aligned Chinese-English sentences and show that our model yields significant improvement on BLEU score over strong baselines. Other than parallelization, another approach to deal with scalability is to learn parameters in an online streaming setting. Although many online algorithms have been proposed for LDA, they all overlook a fundamental but challenging problem-- the vocabulary is constantly evolving over time. To address this problem, we propose an online LDA with infinite vocabulary--infvoc LDA. We derive online hybrid inference for our model and propose heuristics to dynamically order, expand, and contract the set of words in our vocabulary. We show that our algorithm is able to discover better topics by incorporating new words into the vocabulary and constantly refining the topics over time. In addition to LDA, we also show generality of the online hybrid inference framework by applying it to adaptor grammars, which are a broader class of models subsuming LDA. With proper grammar rules, it simplifies to the exact LDA model, however, it provides more flexibility to alter or extend LDA with different grammar rules. We develop online hybrid inference for adaptor grammar, and show that our method discovers high-quality structure more quickly than both MCMC and variational inference methods.
  • Thumbnail Image
    Item
    Sharing Private Data Over Public Networks
    (2012) Baden, Randy; Bhattacharjee, Bobby; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Users share their sensitive personal data with each other through public services and applications provided by third parties. Users trust application providers with their private data since they want access to provided services. However, trusting third parties with private data can be risky: providers profit by sharing that data with others regardless of the user's desires and may fail to provide the security necessary to prevent data leaks. Though users may choose between service providers, in many cases no service providers provide the desired service without being granted access to user data. Users must make a choice: forego privacy or be denied service. I demonstrate that fine-grained user privacy policies and rich services and applications are not irreconcilable. I provide technical solutions to privacy problems that protect user data using cryptography while still allowing services to operate on that data. I do this primarily through content-agnostic references to data items and user-controlled pseudonymity. I support two classes of social networking applications without trusting third parties with private data: applications which do not require data contents to provide a service, and applications that deal with data where the only private information is the binding of the data to an identity. Together, these classes of applications encompass a broad range of social networking applications.