Theses and Dissertations from UMD

Permanent URI for this community: http://hdl.handle.net/1903/2

New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a four-month delay before a given thesis/dissertation appears in DRUM.

More information is available at Theses and Dissertations at University of Maryland Libraries.

Search Results

Now showing 1 - 2 of 2
  • Item
    Modeling Dependencies in Natural Languages with Latent Variables
    (2011) Huang, Zhongqiang; Harper, Mary; Resnik, Philip; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    In this thesis, we investigate the use of latent variables to model complex dependencies in natural languages. Traditional models, which have a fixed parameterization, often make strong independence assumptions that lead to poor performance. This problem is often addressed by incorporating additional dependencies into the model (e.g., using higher order N-grams for language modeling). These added dependencies can increase data sparsity and/or require expert knowledge, together with trial and error, in order to identify and incorporate the most important dependencies (as in lexicalized parsing models). Traditional models, when developed for a particular genre, domain, or language, are also often difficult to adapt to another. In contrast, previous work has shown that latent variable models, which automatically learn dependencies in a data-driven way, are able to flexibly adjust the number of parameters based on the type and the amount of training data available. We have created several different types of latent variable models for a diverse set of natural language processing applications, including novel models for part-of-speech tagging, language modeling, and machine translation, and an improved model for parsing. These models perform significantly better than traditional models. We have also created and evaluated three different methods for improving the performance of latent variable models. While these methods can be applied to any of our applications, we focus our experiments on parsing. The first method involves self-training, i.e., we train models using a combination of gold standard training data and a large amount of automatically labeled training data. We conclude from a series of experiments that the latent variable models benefit much more from self-training than conventional models, apparently due to their flexibility to adjust their model parameterization to learn more accurate models from the additional automatically labeled training data. 
The second method takes advantage of the variability among latent variable models to combine multiple models for enhanced performance. We investigate several different training protocols to combine self-training with model combination. We conclude that these two techniques are complementary to each other and can be effectively combined to train very high quality parsing models. The third method replaces the generative multinomial lexical model of latent variable grammars with a feature-rich log-linear lexical model to provide a principled solution to address data sparsity, handle out-of-vocabulary words, and exploit overlapping features during model induction. We conclude from experiments that the resulting grammars are able to effectively parse three different languages. This work contributes to natural language processing by creating flexible and effective latent variable models for several different languages. Our investigation of self-training, model combination, and log-linear models also provides insights into the effective application of these machine learning techniques to other disciplines.
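The self-training recipe described above is generic: train on gold-standard data, automatically label a large unlabeled pool with that model, then retrain on the union. A minimal sketch of that loop, using a toy one-dimensional nearest-centroid classifier in place of a latent-variable parser (the model, data, and labels here are illustrative assumptions, not from the thesis):

```python
import random

def train_centroids(data):
    # data: list of (x, label) pairs; the "model" is the per-class mean
    sums, counts = {}, {}
    for x, y in data:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, x):
    # assign x to the class with the nearest centroid
    return min(model, key=lambda y: abs(x - model[y]))

random.seed(0)
# small gold-standard training set: two classes, well separated
gold = [(random.gauss(0.0, 0.5), "A") for _ in range(5)] + \
       [(random.gauss(4.0, 0.5), "B") for _ in range(5)]
# much larger unlabeled pool drawn from the same two populations
unlabeled = [random.gauss(0.0, 0.5) for _ in range(50)] + \
            [random.gauss(4.0, 0.5) for _ in range(50)]

base = train_centroids(gold)
# self-training: auto-label the pool with the base model, retrain on the union
auto_labeled = [(x, predict(base, x)) for x in unlabeled]
self_trained = train_centroids(gold + auto_labeled)
```

The retrained model's centroids are estimated from ten times as many points as the base model's, which is the effect the thesis reports latent variable models exploiting especially well.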
  • Item
    Finite Mixture Model Specifications Accommodating Treatment Nonresponse in Experimental Research
    (2009) Wasko, John A.; Hancock, Gregory R; Measurement, Statistics and Evaluation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    For researchers exploring causal inferences with simple two-group experimental designs, results are confounded when using common statistical methods, and those methods are further unsuitable in cases of treatment nonresponse. In signal processing, researchers have successfully extracted multiple signals from data streams with Gaussian mixture models, whose use is well matched to researchers in this predicament. While the mathematics underpinning the models in either application remains unchanged, there are stark differences. In signal processing, results are definitively evaluated by assessing whether the extracted signals are interpretable. Such obvious feedback is unavailable to researchers seeking causal inference, who instead rely on empirical evidence from inferential statements regarding mean differences, as in analysis of variance (ANOVA). Two-group experimental designs do provide an added benefit by anchoring treatment nonrespondents' distributional response properties to the control group. Obtaining empirical evidence supporting treatment nonresponse, however, can be extremely challenging. First, if nonresponse indeed exists, then basic tests of population means, ANOVA, or repeated measures tests cannot be used, because the identical-distribution property each method requires is violated. Second, the mixing parameter, the proportion of nonresponse, is bounded between 0 and 1 and so does not follow normal distribution theory, precluding inference by common methods. This dissertation introduces and evaluates the performance of an information-based methodology as a more extensible and informative alternative to statistical tests of population means while addressing treatment nonresponse. Gaussian distributions are not required under this methodology, which simultaneously provides empirical evidence, through model selection, regarding treatment nonresponse, equality of population means, and equality of variance hypotheses.
The use of information criteria as an omnibus assessment of a set of mixture and non-mixture models within a maximum likelihood framework eliminates the need for a Neyman-Pearson framework of probabilistic inferences on individual parameter estimates. This dissertation assesses performance in recapturing population conditions for hypothesis conclusions, parameter accuracy, and class membership. More complex extensions addressing multiple treatments, multiple responses within a treatment, a priori consideration of covariates, and multivariate responses within a latent framework are also introduced.
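The model-selection idea can be sketched directly: fit a single Gaussian and a two-component Gaussian mixture (via EM) to a treatment group's responses, then compare information criteria; the mixture winning despite its extra parameters is the empirical evidence for nonresponse. A minimal plain-Python sketch under assumed simulated data (the 60/40 response split, effect size, and all parameter choices are illustrative, not from the dissertation):

```python
import math
import random

def normal_pdf(x, mu, sd):
    # density of N(mu, sd^2) at x
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def fit_single(data):
    # one-Gaussian model: MLE mean and sd; 2 free parameters
    n = len(data)
    mu = sum(data) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    ll = sum(math.log(normal_pdf(x, mu, sd)) for x in data)
    return ll, 2

def fit_mixture(data, iters=100):
    # crude EM for a two-component 1-D Gaussian mixture; 5 free parameters
    n = len(data)
    m = sum(data) / n
    sd1 = sd2 = math.sqrt(sum((x - m) ** 2 for x in data) / n)
    mu1, mu2 = min(data), max(data)
    pi = 0.5
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point
        resp = []
        for x in data:
            a = pi * normal_pdf(x, mu1, sd1)
            b = (1.0 - pi) * normal_pdf(x, mu2, sd2)
            resp.append(a / (a + b))
        # M-step: re-estimate mixing weight, means, and sds
        n1 = sum(resp)
        n2 = n - n1
        mu1 = sum(r * x for r, x in zip(resp, data)) / n1
        mu2 = sum((1.0 - r) * x for r, x in zip(resp, data)) / n2
        sd1 = max(math.sqrt(sum(r * (x - mu1) ** 2 for r, x in zip(resp, data)) / n1), 1e-3)
        sd2 = max(math.sqrt(sum((1.0 - r) * (x - mu2) ** 2 for r, x in zip(resp, data)) / n2), 1e-3)
        pi = n1 / n
    ll = sum(math.log(pi * normal_pdf(x, mu1, sd1) +
                      (1.0 - pi) * normal_pdf(x, mu2, sd2)) for x in data)
    return ll, 5

def aic(ll, k):
    # Akaike information criterion: lower is better
    return 2.0 * k - 2.0 * ll

random.seed(1)
# illustrative treatment group: 60% respond (mean shifts to 3), 40% do not
treated = [random.gauss(3.0, 1.0) for _ in range(120)] + \
          [random.gauss(0.0, 1.0) for _ in range(80)]
ll1, k1 = fit_single(treated)
ll2, k2 = fit_mixture(treated)
```

Because the mixture carries five parameters to the single Gaussian's two, the AIC penalty forces it to earn its extra complexity in log-likelihood, which is exactly the omnibus comparison the abstract describes in place of per-parameter significance tests.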