Linguistics Theses and Dissertations

Permanent URI for this collection: http://hdl.handle.net/1903/2787

  • Fine-Grained Linguistic Soft Constraints on Statistical Natural Language Processing Models
    (2009) Marton, Yuval Yehezkel; Resnik, Philip; Linguistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    This dissertation focuses on the effective combination of data-driven natural language processing (NLP) approaches with linguistic knowledge sources based on manual text annotation or on word grouping according to semantic commonalities. I apply fine-grained linguistic soft constraints -- of a syntactic or semantic nature -- to statistical NLP models, evaluated in end-to-end, state-of-the-art statistical machine translation (SMT) systems. The introduction of semantic soft constraints involves intrinsic evaluation on word-pair similarity ranking tasks, extension from words to phrases, application in a novel distributional paraphrase generation technique, and the introduction of a generalized framework in which these semantic and syntactic soft constraints can be viewed as instances and, potentially, combined. In many cases, fine granularity is key to the successful combination of these soft constraints. I show how to softly constrain SMT models by adding fine-grained weighted features, each preferring the translation of only a specific syntactic constituent; previous attempts using coarse-grained features yielded negative results. I also show how to softly constrain corpus-based semantic models of words (“distributional profiles”) to effectively create word-sense-aware models, using the semantic word groupings found in a manually compiled thesaurus; previous attempts, which used hard constraints and produced aggregated, coarse-grained models, yielded lower gains. A novel paraphrase generation technique incorporating these soft semantic constraints, based on the Distributional Hypothesis, is then also evaluated in an SMT system. The main advantage of this technique over current “pivoting” techniques for paraphrasing is its independence from parallel texts, which are a limited resource. The evaluation is done by augmenting translation models with paraphrase-based translation rules, where fine-grained scoring of the paraphrase-based rules yields significantly higher gains. The model augmentation includes a novel semantic reinforcement component: in many cases there are alternative paths for generating a paraphrase-based translation rule, and each path reinforces a dedicated score for the “goodness” of the new rule. This augmented score is then used as a soft constraint, in a weighted log-linear feature, letting the translation model learn how much to “trust” the paraphrase-based translation rules (a minimal sketch of such log-linear scoring appears after this list). The work reported here is the first to use distributional semantic similarity measures to improve the performance of an end-to-end phrase-based SMT system. The unified framework for statistical NLP models with soft linguistic constraints enables, in principle, the combination of semantic and syntactic constraints -- and potentially others -- in a single SMT model.
  • Spin: Lexical Semantics, Transitivity, and the Identification of Implicit Sentiment
    (2007-08-01) Greene, Stephan Charles; Resnik, Philip; Linguistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Current interest in automatic sentiment analysis is motivated by a variety of information requirements. The vast majority of work in sentiment analysis has targeted the detection of subjective statements and the mining of opinions. This dissertation focuses on a different but related problem that has to date received relatively little attention in NLP research: detecting implicit sentiment, or spin, in text. This text classification task is distinguished from other sentiment analysis work in that there is no assumption that the documents to be classified are overt expressions of opinion; rather, they are documents that might reveal a perspective. The dissertation describes a novel approach to the identification of implicit sentiment, motivated by ideas drawn from the literature on lexical semantics and argument structure, and supported and refined through psycholinguistic experimentation. A relationship predictive of sentiment is established for components of meaning that are thought to drive verbal argument selection and linking, and to arbitrate what is foregrounded or backgrounded in discourse. In computational experiments employing targeted lexical selection for verbs and nouns, a set of features reflective of these components of meaning is extracted for the selected terms. As observable proxies for the underlying semantic components, these features are exploited by machine learning methods for text classification with respect to perspective (a minimal sketch of this proxy-feature idea appears after this list). After initial experimentation with manually selected lexical resources, the method is generalized to require no manual selection or hand tuning of any kind. The robustness of this linguistically motivated method is demonstrated by successfully applying it to three distinct text domains under a number of different experimental conditions, obtaining the best classification accuracies yet reported for several sentiment classification tasks. Finally, a novel graph-based classifier combination method is introduced, which further improves classification accuracy by integrating statistical classifiers with models of inter-document relationships.
  • Necessary Bias in Natural Language Learning
    (2007-05-08) Pearl, Lisa Sue; Weinberg, Amy; Linguistics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    This dissertation investigates the mechanism of language acquisition, given the boundary conditions provided by linguistic representation and the time course of acquisition. Exploring the mechanism is vital once we consider the complexity of the system to be learned and the non-transparent relationship between the observable data and the underlying system. It is not enough to restrict the potential systems the learner could acquire, which can be done by defining a finite set of parameters the learner must set. Even supposing that the system is defined by n binary parameters, we must still explain how the learner converges on the correct system(s) out of the 2^n possible systems, using data that are often highly ambiguous and exception-filled. The main discovery from the case studies presented here is that learners can in fact succeed, provided they are biased to use only a subset of the available input that is perceived as a cleaner representation of the underlying system. The case studies are embedded in a framework that conceptualizes language learning as three separable components, on the assumption that learning is the process of selecting the best-fit option given the available data. These components are (1) a defined hypothesis space, (2) a definition of the data used for learning (the data intake), and (3) an algorithm that updates the learner's belief in the available hypotheses based on the data intake (a minimal sketch of this framework appears after this list). One benefit of this framework is that the components can be investigated individually. Moreover, defining the learning components in this somewhat abstract manner allows us to apply the framework to a range of language learning problems and linguistic domains. In addition, we can combine discrete linguistic representations with probabilistic methods, and so account for the gradualness and variation in learning that human children display. The tool of exploration for these case studies is computational modeling, which proves very useful for addressing the feasibility, sufficiency, and necessity of data intake filtering, since these questions would be very difficult to address with traditional experimental techniques. In addition, the results of computational modeling can generate predictions that can then be tested experimentally.
  • Syntactic Identity and Locality Restrictions on Verbal Ellipsis
    (2004-05-03) Murguia, Elixabete; Uriagereka, Juan; Weinberg, Amy; Linguistics
    This dissertation investigates verbal ellipsis in English. Two main issues are addressed: (i) the identity condition that restricts the application of ellipsis, and (ii) the different locality restrictions that apply to elliptical constructions. The identity condition is examined from the point of view of competence, while the locality condition is given a natural answer from the processing domain. Furthermore, a parsing algorithm based on minimalist grammars is defined. Chapter 1 introduces the topic. Chapters 2 and 3 deal with the syntactic identity condition. Chapter 2 reviews proposals in the literature, namely Lasnik (1995b), Kitagawa (1991), and Fiengo and May (1994); all of these analyses examine controversial examples in which, apparently, partial syntactic identity holds between antecedent and gap. Chapter 3 presents a new analysis that assumes late lexical insertion, in the spirit of derivational morphology (Marantz 1993), and offers a unified account of all the cases of partial identity introduced in the previous chapter. It is argued that syntactic identity must be respected, and that the crucial notion for ellipsis is identity of syntactic categories, a condition that is met before lexical items are inserted. The different readings that obtain under ellipsis (i.e., sloppy and strict readings) are explained as emerging at different points in the derivation: before and after lexical insertion, respectively. Chapter 4 reviews one proposal in the parsing literature (Lappin and McCord 1990) and the problems it faces. Chapter 5 offers a processing account of the locality restrictions on gapping (as opposed to VPE and pseudogapping); these restrictions are analyzed as the result of (i) the absence or presence of tense (Fodor 1985), (ii) low initial attachment of coordinates, and (iii) Spell-out operations that render syntactic structure unavailable (Uriagereka 1999). A two-fold ellipsis resolution process is presented, in which some work is done on-line and some at the LF level. Chapter 6 defines an algorithm based on minimalist grammar operations, specifically on the preference of Merge over Move over Spell-out (as defined by Weinberg 1999), thereby showing that minimalist grammar models can be translated into computational models (a minimal sketch of this preference ordering appears after this list). Chapter 7 presents the conclusions.
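
Illustrative sketches

The abstracts above describe their methods only at a high level, so the following sketches are illustrative reconstructions, not the authors' actual implementations. First, for Marton (2009): a weighted log-linear scorer in which each fine-grained soft constraint is one feature among many, and a feature's weight encodes how much the model "trusts" that constraint. All feature names, weights, and values here are hypothetical.

import math

# Hypothetical feature weights; a real SMT system would tune these on
# held-out data (e.g., via minimum error rate training).
WEIGHTS = {
    "translation_model": 1.0,
    "language_model": 0.8,
    # Fine-grained soft constraints: one weighted feature per syntactic
    # constituent type, rather than a single coarse-grained syntax feature.
    "constituent_NP": 0.3,
    "constituent_VP": 0.1,
    # Learned "trust" in paraphrase-based translation rules.
    "paraphrase_rule": -0.2,
}

def loglinear_score(features):
    """Score a translation hypothesis as a weighted sum of feature values."""
    return sum(WEIGHTS[name] * value for name, value in features.items())

# Toy hypothesis: model probabilities enter as log values; soft-constraint
# features fire as counts (e.g., how often an NP was translated as a unit).
hypothesis = {
    "translation_model": math.log(0.02),
    "language_model": math.log(0.005),
    "constituent_NP": 2.0,
    "constituent_VP": 1.0,
    "paraphrase_rule": 1.0,  # this hypothesis uses one paraphrase-based rule
}
print(loglinear_score(hypothesis))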
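
Second, for Greene (2007): the abstract describes extracting observable proxy features for components of meaning and classifying documents by perspective with machine learning. The sketch below substitutes a deliberately simple stand-in -- hand-picked lexical targets and a nearest-centroid classifier -- for the dissertation's actual feature set and learners; every lexical item and label is hypothetical.

import re
from collections import Counter

# Hypothetical lexical targets: transitive-verb framings foreground an
# agent, while nominalized framings background one. The dissertation
# derives its targets from lexical resources, not a hand-picked list.
VERBS = {"shot", "killed", "injured"}
NOMINALS = {"shooting", "killing", "injury"}

def features(text):
    """Map a document to proportions of agentive vs. backgrounded framings."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    verb_hits = sum(counts[v] for v in VERBS)
    nominal_hits = sum(counts[n] for n in NOMINALS)
    total = max(verb_hits + nominal_hits, 1)
    return (verb_hits / total, nominal_hits / total)

def centroid(vectors):
    """Average a list of 2-dimensional feature vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(2))

def classify(text, centroids):
    """Assign the label whose centroid is nearest in feature space."""
    f = features(text)
    return min(centroids, key=lambda label: sum(
        (f[i] - centroids[label][i]) ** 2 for i in range(2)))

# Toy training documents labeled by perspective.
train = {
    "perspective_A": ["the soldiers shot and killed three protesters"],
    "perspective_B": ["the shooting resulted in an injury and a killing"],
}
centroids = {label: centroid([features(d) for d in docs])
             for label, docs in train.items()}
print(classify("witnesses described the killing and the shooting", centroids))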
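
Third, for Pearl (2007): the three-component framework lends itself directly to a probabilistic toy model. The sketch below assumes, purely for illustration, that the hypothesis space is the set of settings of n binary parameters, that the data intake filter discards ambiguous data points, and that the update algorithm is Bayesian re-weighting; the dissertation's case studies define each component per learning problem.

from itertools import product

# Component 1: the hypothesis space -- all settings of n binary parameters.
n = 3
hypotheses = list(product([0, 1], repeat=n))
beliefs = {h: 1 / len(hypotheses) for h in hypotheses}  # uniform prior

# Component 2: the data intake -- a bias toward the cleaner subset of input.
def intake_filter(datum):
    """Keep only data points the learner perceives as unambiguous."""
    index, value, ambiguous = datum
    return not ambiguous

# Component 3: the update algorithm -- Bayesian re-weighting of beliefs.
def likelihood(hypothesis, datum):
    """Toy likelihood: informative data favor matching parameter values."""
    index, value, ambiguous = datum
    return 0.9 if hypothesis[index] == value else 0.1

def update(beliefs, datum):
    posterior = {h: b * likelihood(h, datum) for h, b in beliefs.items()}
    z = sum(posterior.values())
    return {h: p / z for h, p in posterior.items()}

# Each toy datum is (parameter index, observed value, ambiguous?).
data = [(0, 1, False), (1, 0, True), (0, 1, False), (2, 1, False)]
for datum in filter(intake_filter, data):
    beliefs = update(beliefs, datum)
print(max(beliefs, key=beliefs.get))  # best-fit grammar given the intake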
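
Fourth, for Murguia (2004): the parsing algorithm's preference of Merge over Move over Spell-out can be pictured as a fixed ranking over the candidate operations available at each parse state. The operation names follow the abstract; the candidate sets and targets below are invented placeholders, since real candidates would be licensed by the grammar and the current derivation.

# Fixed ranking: lower number = preferred (Merge > Move > Spell-out).
PREFERENCE = {"merge": 0, "move": 1, "spell_out": 2}

def choose_action(candidates):
    """Pick the highest-ranked operation among those currently available."""
    return min(candidates, key=lambda action: PREFERENCE[action[0]])

# Invented candidate sets at successive parse states, each a list of
# (operation, target) pairs.
states = [
    [("merge", "DP with V'"), ("spell_out", "vP")],
    [("move", "wh-phrase"), ("spell_out", "CP")],
    [("spell_out", "CP")],
]
for candidates in states:
    op, target = choose_action(candidates)
    print(f"apply {op} to {target}")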