Computer Science Theses and Dissertations

Permanent URI for this collection: http://hdl.handle.net/1903/2756

Search Results

Now showing 1 - 5 of 5
  • Item
    A Computational Theory of the Use-Mention Distinction in Natural Language
    (2011) Wilson, Shomir; Perlis, Donald R.; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    To understand the language we use, we sometimes must turn language on itself, and we do this through an understanding of the use-mention distinction. In particular, we are able to recognize mentioned language: that is, tokens (e.g., words, phrases, sentences, letters, symbols, sounds) produced to draw attention to linguistic properties that they possess. Evidence suggests that humans frequently employ the use-mention distinction, and we would be severely handicapped without it; mentioned language frequently occurs for the introduction of new words, attribution of statements, explanation of meaning, and assignment of names. Moreover, just as we benefit from mutual recognition of the use-mention distinction, the potential exists for us to benefit from language technologies that recognize it as well. With a better understanding of the use-mention distinction, applications can be built to extract valuable information from mentioned language, leading to better language learning materials, precise dictionary building tools, and highly adaptive computer dialogue systems. This dissertation presents the first computational study of how the use-mention distinction occurs in natural language, with a focus on occurrences of mentioned language. Three specific contributions are made. The first is a framework for identifying and analyzing instances of mentioned language, in an effort to reconcile elements of previous theoretical work for practical use. Definitions for mentioned language, metalanguage, and quotation have been formulated, and a procedural rubric has been constructed for labeling instances of mentioned language. The second is a sequence of three labeled corpora of mentioned language, containing delineated instances of the phenomenon. The corpora illustrate the variety of mentioned language, and they enable analysis of how the phenomenon relates to sentence structure. Using these corpora, inter-annotator agreement studies have quantified the concurrence of human readers in labeling the phenomenon. The third contribution is a method for identifying common forms of mentioned language in text, using patterns in metalanguage and sentence structure. Although the full breadth of the phenomenon is likely to elude computational tools for the foreseeable future, some specific, common rules for detecting and delineating mentioned language have been shown to perform well.
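
    A minimal, hypothetical Python sketch of cue-phrase detection of mentioned language; the cue words and the quotation pattern below are invented illustrations, not the dissertation's rules:

      import re

      # Metalanguage cues that often precede a mentioned expression (assumed list).
      CUE_PATTERN = re.compile(
          r"\b(the (word|term|phrase|name)|called|named|known as|pronounced)\s+"
          r"[\"'\u201c]([^\"'\u201d]+)[\"'\u201d]",
          re.IGNORECASE,
      )

      def find_mentioned_language(sentence):
          """Return (cue, mentioned token) pairs found by the cue-plus-quotation rule."""
          return [(m.group(1), m.group(3)) for m in CUE_PATTERN.finditer(sentence)]

      examples = [
          'The word "serendipity" entered English in the 18th century.',
          "Their daughter was named 'Aurora' after the dawn.",
          "He walked to the store.",  # no mentioned language expected here
      ]
      for s in examples:
          print(s, "->", find_mentioned_language(s))
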
  • Item
    The Circle of Meaning: From Translation to Paraphrasing and Back
    (2010) Madnani, Nitin; Dorr, Bonnie; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    The preservation of meaning between inputs and outputs is perhaps the most ambitious and, often, the most elusive goal of systems that attempt to process natural language. Nowhere is this goal of more obvious importance than for the tasks of machine translation and paraphrase generation. Preserving meaning between the input and the output is paramount for both, the monolingual vs bilingual distinction notwithstanding. In this thesis, I present a novel, symbiotic relationship between these two tasks that I term the "circle of meaning''. Today's statistical machine translation (SMT) systems require high quality human translations for parameter tuning, in addition to large bi-texts for learning the translation units. This parameter tuning usually involves generating translations at different points in the parameter space and obtaining feedback against human-authored reference translations as to how good the translations. This feedback then dictates what point in the parameter space should be explored next. To measure this feedback, it is generally considered wise to have multiple (usually 4) reference translations to avoid unfair penalization of translation hypotheses which could easily happen given the large number of ways in which a sentence can be translated from one language to another. However, this reliance on multiple reference translations creates a problem since they are labor intensive and expensive to obtain. Therefore, most current MT datasets only contain a single reference. This leads to the problem of reference sparsity---the primary open problem that I address in this dissertation---one that has a serious effect on the SMT parameter tuning process. Bannard and Callison-Burch (2005) were the first to provide a practical connection between phrase-based statistical machine translation and paraphrase generation. However, their technique is restricted to generating phrasal paraphrases. I build upon their approach and augment a phrasal paraphrase extractor into a sentential paraphraser with extremely broad coverage. The novelty in this augmentation lies in the further strengthening of the connection between statistical machine translation and paraphrase generation; whereas Bannard and Callison-Burch only relied on SMT machinery to extract phrasal paraphrase rules and stopped there, I take it a few steps further and build a full English-to-English SMT system. This system can, as expected, ``translate'' any English input sentence into a new English sentence with the same degree of meaning preservation that exists in a bilingual SMT system. In fact, being a state-of-the-art SMT system, it is able to generate n-best "translations" for any given input sentence. This sentential paraphraser, built almost entirely from existing SMT machinery, represents the first 180 degrees of the circle of meaning. To complete the circle, I describe a novel connection in the other direction. I claim that the sentential paraphraser, once built in this fashion, can provide a solution to the reference sparsity problem and, hence, be used to improve the performance a bilingual SMT system. I discuss two different instantiations of the sentential paraphraser and show several results that provide empirical validation for this connection.
  • Item
    On an Apparent Limit to Verb Idiosyncrasy, Given a Mapping between Argument Realization and Polysemy (or Argument Optionality)
    (2007-10-02) Thomas, Scott; Perlis, Don; Oates, Tim; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Full-scale natural language processing systems require extensive information about thousands of words. This is especially true for systems handling the meanings of words and phrases, and it seems particularly true for the verbs of a language: at first glance at least, and when viewed as if they were argument-taking functions, verbs seem to have highly individual requirements, along at least two dimensions. (1) They vary in the range of arguments they take (further complicated by polysemy, i.e., the proliferation of their senses). And (2), to a significant extent, they vary in the way in which those arguments are realized in syntax. Since arbitrary information must be stored anyway---such as the particular concept pairing with the sound and/or spelling of a word---it seems reasonable to expect to store other potentially idiosyncratic information, including what might be needed for polysemy and argument realization. But once the meanings of words are stored, it isn't completely clear how much else really needs to be stored, in principle. Given a significant degree of patterning in polysemy and in argument realization, real speakers extrapolate from known senses and realizations. To fully model the processing of natural language, there must be at least some automatic production, and/or verification, of polysemy and argument realization from the semantics. Since there are two phenomena here (polysemy and argument realization), the interaction between them could be crucial; and indeed particular instances of this interaction appear again and again in theoretical studies of syntax and meaning. Yet the real extent of the interaction has not itself been properly investigated. To do so, we supply, for the argument-taking configurations of 3000 English verbs, the typical kind of semantic specification---on the roles of their arguments---and then perform a high-level analysis of the resulting patterns. The results suggest a rule of co-occurrences: divergences in argument realization are in fact rigorously accompanied by divergences in polysemy or argument optionality. We argue that this implies the existence of highly productive mechanisms for polysemy and argument realization, thus setting some crucial groundwork for their eventual production by automated means.
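
    A hypothetical Python sketch of the kind of check the proposed rule of co-occurrences suggests; the data layout, toy frames, and sense counts are all invented for illustration:

      from dataclasses import dataclass
      from itertools import combinations

      @dataclass(frozen=True)
      class VerbEntry:
          lemma: str
          realizations: frozenset   # syntactic frames, e.g. "NP V NP PP-with"
          sense_count: int          # rough proxy for polysemy
          optional_args: frozenset  # argument roles that may be omitted

      def violates_rule(a, b):
          """True if two entries diverge in realization but not in polysemy or optionality."""
          diverge_realization = a.realizations != b.realizations
          diverge_polysemy = a.sense_count != b.sense_count
          diverge_optionality = a.optional_args != b.optional_args
          return diverge_realization and not (diverge_polysemy or diverge_optionality)

      lexicon = [
          VerbEntry("spray", frozenset({"NP V NP PP-on", "NP V NP PP-with"}), 3, frozenset({"goal"})),
          VerbEntry("load",  frozenset({"NP V NP PP-on", "NP V NP PP-with"}), 3, frozenset({"goal"})),
          VerbEntry("pour",  frozenset({"NP V NP PP-on"}), 2, frozenset()),
      ]

      counterexamples = [(a.lemma, b.lemma)
                         for a, b in combinations(lexicon, 2) if violates_rule(a, b)]
      print("apparent counterexamples:", counterexamples)  # [] if the rule holds on this toy data
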
  • Item
    Rapid Resource Transfer for Multilingual Natural Language Processing
    (2005-12-02) Kolak, Okan; Resnik, Philip; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Until recently, the focus of the Natural Language Processing (NLP) community has been on a handful of mostly European languages. However, the rapid changes taking place in the economic and political climate of the world precipitate a similar change in the relative importance given to various languages. The importance of rapidly acquiring NLP resources and computational capabilities in new languages is widely accepted. Statistical NLP models have a distinct advantage over rule-based methods in achieving this goal, since they require far less manual labor. However, statistical methods require two fundamental resources for training: (1) online corpora and (2) manual annotations. Creating these two resources can be as difficult as porting rule-based methods. This thesis demonstrates the feasibility of acquiring both corpora and annotations by exploiting existing resources for well-studied languages. Basic resources for new languages can be acquired in a rapid and cost-effective manner by utilizing existing resources cross-lingually. Currently, the most viable method of obtaining online corpora is converting existing printed text into electronic form using Optical Character Recognition (OCR). Unfortunately, a language that lacks online corpora most likely lacks OCR as well. We tackle this problem by taking an existing OCR system that was designed for a specific language and using it for a language with a similar script. We present a generative OCR model that allows us to post-process output from a non-native OCR system to achieve accuracy close to, or better than, that of a native one. Furthermore, we show that the performance of a native or trained OCR system can be improved by the same method. Next, we demonstrate cross-utilization of annotations on treebanks. We present an algorithm that projects dependency trees across parallel corpora. We also show that a reasonable-quality treebank can be generated by combining projection with a small amount of language-specific post-processing. The projected treebank allows us to train a parser that performs comparably to a parser trained on manually generated data.
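
    A simplified Python sketch of direct projection of dependency edges through a word alignment; the edges and alignment are toy data, and the dissertation's algorithm additionally handles unaligned words and applies language-specific post-processing:

      def project_dependencies(src_edges, alignment):
          """Project (head, dependent) edges from source to target via a 1-to-1 alignment."""
          projected = []
          for head, dep in src_edges:
              if head in alignment and dep in alignment:
                  projected.append((alignment[head], alignment[dep]))
          return projected

      # Toy example: source "She reads books", target with a different word order.
      src_edges = [(1, 0), (1, 2)]    # reads -> She, reads -> books
      alignment = {0: 0, 1: 2, 2: 1}  # hypothetical source-to-target word alignment
      print(project_dependencies(src_edges, alignment))  # [(2, 0), (2, 1)]
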
  • Item
    Inducing Semantic Frames from Lexical Resources
    (2004-02-17) Green, Rebecca Joyce; Dorr, Bonnie J; Computer Science
    The multiplicity of ways in which propositional content can be expressed is often referred to as the paraphrase problem. This phenomenon creates challenges for such applications as information retrieval, information extraction, text summarization, and machine translation: Natural language understanding needs to recognize what remains constant across paraphrases, while natural language generation needs the ability to express content in various ways. Frame semantics is a theory of language understanding that addresses the paraphrase problem by providing slot-and-filler templates to represent frequently occurring, structured experiences. This dissertation introduces SemFrame, a system that induces semantic frames automatically from lexical resources (WordNet and the Longman Dictionary of Contemporary English [LDOCE]). Prior to SemFrame, semantic frames had been developed only by hand. In SemFrame, frames are first identified by enumerating groups of verb senses that evoke a common frame. This is done by combining evidence about pairs of semantically related verbs, based on LDOCE's subject field codes, words used in LDOCE definitions and WordNet glosses, WordNet's array of semantic relationships, etc. Pairs are gathered into larger groupings, deemed to correspond to semantic frames. Nouns associated with the verbs evoking a frame are then analyzed against WordNet's semantic network to identify nodes corresponding to frame slots. SemFrame is evaluated in two ways: (1) Compared against the handcrafted FrameNet, SemFrame achieves its best recall-precision balance with 83.2% recall (based on SemFrame's coverage of FrameNet frames) and 73.8% precision (based on SemFrame verbs' semantic relatedness to other frame-evoking verbs). A WordNet-hierarchy-based lower bound achieves 52.8% recall and 46.6% precision. (2) A frame-semantic-enhanced version of Hearst's TextTiling algorithm, applied to detecting boundaries between consecutive documents, improves upon the non-enhanced TextTiling algorithm at statistically significant levels. (Previous enhancement of the text segmentation algorithm with thesaural relationships had degraded performance.)
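
    An illustrative Python sketch of the first SemFrame step described above, gathering related verb-sense pairs into candidate frames; the pair scores, threshold, and the use of simple connected components are assumptions for illustration, not SemFrame's actual evidence-combination method:

      from collections import defaultdict

      # Toy pairwise relatedness scores standing in for combined evidence
      # (e.g., shared subject field codes, overlapping definition words).
      pair_scores = {
          ("buy.v.01", "sell.v.01"): 0.9,
          ("buy.v.01", "pay.v.01"): 0.7,
          ("sell.v.01", "pay.v.01"): 0.6,
          ("boil.v.01", "simmer.v.01"): 0.8,
          ("buy.v.01", "boil.v.01"): 0.1,
      }
      THRESHOLD = 0.5  # assumed cutoff for treating a pair as frame-evoking

      def group_into_frames(scores, threshold):
          """Group verb senses into candidate frames via connected components."""
          graph = defaultdict(set)
          for (a, b), s in scores.items():
              if s >= threshold:
                  graph[a].add(b)
                  graph[b].add(a)
          frames, seen = [], set()
          for node in graph:
              if node in seen:
                  continue
              stack, component = [node], set()
              while stack:
                  v = stack.pop()
                  if v not in component:
                      component.add(v)
                      stack.extend(graph[v] - component)
              seen |= component
              frames.append(sorted(component))
          return frames

      print(group_into_frames(pair_scores, THRESHOLD))
      # [['buy.v.01', 'pay.v.01', 'sell.v.01'], ['boil.v.01', 'simmer.v.01']]
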