Computer Science Theses and Dissertations

Permanent URI for this collection: http://hdl.handle.net/1903/2756

Search Results

Now showing 1 - 2 of 2
  • Item
    Identifying Semantic Divergences Across Languages
    (2019) Vyas, Yogarshi; Carpuat, Marine; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Cross-lingual resources such as parallel corpora and bilingual dictionaries are cornerstones of multilingual natural language processing (NLP). They have been used to study the nature of translation, to train automatic machine translation systems, and to transfer models across languages for an array of NLP tasks. However, the majority of work in cross-lingual and multilingual NLP assumes that translations recorded in these resources are semantically equivalent. This is often not the case: words and sentences that are considered to be translations of each other frequently diverge in meaning, often in systematic ways. In this thesis, we focus on such mismatches in meaning in text that we expect to be aligned across languages. We term such mismatches cross-lingual semantic divergences. The core claim of this thesis is that translation is not always meaning preserving, which leads to cross-lingual semantic divergences that affect multilingual NLP tasks. Detecting such divergences requires ways of directly characterizing differences in meaning across languages through novel cross-lingual tasks, as well as models that account for translation ambiguity and do not rely on expensive, task-specific supervision. We support this claim through three main contributions. First, we show that human annotators identify a large fraction of the data in multilingual resources (such as parallel corpora and bilingual dictionaries) as semantically divergent. Second, we introduce cross-lingual tasks that characterize differences in word meaning across languages by identifying the semantic relation between two words. We also develop methods to predict such semantic relations, as well as a model to predict whether sentences in different languages have the same meaning. Finally, we demonstrate the impact of divergences by applying the methods developed in the previous sections to two downstream tasks. We first show that our model for identifying semantic relations between words helps separate equivalent word translations from divergent translations in the context of bilingual dictionary induction, even when the two words are close in meaning. We also show that identifying and filtering semantic divergences in parallel data allows a neural machine translation system to be trained twice as fast without sacrificing quality.
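To make the filtering idea above concrete, here is a minimal, hypothetical Python sketch of divergence-based filtering of a parallel corpus. The thesis trains a cross-lingual model to judge whether a sentence pair shares the same meaning; the `divergence_score` function below is a crude stand-in (a length-ratio heuristic), not the thesis's actual model, and all names are illustrative.

```python
# Hypothetical sketch: filter a parallel corpus by a divergence score.
# `divergence_score` is a placeholder; the thesis uses a trained
# cross-lingual semantic-equivalence model, not this length heuristic.

def divergence_score(src: str, tgt: str) -> float:
    """Return a score in [0, 1]; higher means more semantically divergent.

    Stand-in heuristic for illustration only: a large mismatch in
    sentence length often accompanies added or dropped content.
    """
    shorter, longer = sorted((len(src), len(tgt)))
    return 1.0 - shorter / max(longer, 1)

def filter_parallel_corpus(pairs, threshold=0.5):
    """Keep only pairs the scorer judges close enough to equivalent."""
    return [(s, t) for s, t in pairs if divergence_score(s, t) < threshold]

if __name__ == "__main__":
    corpus = [
        ("The cat sat on the mat.", "Le chat était assis sur le tapis."),
        # Divergent: the French adds a clause absent from the English.
        ("He left early.", "Il est parti tôt parce qu'il était très fatigué."),
    ]
    for src, tgt in filter_parallel_corpus(corpus):
        print(src, "|||", tgt)
```

Under this sketch, training proceeds on the retained pairs only; per the abstract, filtering of this kind is what yields the twofold training speedup without a loss in translation quality.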
  • Item
    On an Apparent Limit to Verb Idiosyncrasy, Given a Mapping between Argument Realization and Polysemy (or Argument Optionality)
    (2007-10-02) Thomas, Scott; Perlis, Don; Oates, Tim; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Full-scale natural language processing systems require extensive information about thousands of words. This is especially true for systems that handle the meanings of words and phrases, and it seems especially true for the verbs of a language: at first glance at least, and when viewed as if they were argument-taking functions, verbs seem to have highly individual requirements along at least two dimensions. (1) They vary in the range of arguments they take (further complicated by polysemy, i.e., the proliferation of their senses). (2) To a significant extent, they vary in the way those arguments are realized in syntax. Since arbitrary information must be stored anyway, such as the particular concept paired with the sound and/or spelling of a word, it seems reasonable to expect to store other potentially idiosyncratic information, including what might be needed for polysemy and argument realization. But once the meanings of words are stored, it is not completely clear how much else really needs to be stored, in principle. Given the significant degree of patterning in polysemy and in argument realization, real speakers extrapolate from known senses and realizations. To fully model the processing of natural language, there must be at least some automatic production, and/or verification, of polysemy and argument realization from the semantics. Since there are two phenomena here (polysemy and argument realization), the interaction between them could be crucial; indeed, particular instances of this interaction appear again and again in theoretical studies of syntax and meaning. Yet the real extent of the interaction has not itself been properly investigated. To do so, we supply, for the argument-taking configurations of 3000 English verbs, the typical kind of semantic specification, namely the roles of their arguments, and perform a high-level analysis of the resulting patterns. The results suggest a rule of co-occurrence: divergences in argument realization are rigorously accompanied by divergences in polysemy or argument optionality (as sketched below). We argue that this implies the existence of highly productive mechanisms for polysemy and argument realization, thus setting crucial groundwork for their eventual production by automated means.
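As a rough illustration of the co-occurrence rule, here is a hypothetical Python sketch of the kind of verb lexicon analyzed above, with a check for violations: frames of a single sense that share roles and optionality yet differ in syntactic realization. The data model and all names are illustrative assumptions, not the thesis's actual encoding of the 3000-verb lexicon.

```python
# Hypothetical sketch of a verb lexicon: each verb maps to senses, and
# each sense to frames pairing semantic roles, optionality, and a
# surface realization. Structure and names are illustrative only.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Frame:
    roles: tuple          # semantic roles, e.g. ("Agent", "Theme")
    optional: frozenset   # roles that may be omitted in this frame
    realization: str      # surface pattern, e.g. "NP V NP"

@dataclass
class VerbEntry:
    lemma: str
    senses: dict = field(default_factory=dict)  # sense id -> list of Frames

def violates_cooccurrence_rule(entry: VerbEntry) -> bool:
    """Report whether two frames of the same sense share roles and
    optionality yet differ in realization -- the pattern the analysis
    suggests is absent across the verbs studied."""
    for frames in entry.senses.values():
        seen = {}
        for f in frames:
            key = (f.roles, f.optional)
            if key in seen and seen[key] != f.realization:
                return True
            seen.setdefault(key, f.realization)
    return False

# "break" alternates between transitive and inchoative uses, but the
# change in realization comes with a change in sense and roles, so the
# co-occurrence rule is respected.
break_entry = VerbEntry("break", senses={
    "cause-to-fracture": [Frame(("Agent", "Theme"), frozenset(), "NP V NP")],
    "fracture":          [Frame(("Theme",), frozenset(), "NP V")],
})
print(violates_cooccurrence_rule(break_entry))  # False
```

A verb flagged True here would be a counterexample to the rule; the thesis's finding is that divergent realizations are instead reliably accompanied by divergences in sense or argument optionality.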