Identifying Semantic Divergences Across Languages

Vyas, Yogarshi

dc.contributor.advisor	Carpuat, Marine	en_US
dc.contributor.author	Vyas, Yogarshi	en_US
dc.contributor.department	Computer Science	en_US
dc.contributor.publisher	Digital Repository at the University of Maryland	en_US
dc.contributor.publisher	University of Maryland (College Park, Md.)	en_US
dc.date.accessioned	2020-02-01T06:40:36Z
dc.date.available	2020-02-01T06:40:36Z
dc.date.issued	2019	en_US
dc.description.abstract	Cross-lingual resources such as parallel corpora and bilingual dictionaries are cornerstones of multilingual natural language processing (NLP). They have been used to study the nature of translation, to train automatic machine translation systems, and to transfer models across languages for an array of NLP tasks. However, the majority of work in cross-lingual and multilingual NLP assumes that translations recorded in these resources are semantically equivalent. This is often not the case: words and sentences that are considered to be translations of each other frequently diverge in meaning, often in systematic ways. In this thesis, we focus on such mismatches in meaning in text that we expect to be aligned across languages. We term such mismatches cross-lingual semantic divergences. The core claim of this thesis is that translation is not always meaning-preserving, which leads to cross-lingual semantic divergences that affect multilingual NLP tasks. Detecting such divergences requires ways of directly characterizing differences in meaning across languages through novel cross-lingual tasks, as well as models that account for translation ambiguity and do not rely on expensive, task-specific supervision. We support this claim through three main contributions. First, we show that a large fraction of data in multilingual resources (such as parallel corpora and bilingual dictionaries) is identified as semantically divergent by human annotators. Second, we introduce cross-lingual tasks that characterize differences in word meaning across languages by identifying the semantic relation between two words. We also develop methods to predict such semantic relations, as well as a model to predict whether sentences in different languages have the same meaning. Finally, we demonstrate the impact of divergences by applying the methods developed in the previous sections to two downstream tasks.
We first show that our model for identifying semantic relations between words helps in separating equivalent word translations from divergent translations in the context of bilingual dictionary induction, even when the two words are close in meaning. We also show that identifying and filtering semantic divergences in parallel data helps in training a neural machine translation system twice as fast without sacrificing quality.	en_US
dc.identifier	https://doi.org/10.13016/rymp-ymgo
dc.identifier.uri	http://hdl.handle.net/1903/25448
dc.language.iso	en	en_US
dc.subject.pqcontrolled	Computer science	en_US
dc.subject.pqcontrolled	Linguistics	en_US
dc.subject.pquncontrolled	lexical semantics	en_US
dc.subject.pquncontrolled	machine learning	en_US
dc.subject.pquncontrolled	machine translation	en_US
dc.subject.pquncontrolled	multilingual nlp	en_US
dc.subject.pquncontrolled	natural language processing	en_US
dc.title	Identifying Semantic Divergences Across Languages	en_US
dc.type	Dissertation	en_US

Files

Original bundle

Name: Vyas_umd_0117E_20447.pdf
Size: 2.32 MB
Format: Adobe Portable Document Format

Collections

UMD Theses and Dissertations
Computer Science Theses and Dissertations