One of the core goals of Natural Language Processing (NLP) is to develop computational representations and methods to compare and contrast text meaning across languages. Such methods are essential to many NLP tasks, such as question answering and information retrieval. One limitation of those methods is their lack of sensitivity to fine-grained semantic divergences, i.e., subtle meaning differences between sentences that overlap in content. Yet, such differences abound even in parallel texts, i.e., texts in two different languages that are typically perceived as exact translations of each other. Detecting such fine-grained semantic divergences across languages matters both for machine translation systems, for which they yield challenging training samples, and for humans, who can benefit from a nuanced understanding of the source.

In this thesis, we focus on detecting fine-grained semantic divergences in parallel texts to improve machine and human translation understanding. In our first piece of work, we start by providing empirical evidence that such small meaning differences exist and can be reliably annotated at both the sentence and the sub-sentential level. Then, we show that they can be automatically detected by fine-tuning large pre-trained language models without supervision, by learning to rank synthetic divergences of varying granularity. In our second piece of work, we turn to analyzing the impact of fine-grained divergences on Neural Machine Translation (NMT) training and show that they negatively impact several aspects of NMT outputs, e.g., translation quality and confidence. Based on these findings, we present two orthogonal approaches to mitigating the negative impact of divergences and improving machine translation quality: first, we introduce a divergent-aware NMT framework that models divergences at training time; second, we present generation-based approaches for revising divergences in mined parallel texts to make the corresponding references more equivalent in meaning.

After exploring how subtle meaning differences in parallel texts impact machine translation systems, we switch gears to understand how divergence detection can be used by humans directly. In our last piece of work, we extend our divergence detection methods to explain divergences from a human-centered perspective. We introduce a lightweight iterative algorithm that extracts contrastive phrasal highlights, i.e., highlights of segments indicating where divergences reside within bilingual texts, by explicitly formalizing the alignment between them. We show that our approach produces contrastive phrasal highlights that match human-provided rationales of divergences better than prior explainability approaches. Finally, based on extensive application-grounded evaluations, we show that contrastive phrasal highlights help bilingual speakers detect fine-grained meaning differences in human-translated texts, as well as critical errors due to local mistranslations in machine-translated texts.