Computer Science Theses and Dissertations
Permanent URI for this collection: http://hdl.handle.net/1903/2756
7 results
Item COMPLEXITY CONTROLLED NATURAL LANGUAGE GENERATION (2023) Agrawal, Sweta; Carpuat, Marine; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Generating text at the right level of complexity, so that it can be easily understood by its target audience, has the potential to make information more accessible to a wider range of people, including non-native speakers, language learners, and people who suffer from language or cognitive impairments. For example, a native Hindi speaker learning English might prefer reading a U.S. news article in English or Hindi tailored to their vocabulary and language proficiency level. Natural Language Generation (NLG), the use of computational models to generate human-like text, has been used to empower countless applications – from automatically summarizing financial and weather reports to enabling communication between multilingual communities through automatic translation. Although NLG has met with some success, current models ignore that there are many valid ways of conveying the same information in a text and that selecting the appropriate variation requires knowing who the text is written for and its intended purpose. To address this, in this thesis we present tasks, datasets, models, and algorithms that are designed to let users specify how simple or complex the generated text should be in a given language. We introduce the Complexity Controlled Machine Translation task, where the goal is to translate text from one language to another at a specific complexity level defined by the U.S. reading grade level. While standard machine translation (MT) tools generate a single output for each input, the models we design for this task produce translations at various complexity levels to suit the needs of different users. Building such models ideally requires rich annotation and resources for supervised training, i.e., examples of the same input text paired with several translations in the output language, which are not available in most datasets used in MT. Hence, we also contribute datasets that enable the generation and evaluation of complexity-controlled translations. Furthermore, recognizing that when humans simplify a complex text in a given language, they often revise parts of the complex text according to the intended audience, we present strategies to adapt general-purpose Edit-based Non-Autoregressive models for controllable text simplification (TS). In this framing, the simplified output for a desired grade level is generated through a sequence of edit operations, such as deletions and insertions, applied to the complex input sequence. As the model needs to learn to perform a wide range of edit operations for different target grade levels, we introduce algorithms to inject additional guidance during training and inference, which improves output quality while also showing users the specific changes made to the input text. Finally, we present approaches to adapt general-purpose controllable TS models that leverage unsupervised pre-training and low-level control tokens describing the nature of TS edit operations as side constraints for grade-specific TS.
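As a rough illustration of the edit-based framing described above, the sketch below applies a sequence of deletion and insertion operations to a complex sentence to produce a simplified output. It is a minimal, hypothetical example: the operation format, the grade-level token, and the apply_edits function are illustrative assumptions, not the models or data format used in the thesis.

```python
# Minimal sketch of edit-based text simplification (hypothetical format,
# not the thesis implementation): a simplified output is produced by
# applying explicit delete/insert operations to the complex input tokens.
# In the controllable setting, a control token such as "<grade_4>"
# (hypothetical) would be prepended to the input so the model predicts
# edits appropriate for that grade level.

from typing import List, Tuple

Edit = Tuple  # ("delete", position) or ("insert", position, token)

def apply_edits(tokens: List[str], edits: List[Edit]) -> List[str]:
    """Apply deletions first, then insertions (positions index the running output)."""
    out = [t for i, t in enumerate(tokens) if ("delete", i) not in edits]
    for op in edits:
        if op[0] == "insert":
            _, pos, token = op
            out.insert(pos, token)
    return out

complex_sent = "The physician administered the medication intravenously".split()

# Hypothetical edits a grade-controlled model might predict:
# replace rare words with simpler ones via delete + insert.
edits_grade4 = [
    ("delete", 1), ("insert", 1, "doctor"),
    ("delete", 2), ("insert", 2, "gave"),
    ("delete", 4), ("insert", 4, "medicine"),
    ("delete", 5), ("insert", 5, "through a vein"),
]

print(" ".join(apply_edits(complex_sent, edits_grade4)))
# -> "The doctor gave the medicine through a vein"
```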
Having developed models that enable complexity-controlled text generation, in the final part of the thesis we introduce a reading-comprehension-based human evaluation framework designed to assess the correctness of texts generated by these systems using multiple-choice question answering. Furthermore, we evaluate whether this measure of correctness (via the ability of native speakers to answer the questions correctly using the simplified texts) is captured by existing automatic metrics that measure text complexity or meaning preservation.

Item Stronger Inductive Biases for Sample-Efficient and Controllable Neural Machine Translation (2023) Xu, Weijia; Carpuat, Marine; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

As one of the oldest applications of natural language processing, machine translation (MT) has a growing impact on human lives, both as an end application and as a key component of cross-lingual information processing tasks such as cross-lingual information retrieval and dialogue generation. Although neural machine translation (NMT) models achieve impressive performance on some language pairs, they require training on large amounts of human translations. In addition, they are notorious for generating fluent outputs that do not faithfully reflect the meaning of the source sentence, and they make it difficult for users to control the outputs. To address these issues, this thesis contributes techniques to build more sample-efficient and controllable NMT models by incorporating stronger inductive biases that help correct undesirable biases, integrate prior knowledge, and introduce flexible ways to control the outputs in NMT. In our first line of research, we show that current NMT models are susceptible to undesirable biases that hinder sample-efficient training and lead to unfaithful translations. We further provide evidence that we can mitigate these undesirable biases by integrating stronger inductive biases through training algorithms. We start by introducing a new training objective to address the exposure bias problem — a common problem in sequence generation models that typically causes errors to accumulate along the generated sequence at inference time, especially when the training data is limited. Next, we turn to a well-known but less studied problem in MT — the hallucination problem — translation outputs that are unrelated to the source text. To find the spurious biases that cause hallucination errors, we first identify model symptoms that are indicative of hallucinations at inference time. We then show how these symptoms connect to spurious biases at training time, where the model learns to predict the ground-truth translation while ignoring a large part of the source sentence. These findings provide a path toward mitigating hallucinations by addressing these spurious biases. In our second line of research, we study how to integrate stronger inductive biases in NMT for effective integration of language priors estimated from unsupervised data. We introduce a novel semi-supervised learning objective with a theoretical guarantee on its global optimum and show that it can be effectively approximated and leads to improved performance in practice. Finally, we study inductive biases in the form of NMT model architectures that allow end users to control the model outputs more easily. Controlling the outputs of standard NMT models is difficult and incurs high computational cost at training or inference time. We develop an edit-based NMT model with novel edit operations that can incorporate users' lexical constraints with low computational cost at both training and inference time. To allow users to provide lexical constraints in more flexible morphological forms, we further introduce a modular framework for inflecting and integrating lexical constraints in NMT.
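To make the idea of incorporating a user's lexical constraint through edit operations concrete, here is a minimal, hypothetical sketch; the constraint format and both helper functions are illustrative assumptions, not the model proposed in the thesis.

```python
# Hypothetical sketch (not the thesis model): honoring a user's lexical
# constraint by expressing it as edit operations on a draft translation:
# delete the system's choice and insert the user's required term.

from typing import List, Tuple

def constraint_edits(draft: List[str], unwanted: str, required: str) -> List[Tuple]:
    """Return (op, position, token) edits that swap `unwanted` for `required`."""
    edits = []
    for i, tok in enumerate(draft):
        if tok == unwanted:
            edits.append(("delete", i, tok))
            edits.append(("insert", i, required))
    return edits

def apply(draft: List[str], edits: List[Tuple]) -> List[str]:
    out = list(draft)
    for op, pos, tok in edits:          # edits touch distinct positions here
        if op == "delete":
            out[pos] = None             # mark the deleted slot
        else:
            out[pos] = tok              # the insertion fills the deleted slot
    return [t for t in out if t is not None]

draft = "the image was taken at dawn".split()
edits = constraint_edits(draft, unwanted="image", required="photograph")
print(" ".join(apply(draft, edits)))    # -> "the photograph was taken at dawn"
```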
Item Improved Online Learning and Modeling for Feature-Rich Discriminative Machine Translation (2013) Eidelman, Vladimir; Resnik, Philip; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Most modern statistical machine translation (SMT) systems learn how to translate by constructing a discriminative model based on statistics from the data. A growing number of methods for discriminative training have been proposed, but most suffer from limitations that hinder their utility for training feature-rich models on large amounts of data. In this thesis, we present novel models and learning algorithms that address this issue by tackling three core problems for discriminative training: what to optimize, how to optimize, and how to represent the input. In addressing these issues, we develop fast learning algorithms that are both suitable for large-scale training and capable of generalization in high-dimensional feature spaces. The algorithms are developed in an online margin-based framework. While these methods are firmly established in machine learning, their adaptation to SMT is not straightforward. Thus, the first problem we address is what to optimize when learning for SMT. We define a family of objective functions for large-margin learning with loss-augmented inference over latent variables, and investigate their optimization performance in standard and high-dimensional feature spaces. After establishing what to optimize, the second problem we focus on is improving learning in the feature-rich space. We develop an online gradient-based algorithm that improves upon large-margin learning by considering and bounding the spread of the data while maximizing the margin. Utilizing the learning regimes developed thus far, we are able to focus on the third problem and introduce new features targeting generalization to new domains. We employ topic models to perform unsupervised domain induction, and introduce adaptation features based on probabilistic domain membership. As a final question, we look at how to take advantage of the latent derivation structure. In current models of SMT, there is an exponential number of derivations that produce the same translation, and the standard practice is to sidestep this ambiguity. In the final part of the thesis, we define a framework for latent variable models that explicitly takes advantage of all derivations in both learning and inference. We present a novel loss function for large-margin learning in that setting, along with a suitable optimization algorithm.
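As a rough, generic illustration of the online margin-based framework this abstract refers to, the following sketch performs a perceptron-style update with a margin check over sparse feature vectors. The feature names, margin value, and update rule are simplifying assumptions of mine, not the algorithms developed in the thesis.

```python
# Generic online large-margin update over sparse features (a simplified
# perceptron-with-margin sketch, not the thesis algorithm).
from collections import defaultdict

def score(weights, feats):
    return sum(weights[f] * v for f, v in feats.items())

def margin_update(weights, gold_feats, pred_feats, eta=0.1, margin=1.0):
    """If the gold hypothesis does not beat the prediction by `margin`,
    move weights toward the gold features and away from the predicted ones."""
    if score(weights, gold_feats) - score(weights, pred_feats) < margin:
        for f, v in gold_feats.items():
            weights[f] += eta * v
        for f, v in pred_feats.items():
            weights[f] -= eta * v

weights = defaultdict(float)
# Toy sparse feature vectors for a gold translation and a model hypothesis.
gold = {"lm_score": 0.8, "phrase:casa->house": 1.0}
pred = {"lm_score": 0.9, "phrase:casa->home": 1.0}

for _ in range(5):                      # a few online passes
    margin_update(weights, gold, pred)

print(dict(weights))
```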
Item Discriminative Interlingual Representations (2013) Jagarlamudi, Jagadeesh; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

The language barrier in many multilingual natural language processing (NLP) tasks can be overcome by mapping objects from different languages (“views”) into a common low-dimensional subspace. For example, the name transliteration task involves mapping bilingual names, and word translation mining involves mapping bilingual words, into a common low-dimensional subspace. Multi-view models learn such a low-dimensional subspace using a training corpus of paired objects, e.g., names written in different languages, represented as feature vectors. The central idea of my dissertation is to learn low-dimensional subspaces (or interlingual representations) that are effective for various multilingual and monolingual NLP tasks. First, I demonstrate the effectiveness of interlingual representations in mining bilingual word translations, and then proceed to develop models for diverse situations that often arise in NLP tasks. In particular, I design models for the following problem settings: 1) when there are more than two views but we only have training data from a single pivot view into each of the remaining views; 2) when an object from one view is associated with a ranked list of objects from another view; and 3) when the underlying objects have rich structure, such as a tree. These problem settings arise often in real-world applications. I choose a canonical task for each of the settings and compare my model with existing state-of-the-art baseline systems. I provide empirical evidence for the first two models on multilingual name transliteration and reranking for part-of-speech tagging, respectively. For the third problem setting, I experiment with the task of re-scoring target-language word translations based on the source word's context. The model proposed for this problem builds on the ideas proposed in the previous models and, hence, leads to a natural conclusion.

Item The Circle of Meaning: From Translation to Paraphrasing and Back (2010) Madnani, Nitin; Dorr, Bonnie; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

The preservation of meaning between inputs and outputs is perhaps the most ambitious and, often, the most elusive goal of systems that attempt to process natural language. Nowhere is this goal of more obvious importance than for the tasks of machine translation and paraphrase generation. Preserving meaning between the input and the output is paramount for both, the monolingual vs. bilingual distinction notwithstanding. In this thesis, I present a novel, symbiotic relationship between these two tasks that I term the "circle of meaning". Today's statistical machine translation (SMT) systems require high-quality human translations for parameter tuning, in addition to large bi-texts for learning the translation units. This parameter tuning usually involves generating translations at different points in the parameter space and obtaining feedback, measured against human-authored reference translations, as to how good the translations are. This feedback then dictates what point in the parameter space should be explored next. To measure this feedback, it is generally considered wise to have multiple (usually 4) reference translations to avoid unfair penalization of translation hypotheses, which could easily happen given the large number of ways in which a sentence can be translated from one language to another. However, this reliance on multiple reference translations creates a problem, since they are labor intensive and expensive to obtain. Therefore, most current MT datasets contain only a single reference. This leads to the problem of reference sparsity---the primary open problem that I address in this dissertation---one that has a serious effect on the SMT parameter tuning process.
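To illustrate why multiple references matter for tuning feedback, here is a toy sketch (a simplified unigram-precision calculation of my own, not the actual tuning metric used in the thesis) showing how a perfectly valid hypothesis can be penalized against a single reference but scored fairly once alternative references are available:

```python
# Toy illustration of reference sparsity (simplified unigram precision,
# not the actual MT tuning metric): a valid hypothesis that words things
# differently scores poorly against one reference but well once
# additional references are available.

def unigram_precision(hypothesis, references):
    hyp = hypothesis.split()
    ref_tokens = [set(r.split()) for r in references]
    # A hypothesis word counts as matched if it appears in any reference.
    matched = sum(1 for w in hyp if any(w in ref for ref in ref_tokens))
    return matched / len(hyp)

hypothesis = "the conference starts on monday"
single_ref = ["the meeting begins on monday"]
multi_refs = ["the meeting begins on monday",
              "the conference starts monday",
              "the meeting starts on monday"]

print(unigram_precision(hypothesis, single_ref))  # 0.6 (unfairly low)
print(unigram_precision(hypothesis, multi_refs))  # 1.0 (fair with more refs)
```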
Bannard and Callison-Burch (2005) were the first to provide a practical connection between phrase-based statistical machine translation and paraphrase generation. However, their technique is restricted to generating phrasal paraphrases. I build upon their approach and augment a phrasal paraphrase extractor into a sentential paraphraser with extremely broad coverage. The novelty of this augmentation lies in further strengthening the connection between statistical machine translation and paraphrase generation; whereas Bannard and Callison-Burch relied on SMT machinery only to extract phrasal paraphrase rules and stopped there, I take it a few steps further and build a full English-to-English SMT system. This system can, as expected, "translate" any English input sentence into a new English sentence with the same degree of meaning preservation that exists in a bilingual SMT system. In fact, being a state-of-the-art SMT system, it is able to generate n-best "translations" for any given input sentence. This sentential paraphraser, built almost entirely from existing SMT machinery, represents the first 180 degrees of the circle of meaning. To complete the circle, I describe a novel connection in the other direction. I claim that the sentential paraphraser, once built in this fashion, can provide a solution to the reference sparsity problem and, hence, be used to improve the performance of a bilingual SMT system. I discuss two different instantiations of the sentential paraphraser and show several results that provide empirical validation for this connection.

Item Improving Statistical Machine Translation Using Comparable Corpora (2010) Snover, Matthew Garvey; Dorr, Bonnie; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

With thousands of languages in the world, and the increasing speed and quantity of information being distributed across the world, automatic translation between languages by computers, Machine Translation (MT), has become an increasingly important area of research. State-of-the-art MT systems rely not upon hand-crafted translation rules written by human experts, but rather on learned statistical models that translate a source language to a target language. These models are typically generated from large, parallel corpora containing copies of text in both the source and target languages. The co-occurrence of words across languages in parallel corpora allows the creation of translation rules that specify the probability of translating words or phrases from one language to the other. Monolingual corpora, containing text only in one language--primarily the target language--are not used to model the translation process, but are used to better model the structure of the target language. Unlike parallel data, which require expensive human translators to generate, monolingual data are cheap and widely available. Topics and events similar to those in a source document that is being translated often occur in documents in a comparable monolingual corpus. In much the same way that a human translator would use world knowledge to aid translation, the MT system may be able to use these relevant documents from comparable corpora to guide translation by biasing the translation system to produce output more similar to the relevant documents.
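As a rough sketch of how relevant comparable documents might be selected for a document being translated, the example below ranks monolingual documents by TF-IDF cosine similarity to a query. This is a simplified, monolingual stand-in for the cross-lingual retrieval described later in the abstract, and the toy documents are made up.

```python
# Simplified stand-in for selecting comparable documents: rank monolingual
# documents by TF-IDF cosine similarity to a query document. The real
# system uses cross-lingual information retrieval; this sketch and its
# toy documents are illustrative only.
import math
from collections import Counter

def tfidf_vectors(docs):
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(docs)
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(toks).items()}
            for toks in tokenized]

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

corpus = [
    "the flood damaged homes along the river on tuesday",
    "the stock market rallied after the earnings report",
    "heavy rain caused river flooding and damaged several homes",
]
query = "river flooding damaged homes after heavy rain"

vecs = tfidf_vectors(corpus + [query])
query_vec, doc_vecs = vecs[-1], vecs[:-1]
ranked = sorted(enumerate(doc_vecs), key=lambda x: cosine(query_vec, x[1]),
                reverse=True)
print([corpus[i] for i, _ in ranked[:1]])   # most comparable document
```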
This thesis seeks to answer the following questions: (1) Is it possible to improve a modern, state-of-the-art translation system by biasing the MT output to be more similar to relevant passages from comparable monolingual text? (2) What level of similarity is necessary to exploit these techniques? (3) What is the nature of the relevant passages that are needed during the application of these techniques? To answer these questions, this thesis describes a method for generating new translation rules from monolingual data specifically targeted for the document being translated. Rule generation leverages the existing translation system and the topical overlap between the foreign source text and the monolingual text, and unlike regular translation rule generation it does not require parallel text. For each source document to be translated, potentially comparable documents are selected from the monolingual data using cross-lingual information retrieval. By biasing the MT system towards the selected relevant documents and then measuring the similarity of the biased output to the relevant documents using Translation Edit Rate Plus (TERp), it is possible to identify sub-sentential regions of the source and comparable documents that are possible translations of each other. This process results in the generation of new translation rules, where the source side is taken from the document to be translated and the target side is fluent target-language text taken from the monolingual data. The use of these rules results in improvements over a state-of-the-art statistical translation system. These techniques are most effective when there is a high degree of similarity between the source and relevant passages--such as when they report on the same news stories--but some benefit, approximately half of the full improvement, can be achieved when the passages are only historically or topically related. The demonstrated feasibility of improving MT by using comparable passages to bias MT output provides a basis for future investigation of problems of this type. Ultimately, the goal is to provide a framework within which translation rules may be generated without additional parallel corpora, thus allowing researchers to test longstanding hypotheses about machine translation in the face of scarce parallel resources.

Item Combining Linguistic and Machine Learning Techniques for Word Alignment Improvement (2005-11-23) Ayan, Necip Fazil; Dorr, Bonnie J; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Alignment of words, i.e., detection of corresponding units between two sentences that are translations of each other, has been shown to be crucial for the success of many NLP applications such as statistical machine translation (MT), construction of bilingual lexicons, word-sense disambiguation, and projection of resources between languages. With the availability of large parallel texts, statistical word alignment systems have proven to be quite successful on many language pairs. However, these systems still face several challenges due to the complexity of the word alignment problem, the lack of sufficient training data, the difficulty of learning statistics correctly, translation divergences, and the lack of a means for incremental incorporation of linguistic knowledge. This thesis presents two new frameworks for improving existing word alignments using supervised learning techniques. In the first framework, two rule-based approaches are introduced.
The first approach, Divergence Unraveling for Statistical MT (DUSTer), specifically targets translation divergences and corrects the alignment links related to them using a set of manually crafted, linguistically motivated rules. In the second approach, Alignment Link Projection (ALP), the rules are generated automatically by adapting transformation-based error-driven learning to the word alignment problem. By conditioning the rules on the initial alignment and the linguistic properties of the words, ALP manages to categorize the errors of the initial system and correct them. The second framework, Multi-Align, is an alignment combination framework based on classifier ensembles. The thesis presents a neural-network-based implementation of Multi-Align, called NeurAlign. By treating individual alignments as classifiers, NeurAlign builds an additional model that learns how to combine the input alignments effectively. The evaluations show that the proposed techniques yield significant improvements (up to 40% relative error reduction) over existing word alignment systems on four different language pairs, even with limited manually annotated data. Moreover, all three systems allow easy integration of linguistic knowledge into statistical models without large modifications to existing systems. Finally, the improvements are analyzed using various measures, including the impact of improved word alignments on an external application---phrase-based MT.
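As a loose illustration of treating individual word alignments as classifiers and combining them, the following sketch uses simple majority voting over alignment links. The voting rule and toy alignments are my own simplification; NeurAlign itself learns a neural combination model rather than voting.

```python
# Simplified illustration of alignment combination as a classifier ensemble:
# each input aligner "votes" on (source_index, target_index) links and the
# combined alignment keeps links supported by a majority. Majority voting is
# a stand-in here, not the NeurAlign model.
from collections import Counter

def combine_alignments(alignments, min_votes=None):
    """Each alignment is a set of (src_idx, tgt_idx) links."""
    if min_votes is None:
        min_votes = len(alignments) // 2 + 1       # strict majority
    votes = Counter(link for a in alignments for link in a)
    return {link for link, count in votes.items() if count >= min_votes}

# Toy outputs from three hypothetical aligners on the same sentence pair.
aligner_a = {(0, 0), (1, 2), (2, 1), (3, 3)}
aligner_b = {(0, 0), (1, 2), (2, 3), (3, 3)}
aligner_c = {(0, 0), (1, 1), (2, 1), (3, 3)}

print(sorted(combine_alignments([aligner_a, aligner_b, aligner_c])))
# -> [(0, 0), (1, 2), (2, 1), (3, 3)]
```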