Computer Science Theses and Dissertations
Permanent URI for this collection: http://hdl.handle.net/1903/2756
Search Results (4 items)
Item: COMPLEXITY CONTROLLED NATURAL LANGUAGE GENERATION (2023)
Agrawal, Sweta; Carpuat, Marine; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Generating text at the right level of complexity, so that it can be easily understood by its target audience, has the potential to make information accessible to a wider range of people, including non-native speakers, language learners, and people with language or cognitive impairments. For example, a native Hindi speaker learning English might prefer reading a U.S. news article in English or Hindi tailored to their vocabulary and language proficiency level. Natural Language Generation (NLG), the use of computational models to generate human-like text, has empowered countless applications, from automatically summarizing financial and weather reports to enabling communication between multilingual communities through automatic translation. Although NLG has met with some success, current models ignore that there are many valid ways of conveying the same information in a text, and that selecting the appropriate variation requires knowing who the text is written for and what its intended purpose is. To address this, this thesis presents tasks, datasets, models, and algorithms designed to let users specify how simple or complex generated text should be in a given language.

We introduce the Complexity Controlled Machine Translation task, where the goal is to translate text from one language to another at a specific complexity level, defined by U.S. reading grade level. While standard machine translation (MT) tools generate a single output for each input, the models we design for this task produce translations at various complexity levels to suit the needs of different users. Building such models ideally requires rich annotation and resources for supervised training, i.e., examples of the same input text paired with several translations in the output language, which most MT datasets do not provide. Hence, we also contribute datasets that enable the generation and evaluation of complexity-controlled translations.

Furthermore, recognizing that when humans simplify a complex text in a given language, they often revise parts of it according to the intended audience, we present strategies to adapt general-purpose edit-based non-autoregressive models for controllable text simplification (TS). In this framing, the simplified output for a desired grade level is generated through a sequence of edit operations, such as deletions and insertions, applied to the complex input sequence. Because the model needs to learn a wide range of edit operations for different target grade levels, we introduce algorithms that inject additional guidance during training and inference, improving output quality while also providing users with the specific changes made to the input text. Finally, we present approaches that adapt general-purpose controllable TS models to grade-specific TS, leveraging unsupervised pre-training and low-level control tokens that describe the nature of TS edit operations as side constraints.
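To make the edit-based framing concrete, here is a minimal sketch of applying a predicted edit sequence to a complex source sentence. The edit vocabulary, the example edits, and the control-token convention are illustrative assumptions; in the thesis, a trained model predicts such operations rather than receiving them by hand.

```python
# Minimal sketch of edit-based simplification: "keep" and "delete" each
# consume one source token; "insert" emits a new (simpler) token.
# The example edits below are hand-written for illustration only.

def apply_edits(source_tokens, edits):
    output, i = [], 0
    for op, token in edits:
        if op == "keep":
            output.append(source_tokens[i])
            i += 1
        elif op == "delete":
            i += 1                 # drop the source token, emit nothing
        elif op == "insert":
            output.append(token)   # emit a substitute token
    return output

# A control token such as "<grade_4>" could be prepended to the model
# input as a side constraint to request a target reading level.
source = "The physician administered the medication".split()
edits = [("keep", None),
         ("delete", None), ("insert", "doctor"),
         ("delete", None), ("insert", "gave"),
         ("keep", None),
         ("delete", None), ("insert", "medicine")]
print(" ".join(apply_edits(source, edits)))
# -> The doctor gave the medicine
```

A side benefit of this formulation, noted in the abstract, is transparency: the edit sequence itself tells the user exactly which parts of the input were changed.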
Having developed models that enable complexity-controlled text generation, in the final part of the thesis we introduce a reading comprehension-based human evaluation framework designed to assess the correctness of texts generated by these systems using multiple-choice question answering. Furthermore, we evaluate whether this measure of correctness (the ability of native speakers to answer the questions correctly using the simplified texts) is captured by existing automatic metrics of text complexity or meaning preservation.

Item: Information Olfactation: Theory, Design, and Evaluation (2019)
Patnaik, Biswaksen; Elmqvist, Niklas; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Olfactory feedback for analytical tasks is a virtually unexplored area, in spite of the advantages it offers for information recall, feature identification, and location detection. Here we introduce the concept of "information olfactation" as the fragrant sibling of information visualization, and discuss how scent can be used to convey data. Building on a review of the human olfactory system, and mirroring common visualization practice, we propose olfactory marks, the substrate in which they exist, and the olfactory channels available to designers. To exemplify this idea, we present viScent 1.0: a six-scent stereo olfactory display capable of conveying olfactory glyphs of varying temperature and direction, along with a software system that integrates the display with a traditional visualization display. We also conduct a comprehensive perceptual experiment on information olfactation, i.e., the use of olfactory marks and channels to convey data. More specifically, following the example of graphical perception studies, we design an experiment that studies the perceptual accuracy of four olfactory channels (scent type, scent intensity, airflow, and temperature) for conveying three types of data: nominal, ordinal, and quantitative. We also present the details of an advanced 24-scent olfactory display, viScent 2.0, and the software framework we designed in order to run this experiment. Our results yield a ranking of olfactory channels for each data type that follows principles similar to the rankings for visual channels derived by Mackinlay, Cleveland & McGill, and Bertin.

Item: Improving Statistical Machine Translation Using Comparable Corpora (2010)
Snover, Matthew Garvey; Dorr, Bonnie; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

With thousands of languages in the world, and information being distributed across it with increasing speed and in increasing quantities, automatic translation between languages by computers, Machine Translation (MT), has become an increasingly important area of research. State-of-the-art MT systems rely not upon hand-crafted translation rules written by human experts, but rather on statistical models, learned from data, that translate a source language into a target language. These models are typically generated from large parallel corpora containing copies of text in both the source and target languages. The co-occurrence of words across languages in parallel corpora allows the creation of translation rules that specify the probability of translating words or phrases from one language to the other.
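As background for how such probabilistic rules arise, the sketch below estimates phrase translation probabilities by relative frequency over extracted phrase pairs. The phrase pairs here are invented for illustration; in a real system they would be extracted from word alignments over a large parallel corpus.

```python
# Toy sketch: estimating p(target | source) for phrase translation rules
# by relative frequency. The phrase pairs are invented for illustration.
from collections import Counter, defaultdict

def phrase_translation_probs(phrase_pairs):
    pair_counts = Counter(phrase_pairs)                  # count(src, tgt)
    src_counts = Counter(src for src, _ in phrase_pairs)  # count(src)
    table = defaultdict(dict)
    for (src, tgt), count in pair_counts.items():
        table[src][tgt] = count / src_counts[src]        # relative frequency
    return table

pairs = [("la maison", "the house"),
         ("la maison", "the house"),
         ("la maison", "the home"),
         ("maison", "house")]
print(phrase_translation_probs(pairs)["la maison"])
# -> {'the house': 0.666..., 'the home': 0.333...}
```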
Monolingual corpora, containing text in only one language--primarily the target language--are not used to model the translation process, but are used to better model the structure of the target language. Unlike parallel data, which require expensive human translators to generate, monolingual data are cheap and widely available. Topics and events similar to those in a source document being translated often occur in documents in a comparable monolingual corpus. In much the same way that a human translator would use world knowledge to aid translation, the MT system may be able to use these relevant documents from comparable corpora to guide translation, by biasing the translation system to produce output more similar to the relevant documents. This thesis seeks to answer the following questions: (1) Is it possible to improve a modern, state-of-the-art translation system by biasing the MT output to be more similar to relevant passages from comparable monolingual text? (2) What level of similarity is necessary to exploit these techniques? (3) What is the nature of the relevant passages needed during the application of these techniques?

To answer these questions, this thesis describes a method for generating new translation rules from monolingual data, targeted specifically at the document being translated. Rule generation leverages the existing translation system and the topical overlap between the foreign source text and the monolingual text, and unlike regular translation rule generation does not require parallel text. For each source document to be translated, potentially comparable documents are selected from the monolingual data using cross-lingual information retrieval. By biasing the MT system towards the selected relevant documents and then measuring the similarity of the biased output to those documents using Translation Edit Rate Plus (TERp), it is possible to identify sub-sentential regions of the source and comparable documents that are possible translations of each other. This process yields new translation rules in which the source side is taken from the document to be translated and the target side is fluent target-language text taken from the monolingual data. Using these rules results in improvements over a state-of-the-art statistical translation system. The techniques are most effective when there is a high degree of similarity between the source and relevant passages--such as when they report on the same news stories--but approximately half of that benefit can still be achieved when the passages are only historically or topically related. The demonstrated feasibility of improving MT by using comparable passages to bias MT output provides a basis for future investigation of problems of this type. Ultimately, the goal is to provide a framework within which translation rules may be generated without additional parallel corpora, allowing researchers to test longstanding hypotheses about machine translation in the face of scarce parallel resources.
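To illustrate the kind of similarity measurement involved, the sketch below computes a plain word-level edit rate between a hypothesis and a comparable passage. This is a deliberately simplified stand-in for TERp, which additionally allows block shifts and stem, synonym, and paraphrase matches with tuned edit costs.

```python
# Simplified stand-in for TER/TERp: word-level Levenshtein distance
# normalized by reference length (lower = more similar). Real TERp also
# handles block shifts and stem/synonym/paraphrase matches.

def word_edit_distance(hyp, ref):
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete hypothesis word
                          d[i][j - 1] + 1,        # insert reference word
                          d[i - 1][j - 1] + sub)  # match or substitute
    return d[m][n]

def edit_rate(hyp, ref):
    return word_edit_distance(hyp, ref) / max(len(ref), 1)

hyp = "the senate passed the bill on tuesday".split()
ref = "the senate approved the bill tuesday".split()
print(round(edit_rate(hyp, ref), 2))  # 2 edits / 6 words -> 0.33
```

A low edit rate between biased MT output and a comparable passage signals that the two spans may be translations of each other, which is what licenses extracting a new rule from them.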
Item: Interactive Visualizations for Trees and Graphs (2006-04-27)
Lee, Bongshin; Bederson, Benjamin B.; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Graphs are a very commonly used information structure and have been applied to a broad range of fields, from computer science to biology. There are several important issues to consider when designing graph visualizations. One of the most difficult problems researchers face is how to visualize large graphs. While an algorithm may produce good layouts for graphs of several hundred nodes, it may not scale well to several thousand. As the size of the graph increases, performance degrades rapidly, making it difficult to build an interactive system. Label readability also suffers, hindering users' ability to understand the graph data and perform many tasks. Finally, even if a system can lay out and display large graphs, the cognitive demands the visualization places on the user may be overwhelming.

This dissertation describes several design principles that address these issues and applies them to various graph visualization domains. Tightly coupled and highly customized views were used for graph visualization in a novel way. A new tree-layout approach to graph visualization was proposed, along with appropriate visualization and interaction techniques. When visualizing graphs as trees, the guiding metaphor "Plant a seed and watch it grow" was used to support information gathering and detailed exploration of the graph's local structure.

Three graph visualization systems guided by these design principles were developed and evaluated. First, PaperLens provides an abstract overview of the full dataset and shows relationships through interactive highlighting; it offers a novel alternative to the more common node-link diagram approach to graph visualization. Second, the development and evaluation of TaxonTree provided valuable insights that led to the design of TreePlus, a general interactive graph visualization component. Finally, TreePlus takes a tree-layout approach to graph visualization, transforming a graph into a tree plus cross links (the links not represented by the spanning tree) and using visualization, animation, and interaction techniques to reveal the graph structure while preserving label readability. Other contributions of this work include a task taxonomy for graph visualization and several specific applications of the graph visualization systems described above.
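As a minimal sketch of the "tree plus cross links" decomposition TreePlus relies on, the code below extracts a spanning tree from a small undirected graph and treats every remaining edge as a cross link. The use of breadth-first search and the toy graph are assumptions for illustration; the thesis pairs this decomposition with layout, animation, and interaction techniques not shown here.

```python
# Minimal sketch: split an undirected graph into a spanning tree (via
# BFS, an illustrative choice) plus the leftover "cross link" edges.
from collections import deque

def tree_plus_cross_links(adjacency, root):
    parent = {root: None}
    tree_edges = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nbr in adjacency[node]:
            if nbr not in parent:      # first visit: spanning-tree edge
                parent[nbr] = node
                tree_edges.append((node, nbr))
                queue.append(nbr)
    in_tree = {frozenset(e) for e in tree_edges}
    cross = {frozenset((u, v))
             for u in adjacency for v in adjacency[u]} - in_tree
    return tree_edges, [tuple(sorted(e)) for e in cross]

graph = {"A": ["B", "C"], "B": ["A", "C", "D"],
         "C": ["A", "B"], "D": ["B"]}
tree, cross = tree_plus_cross_links(graph, "A")
print(tree)   # [('A', 'B'), ('A', 'C'), ('B', 'D')]
print(cross)  # [('B', 'C')] -- the one edge not in the spanning tree
```

The payoff of this decomposition is that the tree part can use fast, readable tree layouts, while the cross links are revealed on demand during interaction.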