Browsing by Author "Dorr, Bonnie"
Now showing 1 - 15 of 15
A Categorial Variation Database for English (2003-02-27)
Habash, Nizar; Dorr, Bonnie
We describe our approach to the construction and evaluation of a large-scale database called "CatVar" which contains categorial variations of English lexemes. Due to the prevalence of cross-language categorial variation in multilingual applications, our categorial-variation resource may serve as an integral part of a diverse range of natural language applications. Thus, the research reported herein overlaps heavily with that of the machine-translation, lexicon-construction, and information-retrieval communities. We apply the information-retrieval metrics of precision and recall to evaluate the accuracy and coverage of our database with respect to a human-produced gold standard. This evaluation reveals that the categorial database achieves a high degree of precision and recall. Additionally, we demonstrate that the database improves on the linkability of the Porter Stemmer by over 30%. UMIACS-TR-2003-13 LAMP-TR-095

Citation Handling for Improved Summarization of Scientific Documents (2011-07-25)
Whidby, Michael; Zajic, David; Dorr, Bonnie
In this paper we present the first steps toward improving summarization of scientific documents through citation analysis and parsing. Prior work (Mohammad et al., 2009) argues that citation texts (sentences that cite other papers) play a crucial role in automatic summarization of a topical area, but did not take into account the noise introduced by the citations themselves. We demonstrate that it is possible to improve summarization output through careful handling of these citations. We base our experiments on the application of an improved trimming approach to summarization of citation texts extracted from Question-Answering and Dependency-Parsing documents.
We demonstrate that confidence scores from the Stanford NLP Parser (Klein and Manning, 2003) are significantly improved, and that Trimmer (Zajic et al., 2007), a sentence-compression tool, is able to generate higher-quality candidates. Our summarization output is currently used as part of a larger system, Action Science Explorer (ASE) (Gove, 2011).

Construction of Chinese-English Semantic Hierarchy for Information Retrieval (2000-06-10)
Levow, Gina-Anne; Dorr, Bonnie; Lin, Dekang
This paper describes an approach to large-scale construction of a semantic hierarchy for Chinese verbs. Leveraging an existing Chinese conceptual database called HowNet and a Levin-based English verb classification, we use thematic-role information to create links between Chinese concepts and English classes. The resulting hierarchy is used for multilingual lexicons in an English-Chinese cross-language information retrieval application. We demonstrate a structured syntax interface that exploits this large-scale hierarchy and its linkages to WordNet for English-Chinese cross-language information retrieval. (Also cross-referenced as UMIACS-TR-2000-36) (Also cross-referenced as LAMP-TR-043)

Domain Tuning of Bilingual Lexicons for MT (2003-02-27)
Ayan, Necip Fazil; Dorr, Bonnie; Kolak, Okan
Our overall objective is to translate a domain-specific document in a foreign language (in this case, Chinese) to English. Using automatically induced domain-specific, comparable documents and language-independent clustering, we apply domain-tuning techniques to a bilingual lexicon for downstream translation of the input document to English. We will describe our domain-tuning technique and demonstrate its effectiveness by comparing our results to manually constructed domain-specific vocabulary. Our coverage/accuracy experiments indicate that domain-tuned lexicons achieve 88% precision and 66% recall.
We also ran a Bleu experiment to compare our domain-tuned version to its un-tuned counterpart in an IBM-style MT system. Our domain-tuned lexicons brought about an improvement in the Bleu scores: 9.4% higher than a system trained on a uniformly-weighted dictionary and 275% higher than a system trained on no dictionary at all. UMIACS-TR-2003-19 LAMP-TR-096

Domain-Specific Term-List Expansion Using Existing Linguistic Resources (2002-10-03)
Dorr, Bonnie; Zhao, Tiejun
This report describes a series of experiments involving expansion of a domain-specific human-generated "seed list" using available linguistic resources. The resources used for the expansion are intended to be general purpose: two large-scale Chinese-English dictionaries and a Chinese lexical knowledge base (HowNet). The methodology involves three steps: (1) hand extraction of head words from each entry in the human-generated seed list; (2) automatic comparison of these head words against entries in the linguistic resources, where an entry matches if the head word matches the entry exactly or is included in its semantic definition; and (3) collection of any resulting matching entries into a larger term list. The terms extracted by this process were verified manually to confirm whether they were relevant to the topic of a specific domain. An important contribution of this work is the finding that the use of a bilingual term list for the expansion process does not provide a significant improvement over the use of a simpler, more easily produced, monolingual term list. (Also LAMP-TR-092) (Also UMIACS-TR-2002-79)

Efficient Language Independent Generation from Lexical Conceptual Structures (2001-09-05)
Habash, Nizar; Dorr, Bonnie; Traum, David
This paper describes a system for generating natural-language sentences from an interlingual representation, Lexical Conceptual Structure (LCS).
The system has been developed as part of a Chinese-English Machine Translation system; however, it is designed to be used for many other MT language pairs and Natural Language applications. The contributions of this work include: (1) Development of a language-independent generation system that maximizes efficiency through the use of a hybrid rule-based/statistical module; (2) Enhancements to an interlingual representation and associated algorithms for interpretation of multiply ambiguous input sentences; (3) Development of an efficient reusable language-independent linearization module with a grammar description language that can be used with other systems; (4) Improvements to an earlier algorithm for hierarchically mapping thematic roles to surface positions; (5) Development of a diagnostic tool for lexicon coverage and correctness and use of the tool for verification of English, Spanish, and Chinese lexicons. An evaluation of translation quality shows comparable performance with a commercial translation system. The generation system can also be straightforwardly extended to other languages and this is demonstrated and evaluated for Spanish. Cross-referenced as UMIACS-TR-2001-43

Enhancing Automatic Acquisition of Thematic Structure in a Large-Scale Lexicon for Mandarin Chinese (1998-10-15)
Olsen, Mari Broman; Dorr, Bonnie; Thomas, Scott
This paper describes a refinement to our procedure for porting lexical conceptual structure into new languages. Specifically we describe a two-step process for creating candidate thematic grids for Mandarin Chinese verbs, using the English verb heading the VP in the subdefinitions to separate senses, and roughly parsing the verb complement structure to match to our thematic structure templates. The procedure is part of a larger process of creating a usable lexicon for interlingual machine translation from a large on-line resource with both too much and too little information necessary for our system.
(Also cross-referenced as UMIACS-TR-98-35)

Handling Translation Divergences in Generation-Heavy Hybrid Machine Translation (2002-04-04)
Habash, Nizar; Dorr, Bonnie
This paper describes a novel approach for handling translation divergences in a Generation-Heavy Hybrid Machine Translation (GHMT) system. The approach depends on the existence of rich target language resources such as word lexical semantics, including information about categorial variations and subcategorization frames. These resources are used to generate multiple structural variations from a target-glossed lexico-syntactic representation of the source language sentence. The multiple structural variations account for different translation divergences. The overgeneration of the approach is constrained by a target-language model using corpus-based statistics. The exploitation of target language resources (symbolic and statistical) to handle a problem usually reserved to Transfer and Interlingual MT is useful for translation from structurally divergent source languages with scarce linguistic resources. A preliminary evaluation on the application of this approach to Spanish-English MT proves this approach extremely promising. The approach, however, is not limited to MT, as it can be extended to monolingual NLG applications such as summarization. Also UMIACS-TR-2002-23 Also LAMP-TR-083

Handling Translation Divergences: Combining Statistical and Symbolic Techniques in Generation-Heavy Machine Translation (2002-05-22)
Habash, Nizar; Dorr, Bonnie
This paper describes a novel approach to handling translation divergences in a Generation-Heavy Hybrid Machine Translation (GHMT) system. The translation divergence problem is usually reserved for Transfer and Interlingual MT because it requires a large combination of complex lexical and structural mappings. A major requirement of these approaches is the accessibility of large amounts of explicit symmetrical knowledge for both source and target languages.
This limitation renders Transfer and Interlingual approaches ineffective in the face of structurally-divergent language pairs with asymmetrical resources. GHMT addresses the more common form of this problem, source-poor/target-rich, by fully exploiting symbolic and statistical target-language resources. This is accomplished by using target-language lexical semantics, categorial variations and subcategorization frames to overgenerate multiple lexico-structural variations from a target-glossed syntactic dependency of the source-language sentence. The symbolic overgeneration, which accounts for different possible translation divergences, is constrained by a statistical target-language model. (Also LAMP-TR-088) (Also UMIACS-TR-2002-49)

Large Scale Language Independent Generation Using Thematic Hierarchies (2001-09-05)
Habash, Nizar; Dorr, Bonnie
This paper describes a large-scale language-independent evaluation of the use of Thematic Hierarchies in natural language generation. We translate from a corpus of sentences reflecting the full variety of behavior of Levin-based verb classes. The corpus is used as input to a generation system that utilizes the same thematic hierarchy for realizing relative argument surface positions in two languages: English and Spanish. The output was manually evaluated by English and Spanish speakers. The contributions of this work include: (1) an improved thematic hierarchy over an earlier implementation; (2) a large-scale evaluation of the use of thematic hierarchies in two languages; (3) an implementation of a language independent module for natural language generation; and (4) the creation of a single tool for incremental development of multilingual lexicons.
Cross-referenced as UMIACS-TR-2001-59

A Modality Lexicon and its use in Automatic Tagging (European Language Resources Association, 2010-05)
Baker, Kathryn; Bloodgood, Michael; Dorr, Bonnie; Filardo, Nathaniel; Levin, Lori; Piatko, Christine
This paper describes our resource-building results for an eight-week JHU Human Language Technology Center of Excellence Summer Camp for Applied Language Exploration (SCALE-2009) on Semantically-Informed Machine Translation. Specifically, we describe the construction of a modality annotation scheme, a modality lexicon, and two automated modality taggers that were built using the lexicon and annotation scheme. Our annotation scheme is based on identifying three components of modality: a trigger, a target and a holder. We describe how our modality lexicon was produced semi-automatically, expanding from an initial hand-selected list of modality trigger words and phrases. The resulting expanded modality lexicon is being made publicly available. We demonstrate that one tagger, a structure-based tagger, results in precision around 86% (depending on genre) for tagging of a standard LDC data set. In a machine translation application, using the structure-based tagger to annotate English modalities on an English-Urdu training corpus improved the translation quality score for Urdu by 0.3 Bleu points in the face of sparse training data.

Semantically-Informed Syntactic Machine Translation: A Tree-Grafting Approach (2010-10)
Baker, Kathryn; Bloodgood, Michael; Callison-Burch, Chris; Dorr, Bonnie; Filardo, Nathaniel; Levin, Lori; Miller, Scott; Piatko, Christine
We describe a unified and coherent syntactic framework for supporting a semantically-informed syntactic approach to statistical machine translation. Semantically enriched syntactic tags assigned to the target-language training texts improved translation quality.
The resulting system significantly outperformed a linguistically naive baseline model (Hiero), and reached the highest scores yet reported on the NIST 2009 Urdu-English translation task. This finding supports the hypothesis (posed by many researchers in the MT community, e.g., in DARPA GALE) that both syntactic and semantic information are critical for improving translation quality, and further demonstrates that large gains can be achieved for low-resource languages with different word order than English.

Statistical Modality Tagging from Rule-based Annotations and Crowdsourcing (Association for Computational Linguistics, 2012-07-13)
Prabhakaran, Vinodkumar; Bloodgood, Michael; Diab, Mona; Dorr, Bonnie; Levin, Lori; Piatko, Christine; Rambow, Owen; Van Durme, Benjamin
We explore training an automatic modality tagger. Modality is the attitude that a speaker might have toward an event or state. One of the main hurdles for training a linguistic tagger is gathering training data. This is particularly problematic for training a tagger for modality because modality triggers are sparse for the overwhelming majority of sentences. We investigate an approach to automatically training a modality tagger where we first gathered sentences based on a high-recall simple rule-based modality tagger and then provided these sentences to Mechanical Turk annotators for further annotation.
We used the resulting set of training data to train a precise modality tagger using a multi-class SVM that delivers good performance.

Use of Modality and Negation in Semantically-Informed Syntactic MT (MIT Press, 2012-06-26)
Baker, Kathryn; Bloodgood, Michael; Dorr, Bonnie; Callison-Burch, Chris; Filardo, Nathaniel; Piatko, Christine; Levin, Lori; Miller, Scott
This article describes the resource- and system-building efforts of an 8-week Johns Hopkins University Human Language Technology Center of Excellence Summer Camp for Applied Language Exploration (SCALE-2009) on Semantically Informed Machine Translation (SIMT). We describe a new modality/negation (MN) annotation scheme, the creation of a (publicly available) MN lexicon, and two automated MN taggers that we built using the annotation scheme and lexicon. Our annotation scheme isolates three components of modality and negation: a trigger (a word that conveys modality or negation), a target (an action associated with modality or negation), and a holder (an experiencer of modality). We describe how our MN lexicon was semi-automatically produced and we demonstrate that a structure-based MN tagger results in precision around 86% (depending on genre) for tagging of a standard LDC data set. We apply our MN annotation scheme to statistical machine translation using a syntactic framework that supports the inclusion of semantic annotations. Syntactic tags enriched with semantic annotations are assigned to parse trees in the target-language training texts through a process of tree grafting. Although the focus of our work is modality and negation, the tree grafting procedure is general and supports other types of semantic information. We exploit this capability by including named entities, produced by a pre-existing tagger, in addition to the MN elements produced by the taggers described here.
The resulting system significantly outperformed a linguistically naive baseline model (Hiero), and reached the highest scores yet reported on the NIST 2009 Urdu–English test set. This finding supports the hypothesis that both syntactic and semantic information can improve translation quality.

Use of OCR for Rapid Construction of Bilingual Lexicons (2003-09-25)
Karagol-Ayan, Burcu; Doermann, David; Dorr, Bonnie
This paper describes an approach to analyzing the lexical structure of OCRed bilingual dictionaries to construct resources suited for machine translation of low-density languages, where online resources are limited. A rule-based and an HMM-based method are used for rapid construction of MT lexicons based on systematic structural clues provided in the original dictionary. We evaluate the effectiveness of our techniques, concluding that: (1) the rule-based method performs better on dictionaries with a simple structure; (2) the stochastic method performs better on dictionaries with an enriched structure; (3) regardless of the degree of dictionary richness, the rule-based method gives better results for phrasal entries than for single-word entries; and (4) our resulting bilingual lexicons are comprehensive enough to provide reasonable MT results when compared to human-constructed lexicons. (LAMP-TR-104) (CAR-TR-986) (UMIACS-TR-2003-78)