Browsing by Author "Kolak, Okan"
Now showing 1 - 3 of 3
Item: Domain Tuning of Bilingual Lexicons for MT (2003-02-27)
Authors: Ayan, Necip Fazil; Dorr, Bonnie; Kolak, Okan

Our overall objective is to translate a domain-specific document in a foreign language (in this case, Chinese) to English. Using automatically induced domain-specific, comparable documents and language-independent clustering, we apply domain-tuning techniques to a bilingual lexicon for downstream translation of the input document to English. We describe our domain-tuning technique and demonstrate its effectiveness by comparing our results to manually constructed domain-specific vocabulary. Our coverage/accuracy experiments indicate that domain-tuned lexicons achieve 88% precision and 66% recall. We also ran a Bleu experiment to compare our domain-tuned version to its un-tuned counterpart in an IBM-style MT system. Our domain-tuned lexicons brought about an improvement in the Bleu scores: 9.4% higher than a system trained on a uniformly-weighted dictionary and 275% higher than a system trained on no dictionary at all.

Technical reports: UMIACS-TR-2003-19; LAMP-TR-096

Item: Evaluating Translational Correspondence using Annotation Projection (2003-04-04)
Authors: Hwa, Rebecca; Resnik, Philip; Weinberg, Amy; Kolak, Okan

Recently, statistical machine translation models have begun to take advantage of higher-level linguistic structures such as syntactic dependencies. Underlying these models is an assumption about the directness of translational correspondence between sentences in the two languages; however, the extent to which this assumption is valid and useful is not well understood. In this paper, we present an empirical study that quantifies the degree to which syntactic dependencies are preserved when parses are projected directly from English to Chinese.
Our results show that although the direct correspondence assumption is often too restrictive, a small set of principled, elementary linguistic transformations can boost the quality of the projected Chinese parses by 76% relative to the unimproved baseline.

Technical reports: UMIACS-TR-2003-25; LAMP-TR-100

Item: Rapid Resource Transfer for Multilingual Natural Language Processing (2005-12-02)
Authors: Kolak, Okan; Resnik, Philip
Affiliation: Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Until recently, the focus of the Natural Language Processing (NLP) community has been on a handful of mostly European languages. However, the rapid changes taking place in the economic and political climate of the world precipitate a similar change in the relative importance given to various languages. The importance of rapidly acquiring NLP resources and computational capabilities in new languages is widely accepted. Statistical NLP models have a distinct advantage over rule-based methods in achieving this goal, since they require far less manual labor. However, statistical methods require two fundamental resources for training: (1) online corpora and (2) manual annotations. Creating these two resources can be as difficult as porting rule-based methods. This thesis demonstrates the feasibility of acquiring both corpora and annotations by exploiting existing resources for well-studied languages. Basic resources for new languages can be acquired in a rapid and cost-effective manner by utilizing existing resources cross-lingually. Currently, the most viable method of obtaining online corpora is converting existing printed text into electronic form using Optical Character Recognition (OCR). Unfortunately, a language that lacks online corpora most likely lacks OCR as well. We tackle this problem by taking an existing OCR system that was designed for a specific language and using that OCR system for a language with a similar script.
We present a generative OCR model that allows us to post-process output from a non-native OCR system to achieve accuracy close to, or better than, a native one. Furthermore, we show that the performance of a native or trained OCR system can be improved by the same method. Next, we demonstrate cross-utilization of annotations on treebanks. We present an algorithm that projects dependency trees across parallel corpora. We also show that a reasonable quality treebank can be generated by combining projection with a small amount of language-specific post-processing. The projected treebank allows us to train a parser that performs comparably to a parser trained on manually generated data.