Rapid Resource Transfer for Multilingual Natural Language Processing
Date: 2005-12-02
Authors: Kolak, Okan
Advisor: Resnik, Philip
Abstract
Until recently, the focus of the Natural Language Processing (NLP)
community has been on a handful of mostly European languages. However,
rapid changes in the world's economic and political climate are shifting
the relative importance assigned to various languages. The importance of
rapidly acquiring NLP resources and computational capabilities in new
languages is widely accepted.
Statistical NLP models have a distinct advantage over rule-based methods
in achieving this goal, since they require far less manual labor. However,
statistical methods require two fundamental resources for training: (1)
online corpora and (2) manual annotations. Creating these two resources
can be as difficult as porting rule-based methods.
This thesis demonstrates the feasibility of acquiring both corpora and
annotations for new languages by exploiting existing resources for
well-studied languages: basic resources can be obtained rapidly and
cost-effectively by utilizing existing resources cross-lingually.
Currently, the most viable method of obtaining online corpora is
converting existing printed text into electronic form using Optical
Character Recognition (OCR). Unfortunately, a language that lacks online
corpora most likely lacks an OCR system as well. We tackle this problem by
taking an existing OCR system designed for one language and applying it to
another language with a similar script. We present a generative OCR model
that allows us to post-process the output of such a non-native OCR system,
achieving accuracy close to, or better than, that of a native one.
Furthermore, we show that the same method improves the performance of a
native or trained OCR system.
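As a rough illustration of this kind of post-processing (a minimal sketch, not the thesis model itself), OCR correction can be cast as noisy-channel decoding: for each observed token, choose the lexicon word that maximizes a language-model prior times a character-level confusion probability. The confusion table, lexicon, and function names below are hypothetical.

```python
import math

# Hypothetical character confusion probabilities P(observed | true),
# standing in for parameters a real system would estimate from aligned
# OCR output and ground-truth text.
CONFUSION = {
    ("c", "c"): 0.95, ("c", "e"): 0.05,
    ("i", "i"): 0.90, ("i", "l"): 0.10,
}

def char_emission(truth: str, observed: str) -> float:
    """P(observed char | true char), with default mass for unseen pairs."""
    if truth == observed:
        return CONFUSION.get((truth, observed), 0.90)
    return CONFUSION.get((truth, observed), 0.01)

def channel_prob(word: str, observed: str) -> float:
    """Crude per-character channel model; assumes equal lengths.
    A full model would handle insertions/deletions via edit alignment."""
    if len(word) != len(observed):
        return 0.0
    p = 1.0
    for t, o in zip(word, observed):
        p *= char_emission(t, o)
    return p

def correct(observed: str, lexicon: dict) -> str:
    """Return argmax over lexicon words of P(word) * P(observed | word)."""
    best, best_score = observed, float("-inf")
    for word, prior in lexicon.items():
        p = channel_prob(word, observed)
        if p > 0:
            score = math.log(prior) + math.log(p)
            if score > best_score:
                best, best_score = word, score
    return best

# Example: a non-native OCR system misreads "circle" as "eirele".
lexicon = {"circle": 0.001, "eagle": 0.0005}
print(correct("eirele", lexicon))  # -> "circle"
```

Because such a channel model rescores any OCR output, the same machinery can in principle be applied on top of a native system's output as well.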
Next, we demonstrate the cross-lingual utilization of treebank
annotations. We present an algorithm that projects dependency trees across
parallel corpora, and we show that a treebank of reasonable quality can be
generated by combining projection with a small amount of language-specific
post-processing. The projected treebank allows us to train a parser that
performs comparably to a parser trained on manually generated data.
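As a simplified illustration of the projection idea (a minimal sketch under the assumption of one-to-one word alignments), the code below transfers each head-dependent arc of a source-language tree through a word alignment to the target side. The function name and example data are hypothetical; the full algorithm must also handle unaligned and multiply-aligned words, which is part of why language-specific post-processing is needed.

```python
def project_dependencies(src_arcs, alignment):
    """Project dependency arcs across a word-aligned sentence pair.

    src_arcs: list of (head_idx, dep_idx) pairs over source tokens.
    alignment: dict mapping source index -> target index (one-to-one only).
    Returns the projected list of (head_idx, dep_idx) over target tokens.
    """
    tgt_arcs = []
    for head, dep in src_arcs:
        # Project an arc only when both endpoints are aligned.
        if head in alignment and dep in alignment:
            tgt_arcs.append((alignment[head], alignment[dep]))
    return tgt_arcs

# English "the dog barked" with arcs barked->dog and dog->the (0-based).
src_arcs = [(2, 1), (1, 0)]
# Hypothetical alignment to a target sentence with different word order.
alignment = {0: 1, 1: 0, 2: 2}
print(project_dependencies(src_arcs, alignment))  # -> [(2, 0), (0, 1)]
```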