Combining Linguistic and Machine Learning Techniques for Word Alignment Improvement

dc.contributor.advisor: Dorr, Bonnie J [en_US]
dc.contributor.author: Ayan, Necip Fazil [en_US]
dc.contributor.department: Computer Science [en_US]
dc.contributor.publisher: Digital Repository at the University of Maryland [en_US]
dc.contributor.publisher: University of Maryland (College Park, Md.) [en_US]
dc.description.abstract: Alignment of words, i.e., detection of corresponding units between two sentences that are translations of each other, has been shown to be crucial for the success of many NLP applications such as statistical machine translation (MT), construction of bilingual lexicons, word-sense disambiguation, and projection of resources between languages. With the availability of large parallel texts, statistical word alignment systems have proven to be quite successful on many language pairs. However, these systems still face several challenges due to the complexity of the word alignment problem, insufficient training data, difficulty learning statistics correctly, translation divergences, and the lack of a means for incremental incorporation of linguistic knowledge. This thesis presents two new frameworks to improve existing word alignments using supervised learning techniques. In the first framework, two rule-based approaches are introduced. The first approach, Divergence Unraveling for Statistical MT (DUSTer), specifically targets translation divergences and corrects the alignment links related to them using a set of manually crafted, linguistically motivated rules. In the second approach, Alignment Link Projection (ALP), the rules are generated automatically by adapting transformation-based error-driven learning to the word alignment problem. By conditioning the rules on the initial alignment and the linguistic properties of the words, ALP manages to categorize the errors of the initial system and correct them. The second framework, Multi-Align, is an alignment combination framework based on classifier ensembles. The thesis presents a neural-network-based implementation of Multi-Align, called NeurAlign. By treating individual alignments as classifiers, NeurAlign builds an additional model to learn how to combine the input alignments effectively.
The evaluations show that the proposed techniques yield significant improvements (up to 40% relative error reduction) over existing word alignment systems on four different language pairs, even with limited manually annotated data. Moreover, all three systems allow an easy integration of linguistic knowledge into statistical models without the need for large modifications to existing systems. Finally, the improvements are analyzed using various measures, including the impact of improved word alignments in an external application: phrase-based MT. [en_US]
dc.format.extent: 1406702 bytes
dc.subject.pqcontrolled: Computer Science [en_US]
dc.subject.pquncontrolled: Natural Language Processing [en_US]
dc.subject.pquncontrolled: Computational Linguistics [en_US]
dc.subject.pquncontrolled: Machine Translation [en_US]
dc.subject.pquncontrolled: Machine Learning [en_US]
dc.subject.pquncontrolled: Word Alignment [en_US]
dc.subject.pquncontrolled: Classifier Ensemble [en_US]
dc.title: Combining Linguistic and Machine Learning Techniques for Word Alignment Improvement [en_US]
Original bundle: 1 file (1.34 MB, Adobe Portable Document Format)