Combining Linguistic and Machine Learning Techniques for Word Alignment Improvement

Thumbnail Image


umi-umd-2943.pdf (1.34 MB)
No. of downloads: 4179

Publication or External Link






Alignment of words, i.e., detection of corresponding units between two sentences that are translations of each other, has been shown to be crucial for the success of many NLP applications such as statistical machine translation (MT), construction of bilingual lexicons, word-sense disambiguation, and projection of resources between languages. With the availability of large parallel texts, statistical word alignment systems have proven to be quite successful on many language pairs. However, these systems are still faced with several challenges due to the complexity of the word alignment problem, lack of enough training data, difficulty learning statistics correctly, translation divergences, and lack of a means for incremental incorporation of linguistic knowledge.

This thesis presents two new frameworks to improve existing word alignments using supervised learning techniques. In the first framework, two rule-based approaches are introduced. The first approach, Divergence Unraveling for Statistical MT (DUSTer), specifically targets translation divergences and corrects the alignment links related to them using a set of manually-crafted, linguistically-motivated rules. In the second approach, Alignment Link Projection (ALP), the rules are generated automatically by adapting transformation-based error-driven learning to the word alignment problem. By conditioning the rules on initial alignment and linguistic properties of the words, ALP manages to categorize the errors of the initial system and correct them.

The second framework, Multi-Align, is an alignment combination framework based on classifier ensembles. The thesis presents a neural-network based implementation of Multi-Align, called NeurAlign. By treating individual alignments as classifiers, NeurAlign builds an additional model to learn how to combine the input alignments effectively.

The evaluations show that the proposed techniques yield significant improvements (up to 40% relative error reduction) over existing word alignment systems on four different language pairs, even with limited manually annotated data. Moreover, all three systems allow an easy integration of linguistic knowledge into statistical models without the need for large modifications to existing systems. Finally, the improvements are analyzed using various measures, including the impact of improved word alignments in an external application---phrase-based MT.