Improved Online Learning and Modeling for Feature-Rich Discriminative Machine Translation
Most modern statistical machine translation (SMT) systems learn how to translate by constructing a discriminative model based on statistics from the data. A growing number of methods for discriminative training have been proposed, but most suffer from limitations hindering their utility for training feature-rich models on large amounts of data.
In this thesis, we present novel models and learning algorithms that address these limitations by tackling three core problems for discriminative training: what to optimize, how to optimize, and how to represent the input. In addressing these problems, we develop fast learning algorithms that are both suitable for large-scale training and capable of generalization in high-dimensional feature spaces.
The algorithms are developed in an online margin-based framework. While these methods are firmly established in machine learning, their adaptation to SMT is not straightforward. Thus, the first problem we address is what to optimize when learning for SMT. We define a family of objective functions for large-margin learning with loss-augmented inference over latent variables, and investigate their optimization performance in standard and high-dimensional feature spaces.
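For orientation, one widely used objective of this kind is the latent structured hinge loss. The abstract does not state the thesis's own objectives, so the form and notation below are assumptions for illustration only: source sentence x, reference translation y*, candidate translation y, latent derivation d, feature map f, weight vector w, and cost function Δ (for example, one derived from a sentence-level translation metric).

```latex
\ell(w; x, y^{*}) \;=\;
  \max_{(y,\,d)} \Bigl[\, w^{\top} f(x, y, d) + \Delta(y^{*}, y) \,\Bigr]
  \;-\; \max_{d^{*}} \, w^{\top} f(x, y^{*}, d^{*})
```

The first maximization is the loss-augmented inference step: it searches for a translation and derivation that score highly under the model while also incurring high cost. The second term scores the best derivation of the reference. Other members of such a family could plausibly differ in which translations play these two roles.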
After establishing what to optimize, the second problem we focus on is improving learning in the feature-rich space. We develop an online gradient-based algorithm that improves upon large-margin learning by considering and bounding the spread of the data while maximizing the margin.
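As a point of reference, the following minimal sketch shows a standard passive-aggressive (PA-I) style online margin update on sparse feature vectors. It is not the algorithm developed in the thesis: it only enforces a margin between a preferred and a dispreferred hypothesis and does not bound the spread of the data. The function name `pa_update`, its parameters, and the aggressiveness constant `C` are hypothetical choices made for this illustration.

```python
def pa_update(weights, feats_good, feats_bad, cost, C=0.01):
    """One PA-I style step on sparse feature dicts (illustrative only).

    weights    -- current weight vector, feature name -> float
    feats_good -- features of the hypothesis we want ranked higher
    feats_bad  -- features of the hypothesis we want ranked lower
    cost       -- required margin between the two hypotheses
    C          -- cap on the step size (assumed hyperparameter)
    """
    # Feature difference between the preferred and dispreferred hypotheses.
    keys = set(feats_good) | set(feats_bad)
    diff = {k: feats_good.get(k, 0.0) - feats_bad.get(k, 0.0) for k in keys}

    # Current score margin and the resulting hinge loss.
    margin = sum(weights.get(k, 0.0) * v for k, v in diff.items())
    loss = max(0.0, cost - margin)
    norm_sq = sum(v * v for v in diff.values())

    # Passive case: the margin is already satisfied, or the two
    # hypotheses are indistinguishable in feature space.
    if loss == 0.0 or norm_sq == 0.0:
        return dict(weights)

    # Aggressive case: take the smallest step that fixes the violation,
    # capped at C.
    step = min(C, loss / norm_sq)
    updated = dict(weights)
    for k, v in diff.items():
        updated[k] = updated.get(k, 0.0) + step * v
    return updated
```

Because each update touches only the features that differ between the two hypotheses, a step of this kind stays cheap even when the full feature space is very large.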
Utilizing the learning regimes developed thus far, we are able to focus on the third problem and introduce new features targeting generalization to new domains. We employ topic models to perform unsupervised domain induction, and introduce adaptation features based on probabilistic domain membership.
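One simple way to picture features based on probabilistic domain membership is sketched below, under the assumption that a topic model has already produced a posterior distribution over induced domains for the input document. Each base feature is copied once per domain and scaled by that domain's posterior probability. The function name `domain_features`, the `@topic` naming scheme, and this particular construction are hypothetical; the abstract does not specify how the thesis defines its adaptation features.

```python
def domain_features(base_features, topic_posterior):
    """Augment base features with soft, per-domain copies (illustrative only).

    base_features   -- feature name -> value for one translation unit
    topic_posterior -- list of p(domain k | document), assumed to come
                       from an unsupervised topic model
    """
    # Keep the original, domain-independent features.
    feats = dict(base_features)

    # Add one copy of each feature per induced domain, weighted by the
    # document's probability of belonging to that domain.
    for name, value in base_features.items():
        for k, prob in enumerate(topic_posterior):
            feats[f"{name}@topic{k}"] = prob * value
    return feats
```

For example, `domain_features({"lm": 2.0}, [0.25, 0.75])` keeps the original `lm` feature and adds `lm@topic0` and `lm@topic1`, which lets a learner assign each domain its own weight for that feature.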
As a final question, we look at how to take advantage of the latent derivation structure. In current SMT models, exponentially many derivations can produce the same translation, and the standard practice is to sidestep this ambiguity. In the final part of the thesis, we define a framework for latent variable models that explicitly takes advantage of all derivations in both learning and inference. We present a novel loss function for large-margin learning in that setting and develop a suitable optimization algorithm.