Fine-Grained Linguistic Soft Constraints on Statistical Natural Language Processing Models

dc.contributor.advisor: Resnik, Philip
dc.contributor.author: Marton, Yuval Yehezkel
dc.contributor.publisher: Digital Repository at the University of Maryland
dc.contributor.publisher: University of Maryland (College Park, Md.)
dc.description.abstract: This dissertation focuses on the effective combination of data-driven natural language processing (NLP) approaches with linguistic knowledge sources based on manual text annotation or on word grouping according to semantic commonalities. I apply fine-grained linguistic soft constraints -- of a syntactic or semantic nature -- to statistical NLP models, evaluated in end-to-end state-of-the-art statistical machine translation (SMT) systems. The introduction of semantic soft constraints involves intrinsic evaluation on word-pair similarity ranking tasks, extension from words to phrases, application in a novel distributional paraphrase generation technique, and the introduction of a generalized framework of which these semantic and syntactic soft constraints can be viewed as instances, and in which they can potentially be combined. <italic>Fine granularity</italic> is, in many cases, key to the successful combination of these soft constraints. I show how to softly constrain SMT models by adding fine-grained weighted features, each preferring the translation of only a specific syntactic constituent; previous attempts using coarse-grained features yielded negative results. I also show how to softly constrain corpus-based semantic models of words (“distributional profiles”) to effectively create word-sense-aware models, using the semantic word grouping information found in a manually compiled thesaurus; previous attempts, which used hard constraints and resulted in aggregated, coarse-grained models, yielded lower gains. A <italic>novel paraphrase generation technique</italic> incorporating these soft semantic constraints, based on the Distributional Hypothesis, is then also evaluated in an SMT system. The main advantage of this technique over current “pivoting” techniques for paraphrasing is its independence from parallel texts, which are a limited resource.
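The dissertation defines its models precisely; purely as a rough illustration of the distributional-profile idea mentioned above, the following sketch builds sparse co-occurrence profiles from a tokenized corpus and compares them with cosine similarity, the standard way to rank candidate paraphrases under the Distributional Hypothesis. All function names and the window size are hypothetical choices, not the dissertation's actual implementation.

```python
from collections import Counter
from math import sqrt

def distributional_profile(target, corpus_sentences, window=3):
    """Sparse co-occurrence profile for `target`: counts of words seen
    within `window` tokens of each occurrence (a toy distributional model)."""
    profile = Counter()
    for tokens in corpus_sentences:
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        profile[tokens[j]] += 1
    return profile

def cosine(p, q):
    """Cosine similarity between two sparse profiles, in [0, 1]."""
    dot = sum(v * q[w] for w, v in p.items() if w in q)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0
```

A word-sense-aware variant, in the spirit described above, would split or reweight a profile according to the thesaurus category of each occurrence rather than aggregating all senses into one profile.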
The evaluation is done by augmenting translation models with paraphrase-based translation rules, where fine-grained scoring of the paraphrase-based rules yields significantly higher gains. The model augmentation includes a novel <italic>semantic reinforcement component</italic>: in many cases there are alternative paths for generating a paraphrase-based translation rule, and each such path reinforces a dedicated score for the “goodness” of the new translation rule. This augmented score is then used as a soft constraint, in a weighted log-linear feature, letting the translation model learn how much to “trust” the paraphrase-based translation rules. The work reported here is the first to use distributional semantic similarity measures to improve the performance of an end-to-end phrase-based SMT system. The unified framework for statistical NLP models with soft linguistic constraints enables, in principle, the combination of both semantic and syntactic constraints -- and potentially other constraints as well -- in a single SMT model.
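A weighted log-linear feature of the kind described above can be sketched as follows. This is a generic illustration of log-linear scoring in phrase-based SMT, not the dissertation's system: the feature names (including the paraphrase "goodness" feature) and weights are hypothetical, and in a real decoder the weights would be tuned, e.g. by minimum error rate training.

```python
from math import exp

def loglinear_score(features, weights):
    """Log-linear model score exp(sum_k w_k * f_k) for one translation rule.
    A paraphrase-derived 'goodness' score enters as just one weighted
    feature among the others, acting as a soft constraint."""
    return exp(sum(weights[k] * v for k, v in features.items() if k in weights))
```

With a positive weight on the hypothetical `paraphrase_goodness` feature, rules reachable via more reinforcing paraphrase paths receive higher scores, while the learned weight controls how much the model "trusts" such rules overall.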
dc.subject.pqcontrolled: Language, Linguistics
dc.subject.pqcontrolled: Computer Science
dc.subject.pquncontrolled: computational linguistics
dc.subject.pquncontrolled: paraphrase generation
dc.subject.pquncontrolled: semantic distance
dc.subject.pquncontrolled: soft constraints
dc.subject.pquncontrolled: statistical machine translation
dc.title: Fine-Grained Linguistic Soft Constraints on Statistical Natural Language Processing Models
Original bundle: 1 file, 1.65 MB, Adobe Portable Document Format