An Investigation of the Relationship Between Automated Machine Translation Evaluation Metrics and User Performance on an Information Extraction Task
Date
2007-12-04
Authors
Tate, Calandra Rilette
Advisor
Slud, Eric V
Dorr, Bonnie J
Abstract
This dissertation applies nonparametric statistical techniques to Machine Translation
(MT) Evaluation, using data from an MT evaluation experiment conducted
through a joint Army Research Laboratory (ARL) and Center for the Advanced
Study of Language (CASL) project. In particular, the relationship between human
task performance on an information extraction task with translated documents and
well-known automated translation evaluation metric scores for those documents is
studied. Findings from a correlation analysis of the connection between autometrics
and task-based metrics are presented and contrasted with current strategies for
evaluating translations. A novel idea for assessing partial rank correlation within
the presence of grouping factors is also introduced. Lastly, this dissertation presents
a framework for task-based machine translation (MT) evaluation and predictive
modeling of task responses that gives new information about the relative predictive strengths
of the different autometrics (and re-coded variants of them) within the statistical Generalized
Linear Models developed in analyses of the Information Extraction Task data.
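As an illustration of the kind of within-group rank-correlation analysis described above, the sketch below computes Kendall's tau between an autometric score and task performance separately within each level of a grouping factor and then averages the results. The column names (`group`, `metric`, `task_score`) and the simple averaging step are hypothetical choices for illustration, not the dissertation's exact procedure.

```python
# A minimal sketch (not the dissertation's exact procedure) of rank
# correlation computed within levels of a grouping factor. Assumes a
# pandas DataFrame with hypothetical columns: "group" (e.g., document or
# MT system), "metric" (an autometric score), and "task_score" (human
# performance on the information extraction task).
import pandas as pd
from scipy.stats import kendalltau

def within_group_rank_correlation(df: pd.DataFrame) -> float:
    """Average Kendall's tau between metric and task score across groups."""
    taus = []
    for _, sub in df.groupby("group"):
        if len(sub) > 2:  # need several score pairs per group
            tau, _ = kendalltau(sub["metric"], sub["task_score"])
            taus.append(tau)
    return sum(taus) / len(taus)
```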
This work shows that current autometrics are inadequate for predicting task
performance, but that near adequacy can be achieved through the use of recoded
autometrics in a logistic regression setting. As a result, a class of
automated metrics best suited for predicting performance is established,
and suggestions are offered about how to use such metrics to supplement expensive
and time-consuming experiments with human participants. Users can now begin to
tie intrinsic automated metrics to the extrinsic metrics of the tasks they perform.
The bottom line is that there is a need to average away MT dependence: averaged
metrics perform better in overall predictions than the original autometrics. Moreover,
combinations of recoded metrics performed better than any individual metric. Ultimately,
MT evaluation methodology is extended to create new metrics specifically
relevant to task-based comparisons. A formal method for establishing that differences
among metrics as predictors are strong enough not to be due to chance remains
future work.
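As a concrete illustration of the logistic-regression setting referred to above, the sketch below fits a binomial GLM of a binary task response on a recoded (binned) autometric score using statsmodels. The quartile recoding and the column names (`metric`, `correct`) are assumptions for illustration, not the recodings actually used in the analyses.

```python
# A hedged sketch of predicting a binary task response from a recoded
# autometric with a binomial GLM; the quartile binning and column names
# ("metric", "correct") are illustrative assumptions.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def fit_task_model(df: pd.DataFrame):
    """Fit a logistic (binomial GLM) model of task success on a binned metric."""
    data = df.copy()
    # Recode the continuous autometric score into quartile bins.
    data["metric_bin"] = pd.qcut(data["metric"], q=4, labels=False)
    # "correct" is 1 if the extraction item was answered correctly, else 0.
    model = smf.glm("correct ~ C(metric_bin)", data=data,
                    family=sm.families.Binomial())
    return model.fit()

# Example usage: print(fit_task_model(df).summary())
```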
Given the lack of connection in the field of MT Evaluation between task utility
and the interpretation of automated evaluation metrics, as well as the absence of
solid statistical reasoning in evaluating MT, there is a need to bring innovative and
interdisciplinary analytical techniques to this problem. Because no prior work
in the MT evaluation literature has undertaken statistical modeling of this kind or
linked automated metrics to how well MT supports human tasks, this work
is unique and has high potential to benefit the Machine Translation research community.