Center for Advanced Study of Language Research Works
Browsing Center for Advanced Study of Language Research Works by Subject "annotation costs"
Now showing 1 - 5 of 5
Item: An Approach to Reducing Annotation Costs for BioNLP (Association for Computational Linguistics, 2008-06). Bloodgood, Michael; Vijay-Shanker, K.
There is a broad range of BioNLP tasks for which active learning (AL) can significantly reduce annotation costs, and a specific AL algorithm we have developed is particularly effective in reducing annotation costs for these tasks. We have previously developed an AL algorithm called ClosestInitPA that works best for tasks with the following characteristics: redundancy in the training material, burdensome annotation costs, good performance of Support Vector Machines (SVMs) on the task, and imbalanced datasets (i.e., when set up as a binary classification problem, one class is substantially rarer than the other). Many BioNLP tasks have these characteristics, and thus our AL algorithm is a natural approach to apply to them. (See the selection sketch following this listing.)

Item: Bucking the Trend: Large-Scale Cost-Focused Active Learning for Statistical Machine Translation (Association for Computational Linguistics, 2010-07). Bloodgood, Michael; Callison-Burch, Chris.
We explore how to improve machine translation systems by adding more translation data in situations where we already have substantial resources. The main challenge is how to buck the trend of diminishing returns that is commonly encountered. We present an active learning-style data solicitation algorithm to meet this challenge. We test it, gathering annotations via Amazon Mechanical Turk, and find that we achieve an order-of-magnitude increase in the rate of performance improvement.

Item: A Method for Stopping Active Learning Based on Stabilizing Predictions and the Need for User-Adjustable Stopping (Association for Computational Linguistics, 2009-06). Bloodgood, Michael; Vijay-Shanker, K.
A survey of existing methods for stopping active learning (AL) reveals the need for methods that are more widely applicable, more aggressive in saving annotations, and more stable across changing datasets. A new method for stopping AL based on stabilizing predictions is presented that addresses these needs. Furthermore, stopping methods are required to handle a broad range of annotation/performance tradeoff valuations. Despite this, the existing body of work is dominated by conservative methods, with little (if any) attention paid to providing users with control over the behavior of stopping methods. The proposed method is shown to fill a gap in the level of aggressiveness available for stopping AL and supports providing users with control over stopping behavior. (See the stopping-criterion sketch following this listing.)

Item: Taking into Account the Differences between Actively and Passively Acquired Data: The Case of Active Learning with Support Vector Machines for Imbalanced Datasets (Association for Computational Linguistics, 2009-06). Bloodgood, Michael; Vijay-Shanker, K.
Actively sampled data can have very different characteristics than passively sampled data. It is therefore promising to investigate using different inference procedures during AL than are used during passive learning (PL). This general idea is explored in detail for the focused case of AL with cost-weighted SVMs for imbalanced data, a situation that arises for many HLT tasks. The key idea behind the proposed InitPA method for addressing imbalance is to base cost models during AL on an estimate of overall corpus imbalance computed from a small unbiased sample, rather than on the imbalance in the labeled training data, which is the standard approach during PL. (See the cost-weighting sketch following this listing.)

Item: Using Mechanical Turk to Build Machine Translation Evaluation Sets (Association for Computational Linguistics, 2010-06). Bloodgood, Michael; Callison-Burch, Chris.
Building machine translation (MT) test sets is a relatively expensive task. As MT becomes increasingly desired for more and more language pairs and more and more domains, it becomes necessary to build test sets for each case. In this paper, we investigate using Amazon's Mechanical Turk (MTurk) to make MT test sets cheaply. We find that MTurk can be used to produce test sets at much lower cost than professionally produced test sets. More importantly, in experiments with multiple MT systems, we find that the MTurk-produced test sets yield essentially the same conclusions regarding system performance as the professionally produced test sets.
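The first abstract names the ClosestInitPA algorithm but does not spell out its selection step. Assuming, as the name suggests, that "Closest" refers to the common practice of querying the unlabeled examples closest to the SVM hyperplane, one AL selection round might look like the minimal sketch below. The scikit-learn setup, function names, and batch size are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of closest-to-hyperplane selection for active
# learning with an SVM (an assumption about what "Closest" means in
# ClosestInitPA, not the paper's code). Batch size is arbitrary.
import numpy as np
from sklearn.svm import SVC

def select_closest(model, X_unlabeled, batch_size=20):
    """Return indices of the unlabeled points nearest the hyperplane,
    i.e. the points the current SVM is least certain about."""
    distances = np.abs(model.decision_function(X_unlabeled))
    return np.argsort(distances)[:batch_size]

def active_learning_round(X_labeled, y_labeled, X_unlabeled):
    # Retrain on the current labeled pool, then pick the next batch
    # of examples to send to annotators.
    model = SVC(kernel="linear").fit(X_labeled, y_labeled)
    return select_closest(model, X_unlabeled)
```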
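For the stabilizing-predictions stopping method, the abstract does not state the agreement measure here; a natural instantiation checks agreement between successive models' predictions on a fixed, unlabeled "stop set" and stops once agreement stabilizes. The sketch below assumes Cohen's kappa as the agreement metric, with the threshold and window size serving as the user-adjustable knobs the abstract calls for; the function name and default values are illustrative.

```python
# Minimal sketch of a "stabilizing predictions" stopping check,
# assuming Cohen's kappa measures agreement between consecutive
# models' predictions on a fixed stop set. Defaults are illustrative.
from sklearn.metrics import cohen_kappa_score

def should_stop(stop_set_predictions, threshold=0.99, window=3):
    """stop_set_predictions: list of label arrays, one per AL
    iteration, each holding that iteration's model predictions on
    the same unlabeled stop set.

    Lowering `threshold` makes stopping more aggressive (fewer
    annotations, possibly lower final accuracy), which is one way to
    expose the user-adjustable tradeoff the abstract argues for.
    """
    if len(stop_set_predictions) < window + 1:
        return False  # not enough models yet to judge stability
    recent = stop_set_predictions[-(window + 1):]
    kappas = [cohen_kappa_score(a, b) for a, b in zip(recent, recent[1:])]
    return sum(kappas) / len(kappas) >= threshold
```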
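The InitPA abstract does describe its key idea: during AL, derive the SVM's cost model from an estimate of overall corpus imbalance taken from a small unbiased (e.g., random) sample, rather than from the imbalance of the actively selected training data, which over-represents the rare class. A minimal sketch of that idea follows; the scikit-learn setup and the inverse-frequency weighting scheme are assumptions for illustration, not the paper's exact cost model.

```python
# Minimal sketch of the key InitPA idea as stated in the abstract:
# base class costs on corpus imbalance estimated from a small random
# sample, not on the (biased) actively sampled training set. Assumes
# binary labels in {0, 1}; the weighting scheme is an assumption.
import numpy as np
from sklearn.svm import SVC

def imbalance_weighted_svm(X_train, y_train, unbiased_sample_labels):
    # Estimate the positive-class rate from the small unbiased sample;
    # the actively selected training data would give a misleading rate.
    pos_rate = np.mean(np.asarray(unbiased_sample_labels) == 1)
    # Penalize errors on each class in inverse proportion to its
    # estimated frequency, so the rare class is not swamped.
    class_weight = {1: 1.0 / max(pos_rate, 1e-6),
                    0: 1.0 / max(1.0 - pos_rate, 1e-6)}
    return SVC(kernel="linear", class_weight=class_weight).fit(X_train, y_train)
```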