Taking into Account the Differences between Actively and Passively Acquired Data: The Case of Active Learning with Support Vector Machines for Imbalanced Datasets

Authors: Bloodgood, Michael; Vijay-Shanker, K
Publisher: Association for Computational Linguistics
Year: 2009
Type: Article
Language: en-US
Keywords: computer science; statistical methods; artificial intelligence; machine learning; computational linguistics; natural language processing; human language technology; text processing; active learning; selective sampling; query learning; binary classification; annotation bottleneck; annotation costs; support vector machines; SVMs; cost-weighted support vector machines; cost-weighted SVMs; imbalanced data; imbalanced datasets; corpus imbalance; imbalanced learning; asymmetric cost factors; asymmetric cost weights; positive amplification; cost models; cost modeling; cost-sensitive learning; cost-sensitive active learning; relation extraction; BioNLP; biomedical natural language processing; biomedical text processing; protein-protein interaction extraction; text classification; newswire text classification

Abstract: Actively sampled data can have very different characteristics from passively sampled data. Therefore, it’s promising to investigate using different inference procedures during active learning (AL) than are used during passive learning (PL). This general idea is explored in detail for the focused case of AL with cost-weighted SVMs for imbalanced data, a situation that arises for many human language technology (HLT) tasks. The key idea behind the proposed InitPA method for addressing imbalance is to base cost models during AL on an estimate of overall corpus imbalance computed via a small unbiased sample, rather than on the imbalance in the labeled training data, which is the leading method used during PL.
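The contrast drawn in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not state: labels are encoded as +1/-1, the cost model is a single positive-class weight set to the ratio of negatives to positives, and all names (`imbalance_ratio`, `cost_weights`, the toy label lists) are hypothetical. It is not the authors' implementation; it only shows why the two estimates can diverge once actively selected examples enter the labeled set.

```python
from typing import Dict, Sequence


def imbalance_ratio(labels: Sequence[int]) -> float:
    """Return (#negatives / #positives) for labels encoded as +1 / -1."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == -1)
    if positives == 0:
        raise ValueError("need at least one positive example to estimate imbalance")
    return negatives / positives


def cost_weights(labels: Sequence[int]) -> Dict[int, float]:
    """Assumed cost model: weight positives by the negative-to-positive ratio."""
    return {1: imbalance_ratio(labels), -1: 1.0}


# Hypothetical toy data: a small unbiased (random) sample of the corpus ...
initial_random_sample = [1] + [-1] * 9
# ... and examples later chosen by active learning, which tend to be
# less imbalanced than the corpus as a whole.
actively_selected = [1, 1, 1, -1, -1, -1, -1]
labeled_so_far = initial_random_sample + actively_selected

# InitPA-style estimate: use only the unbiased initial sample.
initpa_weights = cost_weights(initial_random_sample)
# Passive-learning-style estimate: use whatever is currently labeled.
current_ratio_weights = cost_weights(labeled_so_far)

print("weights from initial sample:", initpa_weights)
print("weights from labeled data:  ", current_ratio_weights)
```

In this toy setting the weight derived from the current labeled data drifts away from the corpus-level estimate as active selection skews the class distribution, which is the effect the abstract argues a cost model should account for. The resulting dictionary could be passed to any cost-weighted SVM implementation that accepts per-class weights.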