A random forest system combination approach for error detection in digital dictionaries

dc.contributor.author: Bloodgood, Michael
dc.contributor.author: Ye, Peng
dc.contributor.author: Rodrigues, Paul
dc.contributor.author: Zajic, David
dc.contributor.author: Doermann, David
dc.date.accessioned: 2014-06-19T00:50:21Z
dc.date.available: 2014-06-19T00:50:21Z
dc.date.issued: 2012-04-23
dc.description.abstract: When digitizing a print bilingual dictionary, whether via optical character recognition or manual entry, it is inevitable that errors are introduced into the electronic version that is created. We investigate automating the process of detecting errors in an XML representation of a digitized print dictionary using a hybrid approach that combines rule-based, feature-based, and language model-based methods. We investigate combining methods and show that using random forests is a promising approach. We find that in isolation, unsupervised methods rival the performance of supervised methods. Random forests typically require training data so we investigate how we can apply random forests to combine individual base methods that are themselves unsupervised without requiring large amounts of training data. Experiments reveal empirically that a relatively small amount of data is sufficient and can potentially be further reduced through specific selection criteria. (en_US)
dc.identifier.citation: Michael Bloodgood, Peng Ye, Paul Rodrigues, David Zajic, and David Doermann. A random forest system combination approach for error detection in digital dictionaries. In Proceedings of the EACL Workshop on Innovative Hybrid Approaches to the Processing of Textual Data (Hybrid2012), pages 78-86. Association for Computational Linguistics, 2012. (en_US)
dc.identifier.uri: http://hdl.handle.net/1903/15085
dc.language.iso: en_US (en_US)
dc.publisher: Association for Computational Linguistics (en_US)
dc.relation.isAvailableAt: Center for Advanced Study of Language
dc.relation.isAvailableAt: Digital Repository at the University of Maryland
dc.relation.isAvailableAt: University of Maryland (College Park, Md)
dc.subject: computer science (en_US)
dc.subject: artificial intelligence (en_US)
dc.subject: machine learning (en_US)
dc.subject: human language technology (en_US)
dc.subject: natural language processing (en_US)
dc.subject: system combination (en_US)
dc.subject: hybrid systems (en_US)
dc.subject: random forests (en_US)
dc.subject: error detection (en_US)
dc.subject: electronic lexicography (en_US)
dc.subject: digital dictionaries (en_US)
dc.subject: bilingual dictionaries (en_US)
dc.subject: computational linguistics
dc.title: A random forest system combination approach for error detection in digital dictionaries (en_US)
dc.type: Article (en_US)

Files

Original bundle

Name: randomForestHybrid2012.pdf
Size: 243.2 KB
Format: Adobe Portable Document Format