University of Maryland Libraries
Digital Repository at the University of Maryland

    A random forest system combination approach for error detection in digital dictionaries

    randomForestHybrid2012.pdf (243.2Kb)
    No. of downloads: 335

    Date
    2012-04-23
    Author
    Bloodgood, Michael
    Ye, Peng
    Rodrigues, Paul
    Zajic, David
    Doermann, David
    Citation
    Michael Bloodgood, Peng Ye, Paul Rodrigues, David Zajic, and David Doermann. A random forest system combination approach for error detection in digital dictionaries. In Proceedings of the EACL Workshop on Innovative Hybrid Approaches to the Processing of Textual Data (Hybrid2012), pages 78-86. Association for Computational Linguistics, 2012.
    Abstract
    When digitizing a print bilingual dictionary, whether via optical character recognition or manual entry, it is inevitable that errors are introduced into the electronic version that is created. We investigate automating the process of detecting errors in an XML representation of a digitized print dictionary using a hybrid approach that combines rule-based, feature-based, and language model-based methods. We investigate combining methods and show that using random forests is a promising approach. We find that in isolation, unsupervised methods rival the performance of supervised methods. Random forests typically require training data, so we investigate how we can apply random forests to combine individual base methods that are themselves unsupervised without requiring large amounts of training data. Experiments reveal empirically that a relatively small amount of data is sufficient and can potentially be further reduced through specific selection criteria.
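The combination idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the three detector scores, the training rows, and all function names are hypothetical, and depth-1 trees (stumps) stand in for full decision trees to keep the example short. It shows only the general shape of the approach: unsupervised base methods each emit a score per dictionary entry, and a small labeled set trains a forest that votes on whether an entry contains an error.

```python
import random

# Hypothetical training data: each row holds suspicion scores in [0, 1] from
# three assumed base detectors (rule-based, feature-based, language-model-based);
# label 1 marks an entry with an error, 0 a clean entry.
ROWS = [
    (0.90, 0.80, 0.70), (0.80, 0.90, 0.85), (0.75, 0.70, 0.90), (0.95, 0.85, 0.80),
    (0.10, 0.20, 0.30), (0.20, 0.10, 0.15), (0.30, 0.25, 0.10), (0.15, 0.30, 0.20),
]
LABELS = [1, 1, 1, 1, 0, 0, 0, 0]


def train_stump(rows, labels, feature_ids):
    """Fit a depth-1 tree: choose the (feature, threshold) with the fewest errors."""
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            # The stump flags an entry as an error when its score is >= t.
            errors = sum((r[f] >= t) != (y == 1) for r, y in zip(rows, labels))
            if best is None or errors < best[0]:
                best = (errors, f, t)
    return best[1], best[2]


def train_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    """Bagging plus random feature subsets, with stumps standing in for full trees."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    while len(forest) < n_trees:
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        if len({labels[i] for i in idx}) < 2:
            continue  # redraw if the sample contains only one class
        feats = rng.sample(range(len(rows[0])), n_features)
        forest.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over all stumps: 1 = likely error, 0 = likely clean."""
    votes = sum(row[f] >= t for f, t in forest)
    return int(2 * votes > len(forest))


forest = train_forest(ROWS, LABELS)
print(predict(forest, (0.99, 0.99, 0.99)))  # 1
print(predict(forest, (0.05, 0.05, 0.05)))  # 0
```

Only eight labeled rows are used here, which loosely mirrors the abstract's point that the combiner may need relatively little training data; the made-up numbers do not support any claim about how much data the real system requires.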
    URI
    http://hdl.handle.net/1903/15085
    Collections
    • Center for Advanced Study of Language Research Works

    DRUM is brought to you by the University of Maryland Libraries
    University of Maryland, College Park, MD 20742-7011 (301)314-1328.
    Please send us your comments.
    Web Accessibility
     

     
