Efficient Machine Learning Methods for Document Image Analysis

Thumbnail Image


Publication or External Link






With the exponential growth in volume of multimedia content on the internet, there has been an increasing interest for developing more efficient and scalable algorithms to learn directly from data without excessive restrictions on nature of the content. In the context of document images, many large scale digitization projects have called for reliable and scalable triage methods for enhancement, segmentation, grouping and categorization of captured images. Current approaches, however, are typically limited to a specific class of documents such as scanned books, newspapers, journal articles or forms for example, and analysis and processing of more unconstrained and noisy heterogeneous document collections has not been as widely addressed. Additionally, existing machine-learning based approaches for document processing need to be carefully applied to handle the challenges associated with large and imbalanced training data.

In this thesis, we address these challenges in three primary applications of document image analysis - low-level document enhancement, mid-level handwritten line segmentation, and high-level classification and retrieval. We first present a data selection method for training Support Vector Machines (SVM) on large-scale data sets. We apply the proposed approach to pixel-level document image enhancement, and show promising results with a relatively small number of training samples. Second, we present a graph-based method for segmentation of handwritten document images into text-lines which is more efficient and adaptive than previous approaches. Our approach demonstrates that combining results from local and global methods enhances the final performance of text-line segmentation. Third, we present an approach to compute structural similarities between images for classification and retrieval. Results on real-world data sets show that the approach is more effective than earlier approaches when the labeled data is limited. We extend our classification approach to a completely unsupervised setting, where both the number of classes and representative samples from each class is assumed to be unknown. We present a method for computing similarities based on learned structural patterns and correlations from the given data. Experiments with four different data sets show that our approach can estimate number of classes in large document collections and group structurally similar images with a high-accuracy.