Computer Vision for Scene Text Analysis

The motivation of this dissertation is to develop a 'Seeing-Eye' video-based interface that gives the visually impaired access to environmental text information. We are concerned with the daily activities of low-vision people that involve interpreting 'environmental text' or 'scene text', e.g., reading a newspaper, can labels, and street signs.

First, we discuss the development of such a video-based interface. In this interface, the processed image of a scene text is read by off-the-shelf OCR and converted to speech by Text-to-Speech (TTS) software. Our challenge is to feed a high-quality image of a scene text to off-the-shelf OCR software under a general pose of the surface on which the text is printed. To achieve this, various problems related to feature detection, mosaicing, auto-focus, zoom, and systems integration were solved in the development of the system, and these are described.

We employ the video-based interface for the analysis of video of lectures/posters. In this application, the text is assumed to be on a plane. Automatic analysis of the video content requires additional modules such as enhancement, text segmentation, preprocessing of video content, and metric rectification. We provide qualitative results to justify the algorithm and system integration.
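
As an illustration of rectifying text that lies on a plane, the sketch below estimates a homography from four point correspondences and maps image points into a fronto-parallel frame. It is a minimal sketch under stated assumptions, not the dissertation's implementation: the four corners of the text region are assumed to be already detected, the target rectangle size is chosen arbitrarily, and the names `solve_linear`, `estimate_homography`, and `apply_homography` are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def estimate_homography(src, dst):
    """Estimate H (3x3, with H[2][2] fixed to 1) so that dst ~ H * src.

    src, dst: four (x, y) pairs each. Fixing H[2][2] = 1 is an assumption
    that fails only if the true homography has a zero in that entry.
    """
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b) + [1.0]
    return [h[0:3], h[3:6], h[6:9]]


def apply_homography(h, point):
    """Map one (x, y) point through H using homogeneous coordinates."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return u, v


# Hypothetical corners of a poster as seen in the image (pixels) ...
image_corners = [(10, 20), (200, 40), (220, 300), (30, 260)]
# ... and the rectangle they should map to (size chosen for illustration).
target_corners = [(0, 0), (100, 0), (100, 150), (0, 150)]
H = estimate_homography(image_corners, target_corners)
```

With four corners, the rectified view is correct only up to the aspect ratio chosen for the target rectangle; recovering the true aspect ratio requires additional knowledge such as camera calibration or known page dimensions.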

For more general classes of surfaces that the text is printed on, such as bent or warped paper, we develop a novel method for 3D structure recovery and unwarping. Deformed paper is isometric with a plane, and the Gaussian curvature vanishes at every point on the surface. We show that these constraints lead to a closed set of equations that allow the recovery of the full geometric structure from a single image. We prove that these partial differential equations can be reduced to the Hopf equation that arises in non-linear wave propagation, and deformations of the paper can be interpreted in terms of the characteristics of this equation. A new exact integration of these equations relates the 3D structure of the surface to an image of the paper. In addition, we can generate such surfaces using the underlying equations. This method uses only information derived from the image of the boundary. Furthermore, we employ the shape-from-texture method as an alternative to the method above to infer the 3D structure. We show that, for the normal vector field to be consistent, we need to add extra conditions based on the surface model. These conditions are isometry and zero Gaussian curvature of the surface.
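
The two surface constraints named above, isometry with a plane and vanishing Gaussian curvature, can be checked numerically on a simple example. The sketch below is only an illustration, not the reconstruction method of the dissertation: it assumes a hypothetical sheet bent into a circular cylinder (`bent_sheet`), uses a sphere as a contrasting non-developable surface, and approximates the first and second fundamental forms with central finite differences.

```python
import math


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fundamental_forms(surface, u, v, h=1e-3):
    """Return ((E, F, G), (L, M, N)) at (u, v) via central differences."""
    r = surface
    r_u = [(a - b) / (2 * h) for a, b in zip(r(u + h, v), r(u - h, v))]
    r_v = [(a - b) / (2 * h) for a, b in zip(r(u, v + h), r(u, v - h))]
    r_uu = [(a - 2 * c + b) / h ** 2
            for a, c, b in zip(r(u + h, v), r(u, v), r(u - h, v))]
    r_vv = [(a - 2 * c + b) / h ** 2
            for a, c, b in zip(r(u, v + h), r(u, v), r(u, v - h))]
    r_uv = [(a - b - c + d) / (4 * h * h)
            for a, b, c, d in zip(r(u + h, v + h), r(u + h, v - h),
                                  r(u - h, v + h), r(u - h, v - h))]
    n = _cross(r_u, r_v)
    length = math.sqrt(_dot(n, n))
    n = [x / length for x in n]  # unit surface normal
    first = (_dot(r_u, r_u), _dot(r_u, r_v), _dot(r_v, r_v))
    second = (_dot(r_uu, n), _dot(r_uv, n), _dot(r_vv, n))
    return first, second


def gaussian_curvature(surface, u, v):
    """K = (L N - M^2) / (E G - F^2)."""
    (E, F, G), (L, M, N) = fundamental_forms(surface, u, v)
    return (L * N - M * M) / (E * G - F * F)


def bent_sheet(u, v, radius=2.0):
    """Flat sheet coordinates (u, v) rolled onto a cylinder (assumed shape)."""
    return (radius * math.sin(u / radius), v,
            radius * (1.0 - math.cos(u / radius)))


def sphere(u, v, radius=2.0):
    """Contrast case: a sphere cannot be flattened without stretching."""
    return (radius * math.cos(v) * math.cos(u),
            radius * math.cos(v) * math.sin(u),
            radius * math.sin(v))
```

For the bent sheet the first fundamental form stays at the planar values E = 1, F = 0, G = 1 (isometry) and the Gaussian curvature is zero up to discretization error, whereas the sphere of radius 2 gives a curvature close to 1/4.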

The theory underlying the method is novel, and it raises new open research issues in the area of 3D reconstruction from single views. The novel contributions are: first, it is shown that certain linear and non-linear cues (contour knowledge information) are sufficient to recover the 3D structure of scene text; second, that with a priori page layout information, we can reconstruct a fronto-parallel view of a deformed page from differential geometric properties of a surface; third, that with a known camera model we can recover the 3D structure of a bent surface; fourth, we present an integrated framework for analysis and rectification of scene texts from single views in general format; fifth, we provide a comparison with the shape-from-texture approach; and finally, this work can be integrated as a visual prosthesis for the visually impaired.

Our work has many applications in computer vision and computer graphics. The applications are diverse, e.g., a generalized scanning device, digital flattening of creased documents, 3D reconstruction when correspondence fails, 3D reconstruction from single old photos, bending and creasing of virtual paper, object classification, semantic extraction, and scene description.