Productive Vision: Methods for Automatic Image Comprehension

Image comprehension is the ability to summarize, translate, and answer basic questions about images. Using original techniques for scene object parsing, material labeling, and activity recognition, a system can gather information about the objects and actions in a scene. When this information is integrated into a deep knowledge base capable of inference, the system becomes capable of performing tasks that, when performed by students, are considered by educators to demonstrate comprehension.

The vision components of the system consist of the following: object scene parsing by means of visual filters, material scene parsing by superpixel segmentation and kernel descriptors, and activity recognition by action grammars. Each technique is characterized and compared with the state of the art in its field. The output of the vision components is a list of assertions in a Cyc microtheory.
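As a rough illustration of what such a list of assertions might look like, the sketch below turns hypothetical detector outputs into CycL-style strings. The `Detection` record, the predicate names other than `isa`, the microtheory name `SceneMt`, and the score threshold are all assumptions made for this example; they are not taken from the system described here and do not use any real Cyc API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One hypothetical output of a vision component."""
    region_id: str   # identifier of the image region
    label: str       # predicted class name
    kind: str        # "object", "material", or "activity"
    score: float     # detector confidence in [0, 1]


# Illustrative mapping from detection kind to a predicate name (assumed).
PREDICATES = {
    "object": "isa",
    "material": "madeOf",
    "activity": "depictsActivity",
}


def to_assertions(detections, microtheory="SceneMt", min_score=0.5):
    """Format confident detections as CycL-style assertion strings."""
    assertions = []
    for d in detections:
        if d.score < min_score:
            continue  # drop low-confidence detections
        predicate = PREDICATES[d.kind]
        assertions.append(
            f"(ist {microtheory} ({predicate} {d.region_id} {d.label}))"
        )
    return assertions
```

Keeping the assertions as plain strings keeps the sketch independent of any particular knowledge-base client; a real system would hand them to the knowledge base through its own interface.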

By reasoning on these assertions and the rest of the Cyc knowledge base, the system is able to perform a variety of tasks, including the following:

Recognize that essential parts of objects are likely present in the scene despite not having an explicit detector for them.

Recognize the likely presence of objects due to the presence of their essential parts.

Improve estimates of both object and material labels by incorporating knowledge about the typical pairings.

Label ambiguous objects with a more general label that encompasses both possible labelings.

Answer questions about the scene that require inference and give justifications for the answers in natural language.

Create a visual representation of the scene in a new medium.

Recognize scene similarity even when there is little visual similarity.
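
Three of the tasks above (inferring parts from objects, inferring objects from parts, and falling back to a more general label) can be sketched with a toy knowledge base. This is a minimal sketch under stated assumptions: the `ESSENTIAL_PARTS` and `PARENT` tables and all function names are hypothetical stand-ins for the Cyc knowledge base and its inference engine, and the rule that an object is implied only when all of its essential parts are present is a simplification chosen for the example.

```python
# Hypothetical toy knowledge: essential parts and a single-parent taxonomy.
ESSENTIAL_PARTS = {
    "Car": {"Wheel", "Engine"},
    "Bicycle": {"Wheel", "Handlebar"},
}
PARENT = {
    "Car": "Vehicle",
    "Bicycle": "Vehicle",
    "Vehicle": "Artifact",
    "Chair": "Furniture",
    "Furniture": "Artifact",
}


def implied_parts(objects):
    """Parts likely present because an object that requires them was detected."""
    parts = set()
    for obj in objects:
        parts |= ESSENTIAL_PARTS.get(obj, set())
    return parts


def implied_objects(parts):
    """Objects whose essential parts are all among the detected parts."""
    present = set(parts)
    return {obj for obj, needed in ESSENTIAL_PARTS.items() if needed <= present}


def ancestors(label):
    """The label followed by its increasingly general ancestors."""
    chain = [label]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def generalize(label_a, label_b):
    """Most specific label covering both candidates, or None if there is none."""
    covering_b = set(ancestors(label_b))
    for candidate in ancestors(label_a):
        if candidate in covering_b:
            return candidate
    return None
```

For instance, a region that could be either a `Car` or a `Bicycle` would be labeled `Vehicle` by `generalize`, while detecting a `Car` would imply a `Wheel` and an `Engine` even without detectors for those parts.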