Computational Mid-Level Vision: From Border Ownership to Categorical Object Recognition

Since it was proposed in 1890 by Christian von Ehrenfels, Gestalt psychology has remained a key school of thought that explains how one perceives the world ("the whole") from the sum of its individual components ("the parts") or processes. These processes are aptly summarized in the well-known "Rules of Gestalt". Although Gestalt psychology has been influential in other fields, the empirical nature of its rules impedes their widespread adoption in Computer Science. This thesis bridges this apparent divide by making Mid-level Vision, or Computer Vision based on Gestalt rules, not only computationally feasible but also practical for real applications. We address the general problem of figure-ground organization, where the goal is to separate the foreground (or object) from the background. To do this, we first formulate a fast approach that pairs Structured Random Forests (SRFs) with Gestalt-like features for both boundary detection and border ownership assignment. We then show how border ownership information is useful for shape-based recognition of object categories. This is done by embedding ownership information into the image torque, a grouping operator that detects closure patterns in image edges, so that the operator can be modulated efficiently to detect class-specific contours under clutter and occlusion. Next, we show how symmetry, an important shape-based regularity in Gestalt psychology, can be detected in clutter and used to guide the segmentation of symmetric foreground regions. Besides shape and symmetry, functionality is another important mid-level cue that supports categorical object recognition. Based on Gibson's principle of affordance, we introduce a fast technique, based on an SRF trained with geometric features, that provides pixel-accurate affordances of tool parts.
Finally, we describe as future work how language can be exploited to "activate" such mid-level processes, so that a joint semantic space linking visual concepts to language can be obtained to solve even more challenging problems in Computer Vision, effectively reducing the so-called "semantic gap" between these two related domains.