Context Driven Scene Understanding

Understanding objects in complex scenes is a fundamental and challenging problem in computer vision. Given an image, we would like to determine whether an object of a particular category is present, where it is, and, if possible, localize it with a bounding box or pixel-wise labels. In this dissertation, we present context-driven approaches that leverage relationships between objects in the scene to improve both the accuracy and efficiency of scene understanding.

In the first part, we describe an approach that jointly solves the segmentation and recognition problems using a multiple-segmentation framework with context. Our approach formulates a cost function based on contextual information in conjunction with appearance matching. A relaxed form of this cost function is minimized using an efficient quadratic programming solver, and an approximate solution is obtained by discretizing the relaxed solution. Our approach improves labeling performance compared to other segmentation-based recognition approaches.
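The relax-then-discretize idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the dissertation: the cost form (a unary appearance cost `U` plus a pairwise context cost `C` weighted by a segment adjacency matrix `W`), the function names, and the use of simple projected gradient descent in place of the quadratic programming solver the abstract mentions.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def relax_and_discretize(U, W, C, steps=300, lr=0.1):
    """Hypothetical labeling sketch (not the author's solver).

    U: (n, k) appearance cost of giving segment i label a.
    W: (n, n) adjacency weights between segments.
    C: (k, k) context cost of labels a and b on adjacent segments.
    Relaxed cost: sum(U * X) + sum(W * (X @ C @ X.T)),
    with every row of X constrained to the simplex.
    """
    n, k = U.shape
    X = np.full((n, k), 1.0 / k)  # start from uniform soft labels
    for _ in range(steps):
        grad = U + W @ X @ C.T + W.T @ X @ C
        X = np.apply_along_axis(project_to_simplex, 1, X - lr * grad)
    # Discretize the relaxed solution: one hard label per segment.
    return X.argmax(axis=1), X

# Toy data: segment 1 weakly prefers label 1 by appearance alone,
# but it is adjacent to segment 0, which strongly prefers label 0.
U = np.array([[0.0, 1.0], [0.55, 0.45], [1.0, 0.0]])
W = np.zeros((3, 3))
W[0, 1] = W[1, 0] = 1.0
C = np.array([[0.0, 1.0], [1.0, 0.0]])  # penalize disagreeing neighbors
labels, soft = relax_and_discretize(U, W, C)
```

In this toy setting, the context term pulls the ambiguous segment toward its neighbor's label, whereas setting `W` to zero leaves each segment to be labeled by appearance alone.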

In the second part, we introduce a new problem called object co-labeling, where the goal is to jointly annotate multiple images of the same scene that do not have temporal consistency. We present an adaptive framework for joint segmentation and recognition to solve this problem. We propose an objective function that considers not only the appearance within each image but also the consistency of appearance and context across images of the scene. A relaxed form of the cost function is minimized using an efficient quadratic programming solver. Our approach improves labeling performance compared to labeling each image individually. We also show how our co-labeling framework applies to other recognition problems, such as label propagation in videos and object recognition in similar scenes.

In the third part, we propose a novel general strategy for simultaneous object detection and segmentation. Instead of passively evaluating all object detectors at all possible locations in an image, we develop a divide-and-conquer approach that actively and sequentially evaluates contextual cues related to the query, based on the scene and previous evaluations, to decide where to search for the object, much like playing a game of "20 Questions". The questions are dynamically selected based on the query, the scene, and the responses observed so far from object detectors and classifiers. We first present an efficient object search policy based on the information gain of asking a question. We formulate the policy in a probabilistic framework that integrates current information and observations to update the model and determine the most informative action to take next. We further enrich the power and generalization capacity of the Twenty Questions strategy by learning the policy from data. We formulate the problem as a Markov Decision Process and learn a search policy by imitation learning.
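As a rough illustration of an information-gain search policy like the one described above, the sketch below keeps a belief over a handful of candidate object locations and picks the yes/no question whose answer is expected to reduce the entropy of that belief the most. The discrete location grid, the per-question likelihood vectors `p_yes`, and all function names are hypothetical choices for this example. They are not the dissertation's actual model.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def posterior(belief, p_yes, answer):
    """Bayes update of the location belief after observing an answer.

    p_yes[i] is the assumed probability of a 'yes' response
    when the object is at location i.
    """
    like = p_yes if answer else 1.0 - p_yes
    post = belief * like
    return post / post.sum()

def expected_information_gain(belief, p_yes):
    """Prior entropy minus the expected posterior entropy."""
    p_ans_yes = float((belief * p_yes).sum())
    gain = entropy(belief)
    for answer, p_ans in ((True, p_ans_yes), (False, 1.0 - p_ans_yes)):
        if p_ans > 0:  # skip answers that cannot occur
            gain -= p_ans * entropy(posterior(belief, p_yes, answer))
    return gain

def select_question(belief, questions):
    """Return the index of the most informative question and all gains."""
    gains = [expected_information_gain(belief, q) for q in questions]
    return int(np.argmax(gains)), gains

# Toy example: four candidate locations, three hypothetical questions.
belief = np.full(4, 0.25)
questions = [
    np.array([1.0, 1.0, 0.0, 0.0]),  # splits the candidates in half
    np.array([1.0, 0.0, 0.0, 0.0]),  # singles out one candidate
    np.array([0.5, 0.5, 0.5, 0.5]),  # uninformative
]
best, gains = select_question(belief, questions)
belief = posterior(belief, questions[best], answer=True)
```

Repeating the select-ask-update loop until the belief is sufficiently peaked gives a greedy search procedure. The learned policy mentioned in the abstract would replace this greedy rule with one trained from data.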