Electrical & Computer Engineering Theses and Dissertations
Permanent URI for this collection: http://hdl.handle.net/1903/2765
2 results
Search Results
Item: Scene and Video Understanding (2014) Jain, Arpit; Davis, Larry S; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
There have been significant improvements in the accuracy of scene understanding due to a shift from recognizing objects "in isolation" to context-based recognition systems. Such systems improve recognition rates by augmenting appearance-based models of individual objects with contextual information based on pairwise relationships between objects. These pairwise relations incorporate common-sense world knowledge such as co-occurrences and spatial arrangements of objects, temporal consistency, scene layout, etc. However, these relations, even though consistent in the 3D world, change with the viewpoint of the scene. In this thesis, we investigate incorporating contextual information from three different perspectives for scene and video understanding: (a) "what" contextual relations are useful and "how" they should be incorporated into a Markov network during inference, (b) jointly solving the segmentation and recognition problem using a multiple segmentation framework based on contextual information in conjunction with appearance matching, and (c) proposing a discriminative spatio-temporal patch-based representation for videos which incorporates contextual information for video understanding. Our work departs from the traditional view of incorporating context into scene understanding, where a fixed model for context is learned. We argue that context is scene dependent and propose a data-driven approach to predict the importance of relationships and construct a Markov network for image analysis based on statistical models of global and local image features. Since not all contextual information is equally important, we also address the related problem of predicting the feature weights associated with each edge of a Markov network for evaluation of context.
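The role of per-edge weights in a pairwise Markov network can be illustrated with a minimal sketch. This is not the thesis's model: the two-region scene, the class names, the scores, and the helper names `labeling_score` and `map_labeling` are all hypothetical, and inference is done by brute-force enumeration, which is only practical for toy examples.

```python
from itertools import product

def labeling_score(labels, unary, edges):
    """Sum of appearance (unary) scores plus weighted pairwise context scores."""
    score = sum(unary[i][lab] for i, lab in enumerate(labels))
    for (i, j), (weight, compat) in edges.items():
        # Each edge carries its own weight, i.e. how much this relation is trusted.
        score += weight * compat.get((labels[i], labels[j]), 0.0)
    return score

def map_labeling(unary, edges, classes):
    """Brute-force search for the highest-scoring joint labeling (toy sizes only)."""
    n = len(unary)
    return max(product(classes, repeat=n),
               key=lambda labs: labeling_score(labs, unary, edges))

# Hypothetical two-region scene: region 0 lies above region 1.
classes = ("sky", "water")
unary = [
    {"sky": 0.9, "water": 0.1},    # region 0: appearance clearly says sky
    {"sky": 0.55, "water": 0.45},  # region 1: appearance is ambiguous
]
compat = {("sky", "water"): 1.0}   # "sky above water" is a plausible arrangement

ignored_context = {(0, 1): (0.0, compat)}  # edge weight 0: context has no effect
trusted_context = {(0, 1): (1.0, compat)}  # edge weight 1: context is fully used

print(map_labeling(unary, ignored_context, classes))
print(map_labeling(unary, trusted_context, classes))
```

With the edge weight set to zero the ambiguous region follows its appearance score alone; with a larger weight the spatial relation changes its label. This is the sense in which predicting edge weights per scene changes the outcome of inference.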
We then address the problem of fixed segmentation while modeling context by using a multiple segmentation framework and formulating the problem as "a jigsaw puzzle". We formulate the labeling problem as segment selection from a pool of segments (jigsaws), assigning each selected segment a class label. Previous multiple segmentation approaches used local appearance matching to select segments in a greedy manner. In contrast, our approach is based on a cost function that combines contextual information with appearance matching. A relaxed form of the cost function is minimized using an efficient quadratic programming solver. Lastly, we propose a new representation for videos based on mid-level discriminative spatio-temporal patches. These patches might correspond to a primitive human action, a semantic object, or perhaps a random but informative spatio-temporal patch in the video. What defines these spatio-temporal patches is their discriminative and representative properties. We automatically mine these patches from hundreds of training videos and experimentally demonstrate that these patches establish correspondence across videos. We propose a cost function that incorporates co-occurrence statistics and temporal context along with appearance matching to select a subset of these patches for label transfer. Furthermore, these patches can be used as a discriminative vocabulary for action classification.
Item: Recognizing Objects And Reasoning About Their Interactions (2010) Kembhavi, Aniruddha; Davis, Larry S; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
The task of scene understanding involves recognizing the different objects present in the scene, segmenting the scene into meaningful regions, as well as obtaining a holistic understanding of the activities taking place in the scene. Each of these problems has received considerable interest within the computer vision community.
We present contributions to two aspects of visual scene understanding. First, we explore multiple methods of feature selection for the problem of object detection. We demonstrate the use of Principal Component Analysis to detect avifauna in field observation videos. We improve on existing approaches by making robust decisions based on regional features and by a feature selection strategy that chooses different features in different parts of the image. We then demonstrate the use of Partial Least Squares to detect vehicles in aerial and satellite imagery. We propose two new feature sets: Color Probability Maps are used to capture the color statistics of vehicles and their surroundings, and Pairs of Pixels are used to capture the structural characteristics of objects. A powerful feature selection analysis based on Partial Least Squares is employed to deal with the resulting high-dimensional feature space (almost 70,000 dimensions). We also propose an Incremental Multiple Kernel Learning (IMKL) scheme to detect vehicles in a traffic surveillance scenario. Obtaining task- and scene-specific datasets of visual categories is far more tedious than obtaining a generic dataset of the same classes. Our IMKL approach initializes on a generic training database and then tunes itself to the classification task at hand. Second, we develop a video understanding system for scene elements, such as bus stops, crosswalks, and intersections, that are characterized more by qualitative activities and geometry than by intrinsic appearance. The domain models for scene elements are not learned from a corpus of video but are instead naturally elicited from humans and represented as probabilistic logic rules within a Markov Logic Network framework. Human-elicited models, however, represent object interactions as they occur in the 3D world rather than describing their appearance projection in some specific 2D image plane.
We bridge this gap by recovering qualitative scene geometry to analyze object interactions in the 3D world and then reasoning about scene geometry, occlusions and common sense domain knowledge using a set of meta-rules.
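To make the idea of probabilistic logic rules concrete, the following is a minimal sketch of how a single weighted rule behaves in a Markov Logic Network, where a world's probability is proportional to the exponential of the summed weights of the rule groundings it satisfies. The rule, the predicate names (`people_wait`, `bus_halts`, `bus_stop`), and the weight are hypothetical illustrations; the abstract does not list the actual rules used in the thesis.

```python
import math

def implies(a, b):
    """Logical implication a => b."""
    return (not a) or b

def prob_bus_stop(people_wait, bus_halts, weight):
    """P(bus_stop | evidence) under one weighted ground rule.

    Hypothetical rule: people_wait AND bus_halts => bus_stop.
    Each possible world gets unnormalized weight exp(weight) if it
    satisfies the rule and exp(0) = 1 otherwise.
    """
    def world_weight(bus_stop):
        satisfied = implies(people_wait and bus_halts, bus_stop)
        return math.exp(weight if satisfied else 0.0)

    # Normalize over the two possible worlds for the query atom.
    z = world_weight(True) + world_weight(False)
    return world_weight(True) / z

# Evidence supports the rule body: the rule pushes belief toward bus_stop.
print(prob_bus_stop(True, True, weight=2.0))
# Rule body is false: the rule is satisfied either way and gives no information.
print(prob_bus_stop(False, True, weight=2.0))
```

Unlike a hard logical constraint, a world that violates the rule remains possible, only less probable, which is what lets human-elicited common-sense rules tolerate noisy observations.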