Efficient Sensing, Summarization and Classification of Videos
Motion perception is an integral part of visual information processing. For example, humans use motion to perceive the shape and structure of a scene and to segment and recognize objects. Similarly, in computational vision, motion cues have been used extensively in numerous applications, e.g., reconstructing 3D structure and object segmentation. But there are several other applications, such as pose estimation and scene recognition, where motion plays a unique role but which have traditionally been studied using cues other than motion. In this dissertation, we study a few such applications with a focus on characterizing the role of motion. In particular, we study the role of motion in efficient (a) sensing, (b) summarization, and (c) classification of videos. We start by developing efficient sensing techniques, particularly for cases where computational vision is used for measurement -- inferring the depth, position, orientation, etc. of scene elements. In this direction, we begin with the goal of devising sensing techniques that allow estimation of the layout of a generic scene, i.e., its depth map. This is achieved by proposing an architecture and algorithm that sense the video while varying the focal setting between consecutive frames. By extending the paradigm of depth-from-defocus (DFD) to dynamic scenes, we reconstruct both the depth video and the all-focus video from the captured video. We then devise a technique which, under constrained scenarios, allows us to go a step further and estimate the precise location and orientation of the objects in the scene. We show that by capturing a sequence of images while moving the illumination source between consecutive frames, we can extract specular features on high-curvature metallic objects. These robustly extracted specular features then allow us to estimate the pose of the objects, with applications in machine vision.
Next, we address the problem of concisely representing large video data. The goal here is to gain a quick overview of the video with minimum loss of detail. We argue that this can be achieved by optimizing for the following two conflicting criteria: (a) Coverage -- requires that the summary be able to represent the original video well, and (b) Diversity -- requires that the elements of the summary be as distinct from each other as possible. This is formulated as a subset selection problem, first in Euclidean space and then generalized to non-Euclidean manifolds. The generic non-Euclidean manifold formulation allows the algorithm to handle generic computer-vision datasets such as shapes, textures, and linear dynamical systems. A novel annealing-based alternation algorithm is proposed to select the optimal subset. Our experimental evaluation convincingly demonstrates that this formulation effectively highlights diverse motion patterns in the video and hence produces good summaries without using any domain knowledge. Finally, we turn our attention to the classification of videos. Here, we begin by devising exact and approximate nearest neighbor (NN) techniques for fast retrieval of videos from large databases. As these videos, or their representations, lie on non-Euclidean manifolds, the focus here is on formulating the problem so that it utilizes the geometry of the space. We present a geodesic hashing technique which employs intrinsic geodesic-based functions to hash the data, realizing approximate but fast nearest neighbor retrieval. The proposed family of hashing functions, although intrinsic, is optimally selected to empirically satisfy the Locality Sensitive Hashing property. This work is followed by another classification technique which focuses on generating content-based, particularly scene-based, annotations of videos.
We focus on characterizing the motion of scene elements, and show that it not only provides a fine-grained description of videos but also improves classification accuracy. Subsequently, we propose dynamic attributes which can be augmented with spatial attributes of a scene to categorize dynamic scenes in a semantically meaningful way.
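
As a minimal illustration of the coverage and diversity criteria in the Euclidean setting, the following Python sketch selects a small summary from a set of feature vectors. It uses a simple greedy search, not the annealing-based alternation algorithm proposed in the dissertation. The function names, the squared-distance costs, the weight `lam`, and the toy `frames` data are all assumptions made for illustration.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def coverage_cost(points, chosen):
    """Sum over all points of the squared distance to the nearest chosen
    exemplar. Lower values mean the summary represents the data better."""
    return sum(min(sq_dist(p, points[j]) for j in chosen) for p in points)


def diversity(points, chosen):
    """Sum of pairwise squared distances among the chosen exemplars.
    Higher values mean the summary elements are more distinct."""
    return sum(sq_dist(points[i], points[j]) for i, j in combinations(chosen, 2))


def greedy_summary(points, k, lam=0.1):
    """Greedily pick k indices (k <= len(points)) that minimize
    coverage_cost - lam * diversity. This is an illustrative stand-in,
    not the dissertation's algorithm."""
    chosen = []
    for _ in range(k):
        candidates = [j for j in range(len(points)) if j not in chosen]
        best = min(
            candidates,
            key=lambda j: coverage_cost(points, chosen + [j])
            - lam * diversity(points, chosen + [j]),
        )
        chosen.append(best)
    return chosen


# Hypothetical toy data: three tight groups of 2-D "frame features".
frames = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (10.0, 0.0), (10.1, 0.0), (10.0, 0.1),
    (0.0, 10.0), (0.1, 10.0), (0.0, 10.1),
]
summary = greedy_summary(frames, k=3)
```

In this sketch `lam` sets the trade-off between the two criteria: a small value favors exemplars that sit close to the data they represent, while a large value pushes the selection toward mutually distant points, including outliers. Extending the idea to a non-Euclidean manifold would require replacing `sq_dist` with an appropriate geodesic distance.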