GEOMETRIC REPRESENTATIONS AND DEEP GAUSSIAN CONDITIONAL RANDOM FIELD NETWORKS FOR COMPUTER VISION
MetadataShow full item record
Representation and context modeling are two important factors that are critical in the design of computer vision algorithms. For example, in applications such as skeleton-based human action recognition, representations that capture the 3D skeletal geometry are crucial for achieving good action recognition accuracy. However, most of the existing approaches focus mainly on the temporal modeling and classification steps of the action recognition pipeline instead of representations. Similarly, in applications such as image enhancement and semantic image segmentation, modeling the spatial context is important for achieving good performance. However, the standard deep network architectures used for these applications do not explicitly model the spatial context. In this dissertation, we focus on the representation and context modeling issues for some computer vision problems and make novel contributions by proposing new 3D geometry-based representations for recognizing human actions from skeletal sequences, and introducing Gaussian conditional random field model-based deep network architectures that explicitly model the spatial context by considering the interactions among the output variables. In addition, we also propose a kernel learning-based framework for the classification of manifold features such as linear subspaces and covariance matrices which are widely used for image set-based recognition tasks. This dissertation has been divided into five parts. In the first part, we introduce various 3D geometry-based representations for the problem of skeleton-based human action recognition. The proposed representations, referred to as R3DG features, capture the relative 3D geometry between various body parts using 3D rigid body transformations. We model human actions as curves in these R3DG feature spaces, and perform action recognition using a combination of dynamic time warping, Fourier temporal pyramid representation and support vector machines. Experiments on several action recognition datasets show that the proposed representations perform better than many existing skeletal representations. In the second part, we represent 3D skeletons using only the relative 3D rotations between various body parts instead of full 3D rigid body transformations. This skeletal representation is scale-invariant and belongs to a Lie group based on the special orthogonal group. We model human actions as curves in this Lie group and map these curves to the corresponding Lie algebra by combining the logarithm map with rolling maps. Using rolling maps reduces the distortions introduced in the action curves while mapping to the Lie algebra. Finally, we perform action recognition by classifying the Lie algebra curves using Fourier temporal pyramid representation and a support vector machines classifier. Experimental results show that by combining the logarithm map with rolling maps, we can get improved performance when compared to using the logarithm map alone. In the third part, we focus on classification of manifold features such as linear subspaces and covariance matrices. We present a kernel-based extrinsic framework for the classification of manifold features and address the issue of kernel selection using multiple kernel learning. We introduce two criteria for jointly learning the kernel and the classifier by solving a single optimization problem. In the case of support vector machine classifier, we formulate the problem of learning a good kernel-classifier combination as a convex optimization problem. The proposed approach performs better than many existing methods for the classification of manifold features when applied to image set-based classification task. In the fourth part, we propose a novel end-to-end trainable deep network architecture for image denoising based on a Gaussian Conditional Random Field (CRF) model. Contrary to existing discriminative denoising approaches, the proposed network explicitly models the input noise variance and hence is capable of handling a range of noise levels. This network consists of two sub-networks: (i) a parameter generation network that generates the Gaussian CRF pairwise potential parameters based on the input image, and (ii) an inference network whose layers perform the computations involved in an iterative Gaussian CRF inference procedure. Experiments on several images show that the proposed approach produces results on par with the state-of-the-art without training a separate network for each noise level. In the final part of this dissertation, we propose a Gaussian CRF model-based deep network architecture for the task of semantic image segmentation. This network explicitly models the interactions between output variables which is important for structured prediction tasks such as semantic segmentation. The proposed network is composed of three sub-networks: (i) a Convolutional Neural Network (CNN) based unary network for generating the unary potentials, (ii) a CNN-based pairwise network for generating the pairwise potentials, and (iii) a Gaussian mean field inference network for performing Gaussian CRF inference. When trained end-to-end in a discriminative fashion the proposed network outperforms various CNN-based semantic segmentation approaches.