Theses and Dissertations from UMD
Permanent URI for this community: http://hdl.handle.net/1903/2
New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a four-month delay in the appearance of a given thesis/dissertation in DRUM.
More information is available at Theses and Dissertations at University of Maryland Libraries.
Search Results
4 results
Item
Towards a Classification of Almost Complex and Spin^h Manifolds (2024)
Mills, Keith; Rosenberg, Jonathan; Mathematics; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

We show that all homotopy CP^ns, smooth closed manifolds with the oriented homotopy type of CP^n, admit almost complex structures for 3 ≤ n ≤ 6, and classify these structures by their Chern classes for n = 4, 6. Our methods provide a new proof of a result of Libgober and Wood on the classification of almost complex structures on homotopy CP^4s. We also show that all homotopy RP^(2k+1)s admit stably almost complex structures.

Spin^h manifolds are the quaternionic analogue of spin^c manifolds. At the prime 2 we compute the spin^h bordism groups by proving a structure theorem for the cohomology of the spin^h bordism spectrum MSpin^h as a module over the mod 2 Steenrod algebra. This provides a 2-local splitting of MSpin^h as a wedge sum of familiar spectra. We also compute the decomposition of H^*(MSpin^h; Z/2Z) explicitly in degrees up through 30 via a counting process.

Item
GEOMETRIC REPRESENTATIONS AND DEEP GAUSSIAN CONDITIONAL RANDOM FIELD NETWORKS FOR COMPUTER VISION (2016)
Vemulapalli, Raviteja; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Representation and context modeling are two important factors that are critical in the design of computer vision algorithms. For example, in applications such as skeleton-based human action recognition, representations that capture the 3D skeletal geometry are crucial for achieving good action recognition accuracy. However, most of the existing approaches focus mainly on the temporal modeling and classification steps of the action recognition pipeline rather than on representations. Similarly, in applications such as image enhancement and semantic image segmentation, modeling the spatial context is important for achieving good performance. However, the standard deep network architectures used for these applications do not explicitly model the spatial context. In this dissertation, we focus on the representation and context modeling issues for some computer vision problems and make novel contributions by proposing new 3D geometry-based representations for recognizing human actions from skeletal sequences, and by introducing Gaussian conditional random field model-based deep network architectures that explicitly model the spatial context by considering the interactions among the output variables. In addition, we propose a kernel learning-based framework for the classification of manifold features such as linear subspaces and covariance matrices, which are widely used for image set-based recognition tasks. This dissertation is divided into five parts.

In the first part, we introduce various 3D geometry-based representations for the problem of skeleton-based human action recognition. The proposed representations, referred to as R3DG features, capture the relative 3D geometry between various body parts using 3D rigid body transformations. We model human actions as curves in these R3DG feature spaces, and perform action recognition using a combination of dynamic time warping, Fourier temporal pyramid representation, and support vector machines. Experiments on several action recognition datasets show that the proposed representations perform better than many existing skeletal representations.
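To make the relative-geometry idea behind R3DG features concrete, here is a minimal sketch assuming each body part is given by two 3D joint positions; the frame construction and all function names are illustrative assumptions, not the dissertation's exact formulation.

```python
import numpy as np

def part_frame(joint_a, joint_b):
    """Orthonormal frame for a body part defined by two joints: the x-axis
    points along the part; y and z complete a right-handed basis."""
    x = joint_b - joint_a
    x = x / np.linalg.norm(x)
    # Any vector not parallel to x works to complete the basis.
    helper = np.array([0.0, 0.0, 1.0]) if abs(x[2]) < 0.9 else np.array([1.0, 0.0, 0.0])
    y = np.cross(helper, x)
    y = y / np.linalg.norm(y)
    z = np.cross(x, y)
    return np.stack([x, y, z], axis=1)  # 3x3 rotation; columns are the axes

def relative_geometry(part1, part2):
    """Rigid-body transform (R, t) of part2 expressed in part1's local frame."""
    R1 = part_frame(*part1)
    R2 = part_frame(*part2)
    R = R1.T @ R2                      # relative rotation
    t = R1.T @ (part2[0] - part1[0])   # relative translation
    return R, t

# Toy skeleton: each part is a (start_joint, end_joint) pair of 3D points.
upper_arm = (np.array([0.0, 1.4, 0.0]), np.array([0.3, 1.2, 0.0]))
forearm = (np.array([0.3, 1.2, 0.0]), np.array([0.5, 0.9, 0.1]))
R, t = relative_geometry(upper_arm, forearm)
feature = np.concatenate([R.ravel(), t])  # one block of an R3DG-style vector
```

Stacking such blocks over all body-part pairs, frame by frame, yields the kind of curve in feature space on which the temporal pipeline above operates.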
In the second part, we represent 3D skeletons using only the relative 3D rotations between various body parts instead of full 3D rigid body transformations. This skeletal representation is scale-invariant and belongs to a Lie group based on the special orthogonal group. We model human actions as curves in this Lie group and map these curves to the corresponding Lie algebra by combining the logarithm map with rolling maps. Using rolling maps reduces the distortions introduced in the action curves while mapping to the Lie algebra. Finally, we perform action recognition by classifying the Lie algebra curves using the Fourier temporal pyramid representation and a support vector machine classifier. Experimental results show that combining the logarithm map with rolling maps yields improved performance compared to using the logarithm map alone.

In the third part, we focus on the classification of manifold features such as linear subspaces and covariance matrices. We present a kernel-based extrinsic framework for the classification of manifold features and address the issue of kernel selection using multiple kernel learning. We introduce two criteria for jointly learning the kernel and the classifier by solving a single optimization problem. In the case of the support vector machine classifier, we formulate the problem of learning a good kernel-classifier combination as a convex optimization problem. The proposed approach performs better than many existing methods for the classification of manifold features when applied to image set-based classification tasks.

In the fourth part, we propose a novel end-to-end trainable deep network architecture for image denoising based on a Gaussian Conditional Random Field (CRF) model. Contrary to existing discriminative denoising approaches, the proposed network explicitly models the input noise variance and hence is capable of handling a range of noise levels. This network consists of two sub-networks: (i) a parameter generation network that generates the Gaussian CRF pairwise potential parameters based on the input image, and (ii) an inference network whose layers perform the computations involved in an iterative Gaussian CRF inference procedure. Experiments on several images show that the proposed approach produces results on par with the state of the art without training a separate network for each noise level.

In the final part of this dissertation, we propose a Gaussian CRF model-based deep network architecture for the task of semantic image segmentation. This network explicitly models the interactions between output variables, which is important for structured prediction tasks such as semantic segmentation. The proposed network is composed of three sub-networks: (i) a Convolutional Neural Network (CNN)-based unary network for generating the unary potentials, (ii) a CNN-based pairwise network for generating the pairwise potentials, and (iii) a Gaussian mean field inference network for performing Gaussian CRF inference. When trained end-to-end in a discriminative fashion, the proposed network outperforms various CNN-based semantic segmentation approaches.
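As a rough illustration of the kind of computation such an inference network unrolls, here is a minimal sketch of iterative mean-field updates for a Gaussian CRF; the quadratic energy, the Jacobi-style update, and the toy chain example are assumptions for illustration, not the dissertation's architecture.

```python
import numpy as np

def gaussian_crf_mean_field(b, A, n_iters=50):
    """Iterative mean-field inference for a Gaussian CRF with energy
    E(y) = 0.5 * y^T A y - b^T y, whose minimizer solves A y = b.
    Each sweep sets y_i to its conditional mean given the other variables
    (a Jacobi iteration; converges when A is diagonally dominant)."""
    y = np.zeros_like(b)
    diag = np.diag(A)
    off = A - np.diag(diag)
    for _ in range(n_iters):
        y = (b - off @ y) / diag
    return y

# Toy example: 1D chain smoothing with a unary pull toward noisy values.
noisy = np.array([0.9, 1.1, 3.0, 1.0, 0.8])
n = len(noisy)
A = 2.0 * np.eye(n)                  # unary precision terms
for i in range(n - 1):               # pairwise smoothness terms per edge
    A[i, i] += 1.0; A[i + 1, i + 1] += 1.0
    A[i, i + 1] -= 1.0; A[i + 1, i] -= 1.0
b = 2.0 * noisy
print(gaussian_crf_mean_field(b, A))  # outlier pulled toward its neighbors
```

In the networks described above, the analogues of A and b are produced by learned sub-networks rather than fixed by hand, and the update steps become differentiable layers.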
Item
Face Recognition and Facial Attribute Analysis from Unconstrained Visual Data (2014)
Ho, Huy Tho; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Analyzing human faces from visual data has been one of the most active research areas in the computer vision community. However, it is a very challenging problem in unconstrained environments due to variations in pose, illumination, expression, occlusion, and blur between training and testing images. The task becomes even more difficult when only a limited number of images per subject is available for modeling these variations. In this dissertation, different techniques for performing classification of human faces as well as other facial attributes such as expression, age, gender, and head pose in uncontrolled settings are investigated.

In the first part of the dissertation, a method for reconstructing the virtual frontal view from a given non-frontal face image using Markov Random Fields (MRFs) and an efficient variant of the Belief Propagation (BP) algorithm is introduced. In the proposed approach, the input face image is divided into a grid of overlapping patches, and a globally optimal set of local warps is estimated to synthesize the patches at the frontal view. A set of possible warps for each patch is obtained by aligning it with images from a training database of frontal faces. The alignments are performed efficiently in the Fourier domain using an extension of the Lucas-Kanade (LK) algorithm that can handle illumination variations. The problem of finding the optimal warps is then formulated as a discrete labeling problem using an MRF. The reconstructed frontal face image can then be used with any face recognition technique. The two main advantages of our method are that it requires neither manually selected facial landmarks nor head pose estimation.

In the second part, the task of face recognition in unconstrained settings is formulated as a domain adaptation problem. The domain shift is accounted for by deriving a latent subspace or domain, which jointly characterizes the multifactor variations using appropriate image formation models for each factor. The latent domain is defined as a product of Grassmann manifolds based on the underlying geometry of the tensor space, and recognition is performed across the domain shift using statistics consistent with the tensor geometry. More specifically, given a face image from the source or target domain, multiple images of that subject are first synthesized under different illuminations, blur conditions, and 2D perturbations to form a tensor representation of the face. The orthogonal matrices obtained from the decomposition of this tensor, where each matrix corresponds to a factor variation, are used to characterize the subject as a point on a product of Grassmann manifolds. For cases with only one image per subject in the source domain, the identity of target domain faces is estimated using the geodesic distance on product manifolds. When multiple images per subject are available, an extension of kernel discriminant analysis is developed using a novel kernel based on the projection metric on product spaces. Furthermore, a probabilistic approach to the problem of classifying image sets on product manifolds is introduced.
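For intuition about the projection metric, here is a minimal sketch of the standard projection kernel between two subspaces on a single Grassmann manifold; extending it to products of manifolds and to kernel discriminant analysis, as the dissertation does, is not shown, and all names here are illustrative.

```python
import numpy as np

def subspace_basis(image_set, dim):
    """Orthonormal basis (thin SVD) for the span of a set of vectorized
    images, given as the rows of image_set."""
    U, _, _ = np.linalg.svd(image_set.T, full_matrices=False)
    return U[:, :dim]

def projection_kernel(U1, U2):
    """Projection kernel k(U1, U2) = ||U1^T U2||_F^2; the projection metric
    satisfies d^2(U1, U2) = 2 * (dim - k(U1, U2))."""
    return np.linalg.norm(U1.T @ U2, 'fro') ** 2

rng = np.random.default_rng(0)
set_a = rng.standard_normal((10, 64))                 # 10 images, 64 pixels
set_b = set_a + 0.05 * rng.standard_normal((10, 64))  # a perturbed copy
Ua = subspace_basis(set_a, 3)
Ub = subspace_basis(set_b, 3)
print(projection_kernel(Ua, Ub))  # close to 3 (= dim) for nearby subspaces
```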
Understanding attributes such as expression, age class, and gender from face images has many applications in multimedia processing, including content personalization, human-computer interaction, and facial identification. To achieve good performance in these tasks, it is important to be able to extract pertinent visual structures from the input data. In the third part of the dissertation, a fully automatic approach for performing classification of facial attributes based on hierarchical feature learning using sparse coding is presented. The proposed approach is generative in the sense that it does not use label information in the process of feature learning. As a result, the same feature representation can be applied to different tasks such as expression, age, and gender classification. Final classification is performed by a linear SVM trained with the corresponding labels for each task.

The last part of the dissertation presents an automatic algorithm for determining the head pose from a given face image. The face image is divided into a regular grid and represented by dense SIFT descriptors extracted at the grid points. Random Projection (RP) is then applied to reduce the dimension of the concatenated SIFT descriptor vector. Classification and regression using Support Vector Machines (SVMs) are combined in order to obtain an accurate estimate of the head pose. The advantage of the proposed approach is that it does not require facial landmarks, such as the eye and mouth corners or the nose tip, to be extracted from the input face image, as many other methods do.

Item
Efficient Sensing, Summarization and Classification of Videos (2012)
Shroff, Nitesh; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Motion perception is an integral part of visual information processing. For example, humans use motion to perceive the shape and structure of a scene and to segment and recognize objects. Similarly, in computational vision, motion cues have been extensively used in numerous applications, e.g., reconstructing 3D structure, object segmentation, etc. But there are several other applications, such as pose estimation and scene recognition, where motion plays a unique role but which have traditionally been studied using cues other than motion. In this dissertation, we study a few such applications with a focus on characterizing the role of motion. In particular, we study the role of motion in efficient (a) sensing, (b) summarization, and (c) classification of videos.

We start by developing efficient sensing techniques, particularly in cases where computational vision is used for measurement -- inferring the depth, position, orientation, etc. of scene elements. Towards this direction, we begin with the goal of devising sensing techniques that allow the estimation of the layout of a generic scene, i.e., its depth map. This is achieved by proposing an architecture and algorithm that senses the video by varying focal settings between consecutive frames. By extending the paradigm of depth-from-defocus (DFD) to dynamic scenes, we achieve the reconstruction of the depth video and all-focus video from the captured video. This is followed by a technique which, under constrained scenarios, allows us to take a step further and estimate the precise location and orientation of the objects in the scene. We show that by capturing a sequence of images while moving the illumination source between consecutive frames, we can extract specular features on high-curvature metallic objects. Robustly extracted specular features then allow us to estimate the pose of the objects, with applications in machine vision.

Next, we address the problem of concisely representing large video data. The goal here is to gain a quick overview of the video with minimum loss of detail. We argue that this can be achieved by optimizing for the following two conflicting criteria: (a) coverage -- the summary should represent the original video well -- and (b) diversity -- the elements of the summary should be as distinct from each other as possible. This is formulated as a subset selection problem, first in the Euclidean space and then generalized to non-Euclidean manifolds. The generic non-Euclidean manifold formulation allows the algorithm to handle generic computer vision datasets like shapes, textures, linear dynamical systems, etc. A novel annealing-based alternation algorithm is proposed to select the optimal subset. Our experimental evaluation convincingly demonstrates that this formulation effectively highlights diverse motion patterns in the video and hence outputs good summaries without actually using any domain knowledge.
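A minimal Euclidean sketch of the coverage-plus-diversity trade-off follows; it uses a simple greedy selection rather than the annealing-based alternation algorithm of the dissertation, and the scoring details are assumptions for illustration.

```python
import numpy as np

def summarize(frames, k, lam=0.5):
    """Greedy coverage+diversity subset selection in Euclidean space.
    frames: (n, d) feature vectors; returns indices of k exemplars.
    Coverage: every frame should be near some exemplar.
    Diversity: exemplars should be far from each other. lam trades them off."""
    n = frames.shape[0]
    dists = np.linalg.norm(frames[:, None] - frames[None, :], axis=2)
    chosen = [int(dists.sum(axis=1).argmin())]  # start from the medoid
    while len(chosen) < k:
        best, best_score = None, -np.inf
        for i in range(n):
            if i in chosen:
                continue
            cand = chosen + [i]
            coverage = -dists[:, cand].min(axis=1).mean()
            diversity = dists[np.ix_(cand, cand)].sum() / (len(cand) * (len(cand) - 1))
            score = coverage + lam * diversity
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen

# Three synthetic clusters of frame features; a good summary spans all three.
rng = np.random.default_rng(1)
frames = np.vstack([rng.normal(c, 0.1, size=(20, 2))
                    for c in ([0, 0], [3, 0], [0, 3])])
print(summarize(frames, 3))  # tends to pick one exemplar per cluster
```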
Finally, we turn our attention to the classification of videos. Here, we begin with devising exact and approximate nearest neighbor (NN) techniques for fast retrieval of videos from large databases. As these videos, or their representations, lie on non-Euclidean manifolds, the focus here is on formulating the problem so that it utilizes the geometry of the space. We present a geodesic hashing technique which employs intrinsic geodesic-based functions to hash the data for realizing approximate but fast nearest neighbor retrieval. The proposed family of hashing functions, although intrinsic, is optimally selected to empirically satisfy the Locality Sensitive Hashing property. This work is followed up by another classification technique which focuses on generating content-based, particularly scene-based, annotations of videos. We focus on characterizing the motion of scene elements, and show that it not only provides a fine-grained description of videos but also improves classification accuracy. Subsequently, we propose dynamic attributes which can be augmented with the spatial attributes of a scene to categorize dynamic scenes in a semantically meaningful way.
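For background on the Locality Sensitive Hashing property mentioned above, here is a minimal sketch of the classic random-hyperplane LSH scheme in Euclidean space; the dissertation's geodesic hashing replaces such extrinsic hyperplanes with intrinsic geodesic-based functions, which this sketch does not attempt, and the class and variable names are illustrative.

```python
import numpy as np

class RandomHyperplaneLSH:
    """Classic cosine-similarity LSH: each bit records which side of a
    random hyperplane a point falls on; nearby points share more bits,
    so they tend to collide in the same hash bucket."""
    def __init__(self, dim, n_bits, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))

    def hash(self, x):
        return tuple((self.planes @ x > 0).astype(int))

lsh = RandomHyperplaneLSH(dim=128, n_bits=16)
rng = np.random.default_rng(2)
video_a = rng.standard_normal(128)                    # a video descriptor
video_b = video_a + 0.05 * rng.standard_normal(128)   # a near-duplicate
# Near-duplicates agree on most bits, so they land in nearby buckets.
print(sum(b1 == b2 for b1, b2 in zip(lsh.hash(video_a), lsh.hash(video_b))))
```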