Theses and Dissertations from UMD
Permanent URI for this community: http://hdl.handle.net/1903/2
New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a four-month delay before a given thesis/dissertation appears in DRUM.
More information is available at Theses and Dissertations at University of Maryland Libraries.
Search Results
6 results
Item
Detecting and Recognizing Humans, Objects, and their Interactions (2020)
Bansal, Ankan; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Scene understanding is a high-level vision task which involves not just localizing and recognizing objects and people but also inferring their layouts and interactions with each other. However, current systems for even atomic tasks like object detection suffer from several shortcomings. Most object detectors can only detect a limited number of object categories; face recognition systems are prone to mistakes on faces in extreme poses or illuminations; and automated systems for detecting interactions between humans and objects perform poorly. We hypothesize that scene understanding can be improved by using additional semantic data from outside sources and by using the available data intelligently and efficiently.

Given that it is nearly impossible to collect labeled training data for thousands of object categories, we introduce the problem of zero-shot object detection (ZSD). Here, “zero-shot” means recognizing/detecting without using any visual data during training. We first present an approach for ZSD using semantic information encoded in word vectors trained on a large text corpus. We discuss some challenges associated with ZSD, the most important of which is the definition of a “background” class in this setting. It is easy to define a “background” class in fully-supervised settings; however, it is not clear what constitutes “background” in ZSD. We present principled approaches for dealing with this challenge and evaluate them on challenging sets of object classes, not restricting ourselves to similar and/or fine-grained categories as in prior works on zero-shot classification.
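A minimal sketch of the word-vector scoring idea mentioned in this abstract, assuming a region's visual feature has already been projected into word-vector space; the names, dimensions, and random stand-ins below are illustrative, not taken from the dissertation:

import numpy as np

def zero_shot_scores(region_embedding, class_word_vectors):
    """Score a detected region against classes never seen during training.

    region_embedding: (d,) visual feature projected into word-vector space.
    class_word_vectors: (num_classes, d) word vectors learned from text.
    Returns cosine similarities; the highest-scoring class is the prediction.
    """
    r = region_embedding / (np.linalg.norm(region_embedding) + 1e-8)
    w = class_word_vectors / (np.linalg.norm(class_word_vectors, axis=1,
                                             keepdims=True) + 1e-8)
    return w @ r

# Toy usage with random stand-ins for learned embeddings:
rng = np.random.default_rng(0)
scores = zero_shot_scores(rng.normal(size=300), rng.normal(size=(5, 300)))
print(scores.argmax())

In a full ZSD system this scoring would sit on top of a region-proposal stage and a learned visual-to-semantic projection; the sketch only shows why classes unseen during training can still be ranked.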
Next, we tackle the problem of detecting human-object interactions (HOIs). Here, again, it is impossible to collect labeled data for each type of possible interaction. We show that solutions for HOI detection can greatly benefit from semantic information. We present two approaches for solving this problem. In the first approach, we exploit functional similarities between objects to share knowledge between models for different classes. The main idea is that humans look similar while interacting with functionally similar objects. We show that, using this idea, even a simple model can achieve state-of-the-art results for HOI detection in both the supervised and zero-shot settings. Our second model uses semantic information in the form of the spatial layout of a person and an object to detect their interactions. This model contains a layout module which primes the visual module to make the final prediction.

An automated scene understanding system should, further, be able to answer natural language questions posed by humans about a scene. We introduce the problem of Image-Set Visual Question Answering (ISVQA) as a generalization of the existing tasks of Visual Question Answering (VQA) for still images and video VQA. We describe two large-scale datasets collected for this problem: one for indoor scenes and one for outdoor scenes. We provide a comprehensive analysis of the two datasets. We also adapt VQA models to design baselines for this task and demonstrate the difficulty of the problem.

Finally, we present new datasets for training face recognition systems. Using these datasets, we show that careful consideration of some critical questions before training can lead to significant improvements in face verification performance. We use lessons from these experiments to train a face recognition system which can identify and verify faces accurately. We show that our model, trained with the recently introduced Crystal Loss, can achieve state-of-the-art performance on many challenging face recognition benchmarks like IJB-A, IJB-B, and IJB-C. We evaluate our system on the Disguised Faces in the Wild (DFW) dataset and show convincing first results.
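Crystal Loss is commonly described as an L2-constrained softmax: embeddings are rescaled to a fixed norm alpha before the usual cross-entropy. A hedged sketch under that reading (alpha and the tensor sizes below are illustrative values, not the dissertation's settings):

import torch
import torch.nn.functional as F

def crystal_loss(features, labels, weight, alpha=50.0):
    """L2-constrained softmax: scale features to a fixed norm, then softmax.

    features: (batch, d) penultimate-layer face embeddings.
    weight:   (num_identities, d) classifier weights.
    alpha is a hyperparameter; 50.0 is only an illustrative value.
    """
    scaled = alpha * F.normalize(features, dim=1)  # project onto sphere of radius alpha
    logits = scaled @ weight.t()
    return F.cross_entropy(logits, labels)

# Toy usage:
feats = torch.randn(8, 512)
labels = torch.randint(0, 100, (8,))
w = torch.randn(100, 512, requires_grad=True)
print(crystal_loss(feats, labels, w).item())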
Item
ROBUST REPRESENTATIONS FOR UNCONSTRAINED FACE RECOGNITION AND ITS APPLICATIONS (2016)
Chen, Jun-Cheng; Chellappa, Rama; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Face identification and verification are important problems in computer vision and have been actively researched for over two decades. There are several applications, including mobile authentication, visual surveillance, social network analysis, and video content analysis. Many algorithms have been shown to work well on images collected in controlled settings. However, the performance of these algorithms often degrades significantly on images that have large variations in pose, illumination, and expression, as well as due to aging, cosmetics, and occlusion. How to extract robust and discriminative feature representations from face images/videos is an important problem for achieving good performance in uncontrolled settings. In this dissertation, we present several approaches to extract robust feature representations from a set of images/video frames for face identification and verification problems.

We first present a dictionary approach with dense facial landmark features. Each face video is first segmented into K partitions, and multi-scale features are extracted from patches centered at detected facial landmarks. Then, compact and representative dictionaries are learned from the dense features for each partition of a video and concatenated into a video dictionary representation for the video. Experiments show that the representation is effective for the unconstrained video-based face identification task. Secondly, we present a landmark-based Fisher vector approach for video-based face verification problems. This approach encodes over-complete local features into a high-dimensional feature representation, followed by a learned joint Bayesian metric to project the feature vector into a low-dimensional space and to compute the similarity score.

We then present an automated system for face verification which exploits features from a deep convolutional neural network (DCNN) trained using the CASIA-WebFace dataset. Our experimental results show that the DCNN model is able to characterize the face variations in the large-scale source face dataset and generalizes well to another, smaller one. Finally, we also demonstrate that a model pre-trained for face identification and verification tasks encodes rich face information which benefits other face-related tasks with scarce annotated training data. We use apparent age estimation as an example and develop a cascaded convolutional neural network framework which consists of age group classification and age regression, where a deep network is fine-tuned using the target data.
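The dissertation pairs DCNN features with a learned joint Bayesian metric; as a deliberately simplified stand-in, verification from deep descriptors can be sketched with plain cosine similarity against a tuned threshold (the threshold and feature dimension below are placeholders):

import numpy as np

def verify(feat_a, feat_b, threshold=0.4):
    """Decide "same person or not" from two deep face descriptors.

    feat_a, feat_b: (d,) DCNN embeddings of two face images.
    threshold is dataset-dependent and would be tuned on a validation split.
    """
    a = feat_a / np.linalg.norm(feat_a)
    b = feat_b / np.linalg.norm(feat_b)
    similarity = float(a @ b)
    return similarity, similarity > threshold

# Toy usage with random stand-ins for network outputs:
rng = np.random.default_rng(1)
print(verify(rng.normal(size=512), rng.normal(size=512)))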
Item
RECOGNITION OF FACES FROM SINGLE AND MULTI-VIEW VIDEOS (2014)
Du, Ming; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Face recognition has been an active research field for decades. In recent years, with videos playing an increasingly important role in our everyday life, video-based face recognition has begun to attract considerable research interest. This leads to a wide range of potential application areas, including TV/movie search and parsing, video surveillance, access control, etc. Preliminary research results in this field have suggested that by exploiting the abundant spatio-temporal information contained in videos, we can greatly improve the accuracy and robustness of a visual recognition system. On the other hand, as this research area is still in its infancy, developing an end-to-end face processing pipeline that can robustly detect, track, and recognize faces remains a challenging task. The goal of this dissertation is to study some of the related problems under different settings.

We address the video-based face association problem, in which one attempts to extract face tracks of multiple subjects while maintaining label consistency. Traditional tracking algorithms have difficulty handling this task, especially when challenging nuisance factors like motion blur, low resolution, or significant camera motion are present. We demonstrate that contextual features, in addition to face appearance itself, play an important role in this case. We propose principled methods to combine multiple features using Conditional Random Fields and Max-Margin Markov networks to infer labels for the detected faces. Unlike many existing approaches, our algorithms work in online mode and hence have a wider range of applications. We address issues such as parameter learning, inference, and handling false positives/negatives that arise in the proposed approach. Finally, we evaluate our approach on several public databases.

We next propose a novel video-based face recognition framework. We address the problem from two different aspects. To handle pose variations, we learn a Structural-SVM-based detector which can simultaneously localize the face fiducial points and estimate the face pose; by adopting a different optimization criterion from existing algorithms, we are able to improve localization accuracy. To model other face variations, we use intra-personal/extra-personal dictionaries. The intra-personal/extra-personal modeling of human faces has been shown to work successfully in the Bayesian face recognition framework. It has additional advantages in scalability and generalization, which are of critical importance to real-world applications. Combining intra-personal/extra-personal models with dictionary learning enables us to achieve state-of-the-art performance on unconstrained video data, even when the training data come from a different database.

Finally, we present an approach for video-based face recognition using camera networks. The focus is on handling pose variations by applying the strength of the multi-view camera network. However, rather than taking the typical approach of modeling these variations, which eventually requires explicit knowledge about pose parameters, we rely on a pose-robust feature that eliminates the need for pose estimation. The pose-robust feature is developed using Spherical Harmonic (SH) representation theory. It is extracted using the surface texture map of a spherical model which approximates the subject's head. Feature vectors extracted from a video are modeled as an ensemble of instances of a probability distribution in a Reproducing Kernel Hilbert Space (RKHS). The ensemble similarity measure in the RKHS improves both the robustness and the accuracy of the recognition system. The proposed approach outperforms traditional algorithms on a multi-view video database collected using a camera network.
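One standard ensemble similarity between feature sets viewed as samples of distributions embedded in an RKHS is the maximum mean discrepancy (MMD). The sketch below illustrates that family of measures with an RBF kernel; it is not necessarily the exact measure used in the dissertation, and the bandwidth is an assumed value:

import numpy as np

def mmd_rbf(X, Y, sigma=1.0):
    """Biased estimate of squared maximum mean discrepancy between ensembles.

    X: (m, d) features from video A; Y: (n, d) features from video B.
    Each ensemble is treated as samples of a distribution embedded in an
    RKHS via an RBF kernel; a small MMD means similar distributions.
    """
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

# Toy usage with random stand-ins for per-frame features:
rng = np.random.default_rng(2)
print(mmd_rbf(rng.normal(size=(30, 64)), rng.normal(size=(40, 64))))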
Item
FACE RECOGNITION AND VERIFICATION IN UNCONSTRAINED ENVIRONMENTS (2012)
Guo, Huimin; Davis, Larry; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Face recognition has been a long-standing problem in computer vision. General face recognition is challenging because of large appearance variability due to factors including pose, ambient lighting, expression, size of the face, age, and distance from the camera. There are very accurate techniques for performing face recognition in controlled environments, especially when large numbers of samples are available for each face (individual). However, face identification under uncontrolled (unconstrained) environments or with limited training data is still an unsolved problem. There are two face recognition tasks: face identification (who is who in a probe face set, given a gallery face set) and face verification (same or not, given two faces). In this work, we study both face identification and verification in unconstrained environments.

Firstly, we propose a face verification framework that combines Partial Least Squares (PLS) and the One-Shot similarity model [1]. The idea is to describe a face with a large feature set combining shape, texture, and color information. PLS regression is applied to perform multi-channel feature weighting on this large feature set. Finally, the PLS regression is used to compute the similarity score of an image pair by One-Shot learning (using a fixed negative set).

Secondly, we study face identification with image sets, where the gallery and probe are sets of face images of an individual. We model a face set by its covariance matrix (COV), which is a natural second-order statistic of a sample set. By exploiting an efficient metric for SPD matrices, namely the Log-Euclidean Distance (LED), we derive a kernel function that explicitly maps the covariance matrix from the Riemannian manifold to a Euclidean space. Then, discriminative learning is performed on the COV manifold: the learning aims to maximize the between-class COV distance and minimize the within-class COV distance.

Sparse representation and dictionary learning have been widely used in face recognition, especially when large numbers of samples are available for each face (individual). Sparse coding is promising since it provides a more stable and discriminative face representation. In the last part of our work, we explore sparse coding and dictionary learning for the face verification application. More specifically, in one approach, we apply sparse representations to face verification in two ways via a fixed reference set as the dictionary. In the other approach, we propose a dictionary learning framework with explicit pairwise constraints, which unifies discriminative dictionary learning for pair matching (face verification) and classification (face recognition) problems.
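The COV/LED idea admits a compact sketch: summarize each image set by a regularized covariance matrix and compare matrix logarithms in Frobenius norm, which flattens the SPD manifold into a Euclidean space. The regularization constant below is an assumption for numerical safety, not a value from the dissertation:

import numpy as np

def matrix_log(S):
    """Logarithm of a symmetric positive-definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def log_euclidean_distance(feats_a, feats_b, eps=1e-6):
    """Compare two face image sets via their covariance matrices.

    feats_a: (n_a, d) features from one image set; feats_b: (n_b, d) likewise.
    Each set is summarized by its d x d covariance, regularized to stay SPD;
    the Log-Euclidean Distance is the Frobenius distance between matrix logs.
    """
    def spd_cov(F):
        C = np.cov(F, rowvar=False)
        return C + eps * np.eye(C.shape[0])  # keep the matrix positive-definite
    return np.linalg.norm(matrix_log(spd_cov(feats_a))
                          - matrix_log(spd_cov(feats_b)), "fro")

# Toy usage with random stand-ins for per-image features:
rng = np.random.default_rng(3)
print(log_euclidean_distance(rng.normal(size=(50, 16)), rng.normal(size=(60, 16))))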
Item
Dense Wide-Baseline Stereo with Varying Illumination and its Application to Face Recognition (2012)
Castillo, Carlos Domingo; Jacobs, David W; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

We study the problem of dense wide-baseline stereo with varying illumination. We are motivated by the problem of face recognition across pose. Stereo matching allows us to compare face images based on physically valid, dense correspondences. We show that the stereo matching cost provides a very robust measure of the similarity of faces that is insensitive to pose variations. We build on the observation that most illumination-insensitive local comparisons require the use of relatively large windows. The size of these windows is affected by foreshortening. If we do not account for this effect, we incur misalignments that are systematic and significant and are exacerbated by wide-baseline conditions.

We present a general formulation of dense wide-baseline stereo with varying illumination and provide two methods to solve it. The first method is based on dynamic programming (DP) and fully accounts for the effect of slant. The second method is based on graph cuts (GC) and fully accounts for the effects of both slant and tilt. The GC method finds a global solution using the unary function from the general formulation and a novel smoothness term that encodes surface orientation. Our experiments show that DP dense wide-baseline stereo achieves superior performance compared to existing methods in face recognition across pose. The experiments with the GC method show that accounting for both slant and tilt can improve performance in situations with wide baselines and lighting variation. Our formulation can be applied to other, more sophisticated window-based image comparison methods for stereo.

Item
Techniques for Image Retrieval: Deformation Insensitivity and Automatic Thumbnail Cropping (2006-08-03)
Ling, Haibin; Jacobs, David W; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

We study several problems in image retrieval systems. These problems and the proposed techniques are divided into three parts.

Part I: This part focuses on robust object representation, which is of fundamental importance in computer vision. We target this problem without using specific object models, which allows us to develop methods that can be applied to many different problems. Three approaches are proposed that are insensitive to different kinds of object or image changes. First, we propose using the inner-distance, defined as the length of the shortest path within the shape boundary, to build articulation-insensitive shape descriptors. Second, a deformation-insensitive framework for image matching is presented, along with an insensitive descriptor based on geodesic distances on image surfaces. Third, we use a gradient orientation pyramid as a robust face image representation and apply it to the task of face verification across ages.

Part II: This part concentrates on comparing histogram-based descriptors that are widely used in image retrieval. We first present an improved algorithm for the Earth Mover's Distance (EMD), which is a popular dissimilarity measure between histograms. The new algorithm is one order of magnitude faster than the original EMD algorithms. Then, motivated by the new algorithm, a diffusion-based distance is designed that is more straightforward and efficient. The efficiency and effectiveness of the proposed approaches are validated in experiments on both shape recognition and interest point matching tasks, using both synthetic and real data.

Part III: This part studies the thumbnail generation problem, which has wide application in visualization tasks. Traditionally, thumbnails are generated by shrinking the original images, and the resulting thumbnails are often illegible due to size limitations. We study the ability of computer vision systems to detect key components of images so that intelligent cropping, prior to shrinking, can render objects more recognizable. With this idea, we propose an automatic thumbnail cropping technique based on the distribution of pixel saliency in an image. The proposed approach is tested in a carefully designed user study, which shows that the cropped thumbnails are substantially more recognizable and easier to find in the context of visual search.
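As a hedged illustration of saliency-based cropping (a greedy thresholding variant, not the dissertation's exact algorithm), one can shrink the crop to the smallest axis-aligned box that retains most of the saliency mass; the keep fraction below is an assumed parameter:

import numpy as np

def saliency_crop(saliency, keep=0.9):
    """Crop box keeping roughly `keep` of the total saliency mass.

    saliency: (H, W) nonnegative saliency map for the image.
    The threshold is lowered from the peak until the bounding box of the
    remaining salient pixels holds at least `keep` of the total saliency.
    """
    total = saliency.sum()
    for t in np.linspace(saliency.max(), 0, 100):
        ys, xs = np.nonzero(saliency >= t)
        if len(ys) == 0:
            continue
        y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()
        if saliency[y0:y1 + 1, x0:x1 + 1].sum() >= keep * total:
            return y0, y1, x0, x1  # shrink this crop to produce the thumbnail

# Toy usage with a synthetic saliency blob:
s = np.zeros((100, 100))
s[30:60, 40:80] = 1.0
print(saliency_crop(s))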