Electrical & Computer Engineering Theses and Dissertations

Permanent URI for this collection: http://hdl.handle.net/1903/2765

Search Results

Now showing 1 - 2 of 2
  • Item
    Detecting and Recognizing Humans, Objects, and their Interactions
    (2020) Bansal, Ankan; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Scene understanding is a high-level vision task that involves not just localizing and recognizing objects and people, but also inferring their layouts and interactions with each other. However, current systems for even atomic tasks like object detection suffer from several shortcomings: most object detectors can only detect a limited number of object categories; face recognition systems are prone to mistakes on faces in extreme poses or illuminations; and automated systems for detecting interactions between humans and objects perform poorly. We hypothesize that scene understanding can be improved by drawing on additional semantic data from outside sources and by using the available data intelligently and efficiently.

    Because it is nearly impossible to collect labeled training data for thousands of object categories, we introduce the problem of zero-shot object detection (ZSD). Here, “zero-shot” means recognizing and detecting object categories for which no visual examples are available during training. We first present an approach for ZSD using semantic information encoded in word vectors trained on a large text corpus. We then discuss some challenges associated with ZSD, the most important of which is the definition of a “background” class in this setting: while it is easy to define a “background” class in fully-supervised settings, it is not clear what constitutes “background” in ZSD. We present principled approaches for dealing with this challenge and evaluate them on challenging sets of object classes, without restricting ourselves to similar and/or fine-grained categories as in prior work on zero-shot classification.

    Next, we tackle the problem of detecting human-object interactions (HOIs). Here, again, it is impossible to collect labeled data for every possible type of interaction, and we show that solutions for HOI detection can benefit greatly from semantic information. We present two approaches to this problem. The first exploits functional similarities between objects to share knowledge between models for different classes; the main idea is that humans look similar while interacting with functionally similar objects. Using this idea, even a simple model achieves state-of-the-art results for HOI detection in both the supervised and zero-shot settings. Our second model uses semantic information in the form of the spatial layout of a person and an object to detect their interactions; it contains a layout module which primes the visual module to make the final prediction.

    An automated scene understanding system should further be able to answer natural language questions posed by humans about a scene. We introduce the problem of Image-Set Visual Question Answering (ISVQA) as a generalization of the existing tasks of still-image Visual Question Answering (VQA) and video VQA. We describe two large-scale datasets collected for this problem, one for indoor scenes and one for outdoor scenes, and provide a comprehensive analysis of both. We also adapt VQA models to design baselines for this task and demonstrate the difficulty of the problem.

    Finally, we present new datasets for training face recognition systems. Using these datasets, we show that careful consideration of some critical questions before training can lead to significant improvements in face verification performance. We apply lessons from these experiments to train a face recognition system that can identify and verify faces accurately. We show that our model, trained with the recently introduced Crystal Loss, achieves state-of-the-art performance on many challenging face recognition benchmarks such as IJB-A, IJB-B, and IJB-C. We also evaluate our system on the Disguised Faces in the Wild (DFW) dataset and report convincing initial results.
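
To make the word-vector-based ZSD idea above concrete, here is a minimal sketch, not the dissertation's implementation. It assumes region features have already been projected into the word-vector space by some learned mapping, and it uses a simple max-similarity threshold as a stand-in for the dissertation's more principled treatment of the background class; all names and the threshold value are illustrative.

```python
# Minimal sketch of word-vector-based zero-shot detection scoring.
# Assumes region features are already projected into the same
# d-dimensional space as the class word vectors (hypothetical setup).
import numpy as np

def zsd_scores(region_embeddings, class_word_vectors, bg_threshold=0.3):
    """Score each region against unseen-class word vectors.

    region_embeddings  : (R, d) array, one projected embedding per region
    class_word_vectors : (C, d) array, one word vector per unseen class
    Returns per-region labels (-1 = background) and the similarity matrix.
    """
    # L2-normalize both sides so the dot product is cosine similarity.
    r = region_embeddings / np.linalg.norm(region_embeddings, axis=1, keepdims=True)
    c = class_word_vectors / np.linalg.norm(class_word_vectors, axis=1, keepdims=True)
    sims = r @ c.T                     # (R, C) cosine similarities
    labels = sims.argmax(axis=1)
    best = sims.max(axis=1)
    # Regions close to no class word vector are treated as background;
    # a fixed threshold is a crude proxy for the principled approaches above.
    labels[best < bg_threshold] = -1
    return labels, sims
```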
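The Crystal Loss mentioned above is known in the literature as the L2-constrained softmax loss: features are rescaled to a fixed L2 norm before the final softmax classifier, so every face embedding lies on a hypersphere. A minimal sketch follows; the feature dimension, number of identities, and scale alpha are illustrative assumptions, not the dissertation's settings.

```python
# Sketch of Crystal Loss (L2-constrained softmax): scale each feature to a
# fixed norm alpha, then apply a standard softmax cross-entropy classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrystalLossHead(nn.Module):
    def __init__(self, feat_dim=512, num_identities=10000, alpha=50.0):
        super().__init__()
        self.alpha = alpha                       # hypersphere radius (assumed value)
        self.fc = nn.Linear(feat_dim, num_identities)

    def forward(self, features, labels):
        # Project features onto the alpha-radius hypersphere, then classify.
        normed = self.alpha * F.normalize(features, p=2, dim=1)
        logits = self.fc(normed)
        return F.cross_entropy(logits, labels)
```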
  • Item
    RECOGNITION OF FACES FROM SINGLE AND MULTI-VIEW VIDEOS
    (2014) Du, Ming; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Face recognition has been an active research field for decades. In recent years, with videos playing an increasingly important role in our everyday lives, video-based face recognition has begun to attract considerable research interest, with a wide range of potential applications including TV and movie search and parsing, video surveillance, and access control. Preliminary research results in this field suggest that by exploiting the abundant spatio-temporal information contained in videos, we can greatly improve the accuracy and robustness of a visual recognition system. On the other hand, as this research area is still in its infancy, developing an end-to-end face processing pipeline that can robustly detect, track, and recognize faces remains a challenging task. The goal of this dissertation is to study some of the related problems under different settings.

    We first address the video-based face association problem, in which one attempts to extract face tracks of multiple subjects while maintaining label consistency. Traditional tracking algorithms have difficulty with this task, especially when challenging nuisance factors such as motion blur, low resolution, or significant camera motion are present. We demonstrate that contextual features, in addition to face appearance itself, play an important role in this case. We propose principled methods for combining multiple features using Conditional Random Fields and Max-Margin Markov Networks to infer labels for the detected faces. Unlike many existing approaches, our algorithms work in online mode and hence have a wider range of applications. We address issues such as parameter learning, inference, and handling of false positives and negatives that arise in the proposed approach. Finally, we evaluate our approach on several public databases.

    We next propose a novel video-based face recognition framework that addresses the problem from two different aspects. To handle pose variations, we learn a Structural-SVM-based detector which simultaneously localizes the face fiducial points and estimates the face pose; by adopting a different optimization criterion from existing algorithms, we are able to improve localization accuracy. To model other face variations, we use intra-personal/extra-personal dictionaries. Intra-personal/extra-personal modeling of human faces has been shown to work successfully in the Bayesian face recognition framework, and it has additional advantages in scalability and generalization, which are of critical importance in real-world applications. Combining intra-personal/extra-personal models with dictionary learning enables us to achieve state-of-the-art performance on unconstrained video data, even when the training data come from a different database.

    Finally, we present an approach for video-based face recognition using camera networks. The focus is on handling pose variations by exploiting the strength of the multi-view camera network. Rather than taking the typical approach of modeling these variations, which eventually requires explicit knowledge of pose parameters, we rely on a pose-robust feature that eliminates the need for pose estimation. The pose-robust feature is developed using Spherical Harmonic (SH) representation theory and is extracted from the surface texture map of a spherical model that approximates the subject's head. Feature vectors extracted from a video are modeled as an ensemble of instances of a probability distribution in a Reproducing Kernel Hilbert Space (RKHS). The resulting ensemble similarity measure improves both the robustness and the accuracy of the recognition system, and the proposed approach outperforms traditional algorithms on a multi-view video database collected using a camera network.
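
As a rough illustration of comparing videos through ensembles of features in an RKHS, the sketch below computes a normalized kernel mean-map similarity between two sets of pose-robust feature vectors. This generic measure is a stand-in for the dissertation's specific ensemble similarity; the choice of RBF kernel and its bandwidth are assumptions.

```python
# Sketch of an RKHS ensemble similarity: each video yields a set of feature
# vectors, and two sets are compared via the inner product of their kernel
# mean embeddings, normalized to lie in [-1, 1].
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    # k(x, y) = exp(-gamma * ||x - y||^2), computed for all pairs.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def ensemble_similarity(X, Y, gamma=0.5):
    """Normalized mean-map similarity between feature sets X (m, d) and Y (n, d)."""
    kxy = rbf_kernel(X, Y, gamma).mean()   # <mu_X, mu_Y> in the RKHS
    kxx = rbf_kernel(X, X, gamma).mean()   # ||mu_X||^2
    kyy = rbf_kernel(Y, Y, gamma).mean()   # ||mu_Y||^2
    return kxy / np.sqrt(kxx * kyy)        # cosine of the two mean embeddings
```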