Electrical & Computer Engineering Theses and Dissertations

Permanent URI for this collection: http://hdl.handle.net/1903/2765

Now showing 1 - 5 of 5
  • Item
    Towards a Fast and Accurate Face Recognition System from Deep Representations
    (2019) Ranjan, Rajeev; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    The key components of a machine perception algorithm are feature extraction followed by classification or regression. The features representing the input data should have the following desirable properties: 1) they should contain the discriminative information required for accurate classification; 2) they should be robust and adaptive to variations in the input data due to illumination, translation/rotation, resolution, and input noise; 3) they should lie on a simple manifold for easy classification or regression. Over the years, researchers have devised various hand-crafted techniques to extract meaningful features. However, these features do not perform well on data collected in unconstrained settings, owing to large variations in appearance and other nuisance factors. Recent developments in deep convolutional neural networks (DCNNs) have shown impressive performance improvements on various machine perception tasks such as object detection and recognition. DCNNs are highly non-linear regressors because of the presence of hierarchical convolutional layers with non-linear activations. Unlike hand-crafted features, DCNNs learn the feature extraction and feature classification/regression modules from the data itself in an end-to-end fashion. This enables DCNNs to be robust to variations present in the data while improving their discriminative ability. Ever-increasing computational power and the availability of large datasets have led to significant performance gains from DCNNs. However, these developments in deep learning are not directly applicable to face analysis tasks because of the large variations in illumination, resolution, viewpoint, and attributes of faces acquired in unconstrained settings. In this dissertation, we address this issue by developing efficient DCNN architectures and loss functions for multiple face analysis tasks such as face detection, pose estimation, landmark localization, and face recognition from unconstrained images and videos.
    In the first part of this dissertation, we present two face detection algorithms based on deep pyramidal features. The first face detector, called DP2MFD, utilizes the concepts of the deformable parts model (DPM) in the context of deep learning. It is able to detect faces of various sizes and poses in unconstrained conditions, and it reduces the gap between training and testing of DPM on deep features by adding a normalization layer to the DCNN. The second face detector, called the Deep Pyramid Single Shot Face Detector (DPSSD), is fast and capable of detecting faces with large scale variations (especially tiny faces). It makes use of the inbuilt pyramidal hierarchy present in a DCNN instead of creating an image pyramid. Extensive experiments on publicly available unconstrained face detection datasets show that both face detectors capture the meaningful structure of faces and perform significantly better than many traditional face detection algorithms.
    In the second part of this dissertation, we present two algorithms for simultaneous face detection, landmark localization, pose estimation, and gender recognition using DCNNs. The first method, called HyperFace, fuses the intermediate layers of a DCNN using a separate CNN, followed by a multi-task learning algorithm that operates on the fused features. The second approach, All-In-One Face, extends HyperFace to incorporate the additional tasks of face verification, age estimation, and smile detection. HyperFace and All-In-One Face exploit the synergy among the tasks, which improves their individual performances.
    In the third part of this dissertation, we focus on improving the task of face verification by designing a novel loss function that maximizes the inter-class distance and minimizes the intra-class distance in the feature space. We propose a new loss function, called Crystal Loss, that adds an L2-constraint to the feature descriptors, restricting them to lie on a hypersphere of a fixed radius. This module can be easily implemented using existing deep learning frameworks. We show that integrating this simple step in the training pipeline significantly boosts the performance of face verification. We additionally describe a deep learning pipeline for unconstrained face identification and verification that achieves state-of-the-art performance on several benchmark datasets. We provide the design details of the various modules involved in automatic face recognition: face detection, landmark localization and alignment, and face identification/verification. We present experimental results for end-to-end face verification and identification on the IARPA Janus Benchmarks A, B, and C (IJB-A, IJB-B, IJB-C) and the Janus Challenge Set 5 (CS5).
    Though DCNNs have surpassed human-level performance on tasks such as object classification and face verification, they can easily be fooled by adversarial attacks. These attacks add a small perturbation to the input image that causes the network to misclassify the sample. In the final part of this dissertation, we focus on safeguarding DCNNs and neutralizing adversarial attacks through compact feature learning. In particular, we show that learning features in a closed and bounded space improves the robustness of the network. We explore the effect of Crystal Loss, which enforces compactness in the learned features, resulting in enhanced robustness to adversarial perturbations. Additionally, we propose compact convolution, a novel method of convolution that, when incorporated into conventional CNNs, improves their robustness. Compact convolution enforces feature compactness at every layer, keeping the features bounded and close to each other. Extensive experiments show that Compact Convolutional Networks (CCNs) neutralize multiple types of attacks and perform better than existing methods in defending against adversarial attacks, without incurring any additional training overhead compared to CNNs.
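    To make the L2-constraint concrete, the following is a minimal NumPy sketch of the Crystal Loss idea described above: feature vectors are projected onto a hypersphere of fixed radius before a standard softmax cross-entropy. The array shapes, the radius value, and the plain linear classifier are illustrative assumptions, not the dissertation's exact implementation.

```python
import numpy as np

def crystal_loss(features, weights, labels, alpha=50.0):
    """Softmax cross-entropy computed after constraining each feature
    vector to an L2 hypersphere of radius alpha (the Crystal Loss idea).
    Shapes: features (N, D), weights (D, C), labels (N,) of class ids.
    alpha and the linear classifier are illustrative assumptions."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    constrained = alpha * features / (norms + 1e-12)  # project onto sphere
    logits = constrained @ weights
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Toy usage: 4 samples, 8-dim features, 3 classes.
rng = np.random.default_rng(0)
loss = crystal_loss(rng.standard_normal((4, 8)),
                    rng.standard_normal((8, 3)),
                    np.array([0, 2, 1, 0]))
```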
  • Item
    Demonstrating Cognition by Task Execution and Motion Planning with different algorithms for Manipulation
    (2018) Dimitriadis, Dimitrios; Baras, John S.; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    In this thesis we demonstrate the complete path leading to manipulation and planning with the Baxter robot. We start with the kinematic analysis of a six-degrees-of-freedom robot, building our analysis on the Denavit-Hartenberg method. We proceed with the kinematic equations of the robot, the inverse kinematics, and a kinematic simulation of its movement in MATLAB. To reach our final goal, we continue with the kinematic and dynamic analysis of the Baxter robot. We again state the Denavit-Hartenberg matrix, but this time we continue by building the dynamic model of the Baxter robot through the Euler-Lagrange equations.
    Moving on, we explore planning algorithms, knowledge of which helps us formulate our path planner for the Baxter robot. We experiment by implementing four planning algorithms on different path planning problems. We construct the RRT and RRT* algorithms in Python and apply them to different planning problems. We also implement a planning problem in which the Q-learning and SARSA algorithms are used; we demonstrate how these two planning and learning algorithms work on our specified problem and compare the results.
    With this grounding in kinematic and dynamic robotic analysis and in motion planning algorithms, we then experiment with the Baxter simulator in Gazebo. We also plan motions for the Baxter robot with MoveIt!, becoming familiar with ROS as well as with the software. We add obstacles to our world and plan the Baxter robot's motion while measuring its speed. Finally, we build a different planning algorithm, RRT+, which focuses on searching for a secure and realizable path plan by starting from a lower-dimensional space and then adding degrees of freedom to the Baxter robot. In conclusion, we have laid out the steps someone would need in order to build up the knowledge required to work with robots and artificial intelligence planning.
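    Since the thesis implements RRT in Python, a minimal 2D sketch of the basic RRT loop may help fix the idea. The step size, goal bias, world bounds, and the is_free collision callback below are illustrative assumptions, not the thesis's implementation.

```python
import math
import random

def rrt(start, goal, is_free, step=0.5, goal_tol=0.5, max_iters=5000,
        bounds=((0.0, 10.0), (0.0, 10.0))):
    """Minimal 2D RRT: grow a tree from `start` toward random samples
    and return a list of waypoints to `goal`, or None on failure."""
    nodes, parent = [start], {0: None}
    for _ in range(max_iters):
        # Sample a random point, with a small goal bias.
        sample = goal if random.random() < 0.05 else (
            random.uniform(*bounds[0]), random.uniform(*bounds[1]))
        # Steer one step from the nearest tree node toward the sample.
        i = min(range(len(nodes)), key=lambda k: math.dist(nodes[k], sample))
        theta = math.atan2(sample[1] - nodes[i][1], sample[0] - nodes[i][0])
        new = (nodes[i][0] + step * math.cos(theta),
               nodes[i][1] + step * math.sin(theta))
        if not is_free(new):
            continue  # reject nodes that collide with obstacles
        nodes.append(new)
        parent[len(nodes) - 1] = i
        if math.dist(new, goal) < goal_tol:  # goal reached: backtrack
            path, j = [], len(nodes) - 1
            while j is not None:
                path.append(nodes[j])
                j = parent[j]
            return path[::-1]
    return None  # no path found within the iteration budget

# Toy usage: obstacle-free world from (1, 1) to (9, 9).
path = rrt((1.0, 1.0), (9.0, 9.0), is_free=lambda p: True)
```

    RRT* refines the same loop by rewiring nearby nodes when a cheaper parent is found, which is what gives it asymptotic optimality over plain RRT.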
  • Item
    Towards robust and domain invariant feature representations in Deep Learning
    (2018) Sankaranarayanan, Swaminathan; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    A fundamental problem in perception-based systems is to define and learn representations of the scene that are robust and adaptive to several nuisance factors. Over the recent past, for a variety of tasks involving images, learned representations have been empirically shown to outperform handcrafted ones. However, their inability to generalize across varying data distributions poses the following question: do representations learned using deep networks just fit a given data distribution, or do they sufficiently model the underlying structure of the problem? This question can be understood using a simple example: if a learning algorithm is shown a number of images of a simple handwritten digit, then the representation learned should be generic enough to identify the same digit in a different form. Although the representations learned by deep networks have been shown to be robust to various forms of synthetic distortion such as random noise, they fail in the presence of more implicit forms of naturally occurring distortion. In this dissertation, we propose approaches to mitigate the effect of such distortions and, in the process, study some vulnerabilities of deep networks to small imperceptible changes in the input. The research problems that comprise this dissertation lie at the intersection of two open topics: (1) studying and developing methods that enable neural networks to learn robust representations, and (2) improving the generalization of neural networks across domains. The first part of the dissertation approaches the problem of robustness from two broad viewpoints: robustness to external nuisance factors that occur in the data, and robustness (or a lack thereof) to perturbations of the learned feature space. In the second part, we focus on learning representations that are invariant to external covariate shift, more commonly termed domain shift.
    Towards learning representations robust to external nuisance factors, we propose an approach that couples a deep convolutional neural network with a low-dimensional discriminative embedding learned using triplet probability constraints to solve the unconstrained face analysis problem. While previous approaches in this area have proposed scalable yet ad hoc solutions, we propose a principled and parameter-free formulation based on maximum likelihood estimation. In addition, we employ the principle of transfer learning to realize a deep network architecture that trains faster and on less data, yet significantly outperforms existing approaches on the unconstrained face verification task. We demonstrate the robustness of the approach to challenges including age, pose, blur, and clutter by performing clustering experiments on challenging benchmarks.
    Recent seminal works have shown that deep neural networks are susceptible to visually imperceptible perturbations of the input. In this dissertation, we build on their ideas in two unique ways: (a) we show that neural networks performing pixel-wise semantic segmentation also suffer from this vulnerability, despite being trained with more information than simple classification tasks; in addition, we present a novel self-correcting mechanism in segmentation networks and provide an efficient way to generate such perturbations; (b) we present a novel approach to regularize deep neural networks by perturbing intermediate layer activations in an efficient manner, thereby exploring the trade-off between conventional regularization and adversarial robustness in the context of very deep networks. Both of these works provide interesting directions towards understanding the security of deep learning algorithms.
    While humans find it extremely simple to generalize their knowledge across domains, machine learning algorithms, including deep neural networks, suffer from the problem of domain shift across what are commonly termed 'source' (S) and 'target' (T) distributions. Let the data that a learning algorithm is trained on be sampled from S. If the real data used to evaluate the model is instead sampled from T, then the learned model underperforms on the target data. This inability to generalize is characterized as domain shift. Our attempt to address this problem involves learning a common feature subspace in which the distance between the source and target distributions is minimized. Estimating the distance between different domains is highly non-trivial and is an open research problem in itself. In our approach, we parameterize the distance measure using a Generative Adversarial Network (GAN). A GAN involves a two-player game between two mappings, commonly termed the generator and the discriminator, which are learned simultaneously by employing an adversarial game: the generator tries to fool the discriminator, while the discriminator tries to outperform the generator. This adversarial game can be formulated as a minimax problem. In our approach, we learn three mappings simultaneously: the generator, the discriminator, and a feature mapping that contains information about both the content and the domain of the input. We deploy a two-level minimax game, where the first level is a competition between the generator and a discriminator, similar to a GAN; in the second-level game, the feature mapping attempts to fool the discriminator, thereby introducing domain invariance into the learned feature representation. We have extensively evaluated this approach on different tasks such as object classification and semantic segmentation, achieving state-of-the-art results across several real datasets. In addition to its conceptual novelty, our approach presents a more efficient and scalable solution than other approaches that attempt to solve the same problem.
    In the final part of this dissertation, we describe some ongoing efforts and future directions of research. Inspired by the study of perturbations described above, we propose a novel metric for effectively choosing which pixels of an image to label for a pixel-wise segmentation task. This has the potential to significantly reduce the labeling effort, and our preliminary results for the task of semantic segmentation are encouraging. While the domain adaptation approach proposed above considered static images, we propose an extension to video data aided by the use of recurrent neural networks. Using full temporal information, when available, provides the perceptual system with additional context to disambiguate among the smaller object classes that commonly occur in real scenes.
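    As one concrete instance of the imperceptible perturbations discussed above, here is a generic fast-gradient-sign sketch against a toy logistic classifier. This is a standard attack from the literature, not the dissertation's specific perturbation method; the model and the epsilon value are illustrative assumptions.

```python
import numpy as np

def fgsm_perturb(x, y, w, b, eps=0.05):
    """Generic fast-gradient-sign perturbation of input x against a
    logistic classifier (weights w, bias b, label y in {0, 1}).
    Illustrates the 'small imperceptible change' attacks discussed
    above; not the dissertation's specific perturbation method."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted probability
    grad_x = (p - y) * w                    # d(cross-entropy loss)/dx
    return x + eps * np.sign(grad_x)        # small step that raises the loss

# Toy usage: nudge a correctly classified point toward the boundary.
x = np.array([1.0, -0.5])
w = np.array([2.0, -1.0])
x_adv = fgsm_perturb(x, y=1.0, w=w, b=0.0, eps=0.3)
```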
  • Item
    Statistical and Geometric Modeling of Spatio-Temporal Patterns for Video Understanding
    (2009) Turaga, Pavan; Chellappa, Ramalingam; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Spatio-temporal patterns abound in the real world, and understanding them computationally holds the promise of enabling a large class of applications such as video surveillance, biometrics, computer graphics, and animation. In this dissertation, we study models and algorithms to describe complex spatio-temporal patterns in videos for a wide range of applications.
    The spatio-temporal pattern recognition problem involves recognizing an input video as an instance of a known class. For this problem, we show that a first-order Gauss-Markov process is an appropriate model for the space of primitives. We then show that the space of primitives is not a Euclidean space but a Riemannian manifold, and we use the geometric properties of this manifold to define distances and statistics. This paves the way to modeling temporal variations of the primitives. We then show applications of these techniques to activity recognition and pattern discovery from long videos.
    The pattern discovery problem, on the other hand, requires uncovering patterns from large datasets in an unsupervised manner for applications such as automatic indexing and tagging. Most state-of-the-art techniques index videos according to global scene content such as color, texture, and brightness. In this dissertation, we discuss the problem of activity-based indexing of videos. We examine the various issues involved in such an effort and describe a general framework to address the problem. We then design a cascade-of-dynamical-systems model for clustering videos based on their dynamics. We augment the traditional dynamical systems model in two ways. Firstly, we describe activities as a cascade of dynamical systems, which significantly enhances the expressive power of the model while retaining many of the computational advantages of dynamical models. Secondly, we derive methods to incorporate view- and rate-invariance into these models, so that similar actions are clustered together irrespective of viewpoint or the rate of execution of the activity. We also derive algorithms to learn the model parameters from a video stream and demonstrate how a given video sequence may be segmented into different clusters, each representing an activity.
    Finally, we show the broader impact of the algorithms and tools developed in this dissertation on several image-based recognition problems that involve statistical inference over non-Euclidean spaces. We demonstrate how an understanding of the geometry of the underlying space leads to methods that are more accurate than traditional approaches. We present examples in shape analysis, object recognition, video-based face recognition, and age estimation from facial features to demonstrate these ideas.
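    To give one concrete example of statistical inference over a non-Euclidean space, the sketch below computes a geodesic distance between linear subspaces via principal angles, a standard Riemannian distance on spaces of subspaces. It is a generic illustration of the idea, not the dissertation's exact construction.

```python
import numpy as np

def grassmann_dist(A, B):
    """Geodesic (arc-length) distance between the column spans of A and B
    via principal angles -- one standard example of a non-Euclidean
    distance on a Riemannian manifold of subspaces."""
    Qa, _ = np.linalg.qr(A)  # orthonormal basis for span(A)
    Qb, _ = np.linalg.qr(B)  # orthonormal basis for span(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.clip(np.linalg.svd(Qa.T @ Qb, compute_uv=False), -1.0, 1.0)
    return np.linalg.norm(np.arccos(cosines))

# Toy usage: distance between two random 2-D subspaces of R^4.
rng = np.random.default_rng(0)
d = grassmann_dist(rng.standard_normal((4, 2)), rng.standard_normal((4, 2)))
```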
  • Item
    Towards markerless motion capture: model estimation, initialization and tracking
    (2007-08-01) Sundaresan, Aravind; Chellappa, Ramalingam; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Motion capture is an important application in diverse areas such as biomechanics, computer animation, and human-computer interaction. Current motion capture methods use markers that are attached to the body of the subject and are therefore intrusive. In applications such as pathological human movement analysis, these markers may introduce unknown artifacts in the motion and are, in general, cumbersome. We present a computer vision based system for markerless human motion capture that uses images obtained from multiple synchronized and calibrated cameras. We model the human body as a set of rigid segments connected in articulated chains, and we work with a volumetric representation (voxels) of the subject computed from the camera images.
    We propose a novel, bottom-up approach to segment the voxels into different articulated chains based on their mutual connectivity, by mapping the voxels into Laplacian Eigenspace. We prove properties of the mapping showing that it is ideal for mapping voxels on non-rigid chains in normal space to nodes that lie on smooth 1D curves in Laplacian Eigenspace. We then use a 1D spline fitting procedure to segment the nodes according to the 1D curve they belong to. The segmentation is followed by a top-down approach that uses our knowledge of the structure of the human body to register the segmented voxels to different articulated chains such as the head, trunk, and limbs.
    We propose a hierarchical algorithm to simultaneously initialize and estimate the pose and body model parameters for the subject. Finally, we propose a tracking algorithm that uses the estimated human body model and the pose initialized on a single frame of a given sequence to track the pose for the remainder of the frames. The tracker uses an iterative algorithm that estimates the pose by combining motion and shape cues in a predictor-corrector framework; the motion and shape cues complement each other and overcome drift and local-minima problems. We provide results on 3D laser scans, synthetic data, and real video sequences with different subjects for our segmentation, model estimation, and pose estimation algorithms.
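    A small sketch of the Laplacian Eigenspace mapping may help: given a voxel-adjacency graph, the embedding coordinates come from the eigenvectors of the graph Laplacian with the smallest nonzero eigenvalues. The dense combinatorial Laplacian and the embedding dimension below are illustrative assumptions, not the dissertation's implementation.

```python
import numpy as np

def laplacian_eigenspace(W, dim=6):
    """Embed graph nodes (e.g. mutually connected voxels) into a
    low-dimensional Laplacian Eigenspace: the eigenvectors of
    L = D - W with the smallest nonzero eigenvalues supply the
    embedding coordinates (assumes a single connected component)."""
    W = np.asarray(W, dtype=float)
    L = np.diag(W.sum(axis=1)) - W        # combinatorial graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    # Skip the constant eigenvector (eigenvalue 0); keep the next `dim`.
    return eigvecs[:, 1:dim + 1]

# Toy usage: a 4-node path graph embedded in 2 dimensions.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
coords = laplacian_eigenspace(W, dim=2)
```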