A. James Clark School of Engineering
Permanent URI for this community: http://hdl.handle.net/1903/1654
The collections in this community comprise faculty research works, as well as graduate theses and dissertations.
Search Results
33 results
Item: SYNPLAY: IMPORTING REAL-WORLD DIVERSITY FOR A SYNTHETIC HUMAN DATASET (2024)
Yim, Jinsub; Bhattacharyya, Shuvra S.; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
In response to the growing demand for large-scale training data, synthetic datasets have emerged as practical solutions. However, existing synthetic datasets often fall short of replicating the richness and diversity of real-world data. Synthetic Playground (SynPlay) is introduced as a new synthetic human dataset that aims to bring out the diversity of human appearance in the real world. In this thesis, we focus on two factors to achieve a level of diversity that has not yet been seen in previous works: i) realistic human motions and poses and ii) multiple camera viewpoints of human instances. We first use a game engine and its library-provided elementary motions to create games in which virtual players can take less-constrained and natural movements while following the game rules (i.e., rule-guided motion design as opposed to detail-guided design). We then augment the elementary motions with real human motions captured with a motion capture device. To render various human appearances in the games from multiple viewpoints, we use seven virtual cameras encompassing the ground and aerial views, capturing abundant aerial-vs-ground and dynamic-vs-static attributes of the scene. Through extensive and carefully designed experiments, we show that using SynPlay in model training leads to enhanced accuracy over existing synthetic datasets for human detection and segmentation. Moreover, the benefit of SynPlay becomes even greater for tasks in the data-scarce regime, such as few-shot and cross-domain learning tasks. These results clearly demonstrate that SynPlay can be used as an essential dataset with rich attributes of complex human appearances and poses suitable for model pretraining.

Item: Leveraging Deep Generative Models for Estimation and Recognition (2023)
PNVR, Koutilya; Jacobs, David W.; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Generative models are a class of statistical models that estimate the joint probability distribution over a given observed variable and a target variable. In computer vision, generative models are typically used to model the joint probability distribution of a set of real image samples assumed to lie on a complex high-dimensional image manifold. Recently proposed deep generative architectures such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and diffusion models (DMs) have been shown to generate photo-realistic images of human faces and other objects. These generative models have also become popular for other generative tasks such as image editing and text-to-image synthesis. As appealing as the perceptual quality of the generated images has become, the use of generative models for discriminative tasks such as visual recognition or geometry estimation has not been well studied. Moreover, with different kinds of powerful generative models gaining popularity, it is important to study their significance in other areas of computer vision. In this dissertation, we demonstrate the advantages of using generative models for applications that go beyond photo-realistic image generation: unsupervised domain adaptation (UDA) between synthetic and real datasets for geometry estimation, and text-based image segmentation for recognition.
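For reference, the standard distinction this abstract hinges on, stated in textbook form (not taken from the dissertation itself): a generative model estimates the joint distribution of images and targets, whereas a discriminative model estimates only the conditional.

```latex
% Textbook distinction (for reference, not from the dissertation):
% generative models estimate the joint distribution of data x and target y,
% while discriminative models estimate only the conditional.
p_{\theta}(x, y) = p_{\theta}(x \mid y)\, p_{\theta}(y)
\qquad \text{vs.} \qquad
p_{\theta}(y \mid x)
```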
In the first half of the dissertation, we propose a novel generative-model-based UDA method for combining synthetic and real images when training networks to determine geometric information from a single image. Specifically, we use a GAN model to map both the synthetic and real domains into a shared image space by translating only the domain-specific, task-related information from the respective domains. This is connected to a primary network for end-to-end training. Ideally, this results in images from the two domains that present shared information to the primary network. Compared to previous approaches, we demonstrate improved domain-gap reduction and much better generalization between synthetic and real data for geometry estimation tasks such as monocular depth estimation and face normal estimation. In the second half of the dissertation, we showcase the power of a recent class of generative models for improving an important recognition task: text-based image segmentation. Specifically, large-scale pre-training tasks like image classification, captioning, or self-supervised techniques do not incentivize learning the semantic boundaries of objects. However, recent generative foundation models built using text-based latent diffusion techniques may learn semantic boundaries, because they must synthesize intricate details about all objects in an image based on a text description. Therefore, we present a technique for segmenting real and AI-generated images using latent diffusion models (LDMs) trained on internet-scale datasets. First, we show that the latent space of LDMs (z-space) is a better input representation than other feature representations such as RGB images or CLIP encodings for text-based image segmentation. By training the segmentation models on the latent z-space, which creates a compressed representation across several domains such as different forms of art, cartoons, illustrations, and photographs, we are also able to bridge the domain gap between real and AI-generated images. We show that the internal features of LDMs contain rich semantic information and present a technique in the form of LD-ZNet to further boost the performance of text-based segmentation. Overall, we show up to 6% improvement over standard baselines for text-to-image segmentation on natural images. For AI-generated imagery, we show close to 20% improvement compared to state-of-the-art techniques.

Item: Activity Detection in Untrimmed Videos (2023)
Gleason, Joshua D; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
In this dissertation, we present solutions to the problem of activity detection in untrimmed videos, where we are interested in identifying both when and where various activity instances occur within an unconstrained video. Advances in machine learning, particularly the widespread adoption of deep learning-based methods, have yielded robust solutions to a number of historically difficult computer vision application domains. For example, recent systems for object recognition and detection, facial identification, and a number of language processing applications have found widespread commercial success. In some cases, such systems have been able to outperform humans. The same cannot be said for the problem of activity detection in untrimmed videos. This dissertation describes our investigation and innovative solutions for the challenging problem of real-time activity detection in untrimmed videos.
The main contributions of our work are the introduction of multiple novel activity detection systems that make strides toward the goal of commercially viable activity detection. The first work introduces a proposal mechanism based on divisive hierarchical clustering of objects to produce cuboid activity proposals, followed by a classification and temporal refinement step. The second work proposes a chunk-based processing mechanism and explores the tradeoff between tube and cuboid proposals. The third work explores the topic of real-time activity detection and introduces strategies for achieving this performance. The final work provides a detailed look into multiple novel extensions that improve upon the state of the art in the field.

Item: Investigation of Swirl Distributed Combustion with Experimental Diagnostics and Artificial Intelligence Approach (2022)
Roy, Rishi; Gupta, Ashwani K; Mechanical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Swirl distributed combustion was fundamentally investigated with experimental diagnostics and predictive analysis using machine learning and computer vision techniques. Ultra-low pollutant emissions, stable operation, improved pattern factor, and fuel flexibility make distributed combustion an attractive technology for potential applications in high-intensity stationary gas turbines. Proper mixing of inlet fresh air and hot products to create a hot, low-oxygen environment, followed by rapid mixing with the fuel, is critical to foster distributed combustion. Such conditions result in a thick, distributed reaction zone without the hotspots found in the thin reaction front of conventional diffusion flames, leading to reduced NOx and CO emissions. The focus of this dissertation is to develop a detailed fundamental understanding of distributed combustion in a lab-based swirl combustor (to mimic a gas turbine can combustor) at moderate heat release intensities in the range of 5.72-9.53 MW/m³-atm using various low-carbon gaseous fuels such as methane, propane, and hydrogen-enriched fuels. The study of distributed combustion at moderate thermal intensity helped to develop an understanding of fundamental aspects such as reduction of flame fluctuation, mitigation of thermo-acoustic instability, flame shape evolution, flow field behavior, turbulence characteristics, variation of the Damköhler number, vortex propagation, flame blowoff, and pollutant and CO2 emission reduction with gradual mixture preparation. Initial efforts were made to obtain the volumetric distribution ratio, evolution of flame shape in terms of OH* radical imaging, variation of flame standoff, thermal field uniformity, and NO and CO emissions as the flame transitions to a distributed reaction zone. Further investigation was performed to study the mitigation of flame thermo-acoustic and precessing vortex core (PVC) instabilities in swirl distributed combustion compared to swirl air combustion, using acoustic pressure and qualitative heat release fluctuation data at different CO2 dilution levels with and without air preheat. The proper orthogonal decomposition (POD) technique was utilized to visualize the appearance of dynamic coherent structures in reactive flow fields and the reduction of fluctuation energy. Vortex shedding was found to be responsible for the fluctuation in swirl air combustion, while no significant flame fluctuation was observed in distributed combustion.
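As a rough illustration of the snapshot-POD analysis mentioned above, the sketch below extracts spatial modes, modal energies, and temporal coefficients from a stack of flow-field snapshots via an SVD; the data layout and variable names are illustrative assumptions, not taken from the dissertation.

```python
import numpy as np

def snapshot_pod(snapshots):
    """Snapshot POD of a flow field.

    snapshots: array of shape (n_points, n_snapshots); each column is one
    flattened velocity (or heat-release) field sampled in time.
    Returns the spatial modes, modal energy fractions, and temporal coefficients.
    """
    # Subtract the temporal mean so modes describe fluctuations about the mean flow.
    mean_field = snapshots.mean(axis=1, keepdims=True)
    fluctuations = snapshots - mean_field

    # Thin SVD: columns of U are spatial POD modes, s**2 ~ modal fluctuation energy.
    U, s, Vt = np.linalg.svd(fluctuations, full_matrices=False)
    energy = s**2 / np.sum(s**2)   # fraction of fluctuation energy per mode
    coeffs = np.diag(s) @ Vt       # temporal coefficients of each mode
    return U, energy, coeffs

# Example with synthetic data: 10,000 spatial points, 200 PIV snapshots.
modes, energy, coeffs = snapshot_pod(np.random.rand(10_000, 200))
print("Energy captured by first 3 modes:", energy[:3].sum())
```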
Distributed combustion showed significantly reduced acoustic noise and much higher stability, quantified by the local and global Rayleigh indices. This study was extended with hydrogen-enriched methane (0, 10, 20, and 40% H2 by volume) to compare the stability of the flow field in conventional air combustion and distributed combustion. Results were consistent, and distributed reaction zones showed higher flame stability compared to conventional swirl air combustion. The study of lean blowoff in distributed combustion showed a higher lean blowoff equivalence ratio with a gradual increase in heat release intensity, which was attributed to higher flow field instability caused by enhanced inlet turbulence. Extension of the lean blowoff limit (ϕLBO) was observed with gradual H2 enrichment, which showed a decrease of the lean blowoff equivalence ratio in distributed reaction zones. Additionally, a reduction in ϕLBO was achieved by preheating the inlet airstream for the different H2 enrichment cases, owing to the enhanced flame stability gained from preheating. Examination of the non-reactive flow field with particle image velocimetry (PIV) was performed to understand the fundamental differences between swirl flow and distributed reaction flow at constant heat release intensities. Higher rms fluctuations, indicating healthy turbulence, and higher Reynolds stresses were found in the distributed reaction flow cases, signifying enhanced mixing characteristics in distributed combustion. Reduction of pollutant emissions was an important focus of this research. Measurements of NO and CO emissions at different mixture preparation levels exhibited a significant reduction in NO emission (to single-digit levels) compared to swirl air combustion, due to the mitigation of spatial hotspots and temperature peaks. Additionally, better mixing and uniform stoichiometry supported reduced CO emissions in distributed combustion for every fuel. With increased H2 in the fuel, NO gradually increased for air combustion, while NO was reduced in distributed combustion due to a decrease in thermal and prompt NO generation. Finally, the use of machine learning and computer vision techniques was investigated for software-based prediction of combustion parameters (pollutants and flame temperature) and feature-based recognition of distributed combustion regimes. The primary goal of using artificial intelligence is to reduce experimentation time and frequent manual intervention during experiments, in order to enhance overall accuracy by reducing human error. Such predictions will help in developing data-driven smart sensing of combustion parameters and reduce the dependence on experimental trials.

Item: FACIAL EXPRESSION RECOGNITION AND EDITING WITH LIMITED DATA (2020)
Ding, Hui; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Over the past five years, methods based on deep features have taken over the computer vision field. While dramatic performance improvements have been achieved for tasks such as face detection and verification, these methods usually need large amounts of annotated data. In practice, not all computer vision tasks have access to large amounts of annotated data. Facial expression analysis is one such task. In this dissertation, we focus on facial expression recognition and editing problems with small datasets. In addition, to cope with challenging conditions like pose and occlusion, we also study unaligned facial attribute detection and occluded expression recognition problems.
This dissertation has been divided into four parts. In the first part, we present FaceNet2ExpNet, a novel approach to training a lightweight, high-accuracy classification model for expression recognition with small datasets. We first propose a new distribution function to model the high-level neurons of the expression network. Based on this, a two-stage training algorithm is carefully designed. In the pre-training stage, we train the convolutional layers of the expression net, regularized by the face net; in the refining stage, we append fully-connected layers to the pre-trained convolutional layers and train the whole network jointly. Visualization shows that the model trained with our method captures improved high-level expression semantics. Evaluations on four public expression databases demonstrate that our method achieves better results than the state of the art. In the second part, we focus on robust facial expression recognition under occlusion and propose a landmark-guided attention branch to find and discard corrupted feature elements from recognition. An attention map is first generated to indicate whether a specific facial part is occluded and to guide our model to attend to the non-occluded regions. To further increase robustness, we propose a facial region branch to partition the feature maps into non-overlapping facial blocks and enforce each block to predict the expression independently. Owing to the synergistic effect of the two branches, our occlusion-adaptive deep network significantly outperforms state-of-the-art methods on two challenging in-the-wild benchmark datasets and three real-world occluded expression datasets. In the third part, we propose a cascade network that simultaneously learns to localize face regions specific to attributes and performs attribute classification without alignment. First, a weakly-supervised face region localization network is designed to automatically detect regions (or parts) specific to attributes. Then multiple part-based networks and a whole-image-based network are separately constructed and combined by the region switch layer and the attribute relation layer for final attribute classification. A multi-net learning method and hint-based model compression are further proposed to obtain an effective localization model and a compact classification model, respectively. Our approach achieves significantly better performance than state-of-the-art methods on the unaligned CelebA dataset, reducing the classification error by 30.9%. In the final part of this dissertation, we propose an Expression Generative Adversarial Network (ExprGAN) for photo-realistic facial expression editing with controllable expression intensity. An expression controller module is specially designed to learn an expressive and compact expression code in addition to the encoder-decoder network. This novel architecture enables the expression intensity to be continuously adjusted from low to high. We further show that our ExprGAN can be applied to other tasks, such as expression transfer, image retrieval, and data augmentation for training improved face expression recognition models. To tackle the small size of the training database, an effective incremental learning scheme is proposed.
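To make the intensity-controllable expression code described above concrete, here is a minimal toy sketch of a decoder conditioned on an identity code and a compact expression code, where scaling the expression code sweeps the expression intensity; the architecture, dimensions, and names are illustrative assumptions, not ExprGAN's actual implementation.

```python
import torch
import torch.nn as nn

class ConditionalExpressionDecoder(nn.Module):
    """Toy decoder conditioned on an identity code and a compact expression code."""
    def __init__(self, id_dim=128, expr_dim=10):
        super().__init__()
        self.fc = nn.Linear(id_dim + expr_dim, 256 * 4 * 4)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, id_code, expr_code):
        h = self.fc(torch.cat([id_code, expr_code], dim=1))
        return self.deconv(h.view(-1, 256, 4, 4))

decoder = ConditionalExpressionDecoder()
id_code = torch.randn(1, 128)    # identity/content code from an encoder
expr_code = torch.randn(1, 10)   # compact expression code from a controller module
# Sweeping a scale on the expression code varies the expression intensity continuously.
faces = [decoder(id_code, s * expr_code) for s in (0.2, 0.6, 1.0)]  # (1, 3, 64, 64) images in [-1, 1]
```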
Quantitative and qualitative evaluations on the widely used Oulu-CASIA dataset demonstrate the effectiveness of ExprGAN.

Item: Augmented Deep Representations for Unconstrained Still/Video-based Face Recognition (2019)
Zheng, Jingxiao; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Face recognition is one of the active areas of research in computer vision and biometrics. Many approaches have been proposed in the literature that demonstrate impressive performance, especially those based on deep learning. However, unconstrained face recognition with large pose, illumination, occlusion, and other variations is still an unsolved problem. Unconstrained video-based face recognition is even more challenging due to the large volume of data to be processed, the lack of labeled training data, and significant intra/inter-video variations in scene, blur, video quality, etc. Although Deep Convolutional Neural Networks (DCNNs) have provided discriminative representations for faces and achieved performance surpassing humans in controlled scenarios, modifications are necessary for face recognition in unconstrained conditions. In this dissertation, we propose several methods that improve unconstrained face recognition performance by augmenting the representation provided by the deep networks using correlation or contextual information in the data. For unconstrained still face recognition, we present an encoding approach that combines Fisher vector (FV) encoding and DCNN representations, which is called FV-DCNN. The feature maps from the last convolutional layer in the deep network are encoded by FV into a robust representation, which utilizes the correlation between facial parts within each face. A VLAD-based encoding method called VLAD-DCNN is also proposed as an extension. Extensive evaluations on three challenging face recognition datasets show that the proposed FV-DCNN and VLAD-DCNN perform comparably to or better than many state-of-the-art face verification methods. For the more challenging video-based face recognition task, we first propose an automatic system and model the video-to-video similarity as a subspace-to-subspace similarity, where the subspaces characterize the correlation between deep representations of faces in videos. In the system, a quality-aware subspace-to-subspace similarity is introduced, where subspaces are learned using quality-aware principal component analysis. Subspaces, along with quality-aware exemplars of templates, are used to produce similarity scores between video pairs via a quality-aware, principal angle-based subspace-to-subspace similarity metric. The method is evaluated on four video datasets. The experimental results demonstrate the superior performance of the proposed method. To utilize the temporal information in videos, a hybrid dictionary learning method is also proposed for video-based face recognition. The proposed unsupervised approach effectively models the temporal correlation between deep representations of video faces using dynamical dictionaries. A practical iterative optimization algorithm is introduced to learn the dynamical dictionary. Experiments on three video-based face recognition datasets demonstrate that the proposed method can effectively learn robust and discriminative representations for videos and improve face recognition performance.
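As background for the principal angle-based subspace-to-subspace similarity mentioned above, here is a minimal sketch using standard linear algebra (without the dissertation's quality-aware weighting); the matrix shapes and the mean-cosine similarity are illustrative choices.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles between the column spaces of A and B.

    A: (d, k1) and B: (d, k2) matrices whose columns are deep face features
    from two video templates. Returns angles in radians, smallest first.
    """
    # Orthonormal bases for the two subspaces.
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

# A simple subspace-to-subspace similarity: mean cosine of the principal angles.
def subspace_similarity(A, B):
    return float(np.cos(principal_angles(A, B)).mean())

# Example: two templates with 512-D features and different numbers of frames.
print(subspace_similarity(np.random.randn(512, 8), np.random.randn(512, 12)))
```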
Finally, to leverage contextual information in videos, we present the Uncertainty-Gated Graph (UGG) for unconstrained video-based face recognition. It utilizes contextual information between faces by conducting graph-based identity propagation between sample tracklets, where identity information is initialized from the deep representations of video faces. UGG explicitly models the uncertainty of the contextual connections between tracklets by adaptively updating the weights of the edge gates according to the identity distributions of the nodes during inference. UGG is a generic graphical model that can be applied at inference time only or with end-to-end training. We demonstrate the effectiveness of UGG with state-of-the-art results on the recently released and challenging Cast Search in Movies and IARPA Janus Surveillance Video Benchmark datasets.

Item: DEEP INFERENCE ON MULTI-SENSOR DATA (2019)
Ghosh, Arthita; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Computer vision-based intelligent autonomous systems engage various types of sensors to perceive the world they navigate in. Vision systems perceive their environments through inferences on entities (structures, humans) and their attributes (pose, shape, materials) that are sensed using RGB and Near-InfraRed (NIR) cameras, LAser Detection And Ranging (LADAR), radar, and so on. This leads to challenging and interesting problems in efficient data capture, feature extraction, and attribute estimation, not only for RGB but also for various other sensors. In some cases, we encounter very limited amounts of labeled training data. In certain other scenarios we have sufficient data, but annotations are unavailable for supervised learning. This dissertation explores two approaches to learning under conditions of minimal to no ground truth. The first approach applies projections to training data that make learning efficient by improving training dynamics. The first and second topics in this dissertation belong to this category. The second approach makes learning without ground truth possible via knowledge transfer from a labeled source domain to an unlabeled target domain through projections to domain-invariant shared latent spaces. The third and fourth topics in this dissertation belong to this category. For the first topic, we study the feasibility and efficacy of identifying shapes in LADAR data in several measurement modes. We present results on efficient parameter learning with less data (for both traditional machine learning and deep models) on LADAR images. We use a LADAR apparatus to obtain range information from a 3-D scene by emitting laser beams and collecting the reflected rays from target objects in the region of interest. The Agile Beam LADAR concept makes the measurement and interpretation process more efficient using a software-defined architecture that leverages computational imaging principles. Using these techniques, we show that object identification and scene understanding can be accurately performed in the LADAR measurement domain, thereby rendering pixel-based scene reconstruction superfluous. Next, we explore the effectiveness of deep features extracted by Convolutional Neural Networks (CNNs) in the Discrete Cosine Transform (DCT) domain for various image classification tasks such as pedestrian and face detection, material identification, and object recognition.
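As a rough sketch of the DCT-domain feature idea introduced here (and elaborated in the next paragraph), the code below applies a 2-D DCT over the spatial dimensions of the feature maps produced by a first convolutional layer; the layer sizes and the use of SciPy's dctn are illustrative assumptions, and a practical version would need a differentiable DCT to train end-to-end.

```python
import torch
import torch.nn as nn
from scipy.fft import dctn

# Toy first convolutional stage (weights are random here, just for illustration).
conv1 = nn.Sequential(nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU())

image = torch.randn(1, 3, 64, 64)   # a single RGB input
feature_maps = conv1(image)          # (1, 32, 64, 64) feature maps

# Apply a 2-D DCT to each feature map (over the spatial dimensions) so that the
# rest of the network operates on DCT-domain features instead of spatial ones.
dct_features = torch.from_numpy(
    dctn(feature_maps.detach().numpy(), axes=(-2, -1), norm="ortho")
)
print(dct_features.shape)            # torch.Size([1, 32, 64, 64])
```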
We perform the DCT operation on the feature maps generated by convolutional layers in CNNs. We compare the performance of the same network, with the same hyper-parameters, with and without the DCT step. Our results indicate that a DCT operation incorporated into the network after the first convolution layer can have certain advantages, such as convergence over fewer training epochs and sparser weight matrices that are more conducive to pruning and hashing techniques. Next, we present an adversarial deep domain adaptation (ADA)-based approach for training deep neural networks that fit 3D meshes to humans in monocular RGB input images. Estimating a 3D mesh from a 2D image is helpful in harvesting complete 3D information about body pose and shape. However, learning such an estimation task in a supervised way is challenging because ground-truth 3D mesh parameters for real humans do not exist. We propose a domain-adaptation-based, single-shot (no re-projection, no iterative refinement), end-to-end training approach with joint optimization on real and synthetic images on a shared common task. Through joint inference on real and synthetic data, the network extracts domain-invariant features that are further used to estimate the 3D mesh parameters in a single shot with no supervision on real samples. While we compute a regression loss on synthetic samples with ground-truth mesh parameters, knowledge is transferred from synthetic to real data through ADA without direct ground truth for supervision. Finally, we propose a partially supervised method for satellite image super-resolution by learning a unified representation of samples from different domains (captured by different sensors) in a shared latent space. The training samples are drawn from two datasets, which we refer to as the source and target domains. The source domain consists of fewer samples, which are of higher resolution and contain very detailed and accurate annotations. In contrast, samples from the target domain are low-resolution, and the available ground truth is sparse. The pipeline consists of a feature extractor and a super-resolving module, which are trained end-to-end. Using a deep feature extractor, we jointly learn (on both datasets) a common embedding space for all samples. Partial supervision is available for the samples in the source domain, which have high-resolution ground truth. Adversarial supervision is used to successfully super-resolve low-resolution RGB satellite imagery from the target domain without direct paired supervision from high-resolution counterparts.

Item: Constraints and Priors for Inverse Rendering from Limited Observations (2019)
SENGUPTA, SOUMYADIP; Jacobs, David W; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Inverse Rendering deals with recovering the underlying intrinsic components of an image, i.e., geometry, reflectance, illumination, and the camera with which the image was captured. Inferring these intrinsic components of an image is a fundamental problem in computer vision. Solving Inverse Rendering unlocks a host of real-world applications in augmented and virtual reality, robotics, computational photography, and gaming. Researchers have made significant progress in solving Inverse Rendering from a large number of images of an object or a scene under relatively constrained settings. However, most real-life applications rely on a single image or a small number of images captured in an unconstrained environment.
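As a concrete example of how such intrinsic components combine, one common image-formation model (a standard Lambertian plus spherical-harmonics assumption, not necessarily the exact model used in this dissertation) writes the intensity at each pixel as:

```latex
% I_p : rendered intensity at pixel p
% a_p : albedo (reflectance) at p
% n_p : surface normal at p (geometry)
% H_k : spherical-harmonics basis functions evaluated at n_p
% l_k : lighting coefficients (illumination)
I_p \;=\; a_p \sum_{k=1}^{9} l_k \, H_k(n_p)
```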
Thus, in this thesis, we explore Inverse Rendering under limited observations from unconstrained images. We consider two different approaches for solving Inverse Rendering under limited observations. First, we consider learning data-driven priors that can be used for Inverse Rendering from a single image. Our goal is to jointly learn all intrinsic components of an image, such that we can recombine them and train on unlabeled real data using a self-supervised reconstruction loss. A key component that enables self-supervision is a differentiable rendering module that can combine the intrinsic components to accurately regenerate the image. We show how such a self-supervised reconstruction loss can be used for Inverse Rendering of faces. While this is relatively straightforward for faces, complex appearance effects (e.g., inter-reflections, cast shadows, and near-field lighting) present in a scene cannot be captured with a differentiable rendering module. Thus, we also propose a deep CNN-based differentiable rendering module (Residual Appearance Renderer) that can capture these complex appearance effects and enable self-supervised learning. Another contribution is a novel Inverse Rendering architecture, SfSNet, that performs Inverse Rendering for faces and scenes. Second, we consider enforcing low-rank multi-view constraints in an optimization framework to enable Inverse Rendering from a few images. To this end, we propose a novel multi-view rank constraint that connects all cameras capturing all the images in a scene and is enforced to ensure accurate camera recovery. We also jointly enforce a low-rank constraint and remove ambiguity to perform accurate Uncalibrated Photometric Stereo from a few images. In these problems, we formulate a constrained low-rank optimization problem in the presence of noisy estimates and missing data. Our proposed optimization framework can handle this non-convex optimization using the Alternating Direction Method of Multipliers (ADMM). Given a few images, enforcing low-rank constraints significantly improves Inverse Rendering.

Item: Towards a Fast and Accurate Face Recognition System from Deep Representations (2019)
Ranjan, Rajeev; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
The key components of a machine perception algorithm are feature extraction followed by classification or regression. The features representing the input data should have the following desirable properties: 1) they should contain the discriminative information required for accurate classification; 2) they should be robust and adaptive to variations in the input data due to illumination, translation/rotation, resolution, and input noise; and 3) they should lie on a simple manifold for easy classification or regression. Over the years, researchers have come up with various hand-crafted techniques to extract meaningful features. However, these features do not perform well for data collected in unconstrained settings due to large variations in appearance and other nuisance factors. Recent developments in deep convolutional neural networks (DCNNs) have shown impressive performance improvements on various machine perception tasks such as object detection and recognition. DCNNs are highly non-linear regressors because of the presence of hierarchical convolutional layers with non-linear activations.
Unlike hand-crafted features, DCNNs learn the feature extraction and feature classification/regression modules from the data itself in an end-to-end fashion. This enables DCNNs to be robust to variations present in the data and, at the same time, improve their discriminative ability. Ever-increasing computation power and the availability of large datasets have led to significant performance gains from DCNNs. However, these developments in deep learning are not directly applicable to face analysis tasks due to large variations in illumination, resolution, viewpoint, and attributes of faces acquired in unconstrained settings. In this dissertation, we address this issue by developing efficient DCNN architectures and loss functions for multiple face analysis tasks such as face detection, pose estimation, landmark localization, and face recognition from unconstrained images and videos. In the first part of this dissertation, we present two face detection algorithms based on deep pyramidal features. The first face detector, called DP2MFD, utilizes the concepts of the deformable parts model (DPM) in the context of deep learning. It is able to detect faces of various sizes and poses in unconstrained conditions, and it reduces the gap in training and testing of DPM on deep features by adding a normalization layer to the DCNN. The second face detector, called the Deep Pyramid Single Shot Face Detector (DPSSD), is fast and capable of detecting faces with large scale variations (especially tiny faces). It makes use of the inbuilt pyramidal hierarchy present in a DCNN, instead of creating an image pyramid. Extensive experiments on publicly available unconstrained face detection datasets show that both face detectors are able to capture the meaningful structure of faces and perform significantly better than many traditional face detection algorithms. In the second part of this dissertation, we present two algorithms for simultaneous face detection, landmark localization, pose estimation, and gender recognition using DCNNs. The first method, called HyperFace, fuses the intermediate layers of a DCNN using a separate CNN, followed by a multi-task learning algorithm that operates on the fused features. The second approach, All-In-One Face, extends HyperFace to incorporate the additional tasks of face verification, age estimation, and smile detection. HyperFace and All-In-One Face exploit the synergy among the tasks, which improves individual task performance. In the third part of this dissertation, we focus on improving the task of face verification by designing a novel loss function that maximizes the inter-class distance and minimizes the intra-class distance in the feature space. We propose a new loss function, called Crystal Loss, that adds an L2-constraint to the feature descriptors, restricting them to lie on a hypersphere of fixed radius. This module can be easily implemented using existing deep learning frameworks. We show that integrating this simple step in the training pipeline significantly boosts the performance of face verification. We additionally describe a deep learning pipeline for unconstrained face identification and verification which achieves state-of-the-art performance on several benchmark datasets. We provide the design details of the various modules involved in automatic face recognition: face detection, landmark localization and alignment, and face identification/verification.
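The L2-constrained feature step described above can be sketched roughly as follows; the feature dimension, the radius alpha, and the plain softmax head are illustrative assumptions rather than the exact Crystal Loss implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class L2ConstrainedSoftmax(nn.Module):
    """Softmax cross-entropy on features constrained to a hypersphere of fixed radius.

    Features are L2-normalized and rescaled to a fixed radius alpha before the
    final classification layer, so all descriptors lie on the same hypersphere.
    """
    def __init__(self, feat_dim=512, num_classes=1000, alpha=50.0):
        super().__init__()
        self.alpha = alpha
        self.classifier = nn.Linear(feat_dim, num_classes, bias=False)

    def forward(self, features, labels):
        # Project features onto the hypersphere of radius alpha.
        constrained = self.alpha * F.normalize(features, p=2, dim=1)
        logits = self.classifier(constrained)
        return F.cross_entropy(logits, labels)

loss_fn = L2ConstrainedSoftmax()
features = torch.randn(8, 512)          # descriptors from a face DCNN
labels = torch.randint(0, 1000, (8,))   # identity labels
loss = loss_fn(features, labels)
```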
We present experimental results for end-to-end face verification and identification on IARPA Janus Benchmarks A, B, and C (IJB-A, IJB-B, IJB-C), and the Janus Challenge Set 5 (CS5). Though DCNNs have surpassed human-level performance on tasks such as object classification and face verification, they can easily be fooled by adversarial attacks. These attacks add a small perturbation to the input image that causes the network to misclassify the sample. In the final part of this dissertation, we focus on safeguarding DCNNs and neutralizing adversarial attacks through compact feature learning. In particular, we show that learning features in a closed and bounded space improves the robustness of the network. We explore the effect of Crystal Loss, which enforces compactness in the learned features, resulting in enhanced robustness to adversarial perturbations. Additionally, we propose compact convolution, a novel method of convolution that, when incorporated into conventional CNNs, improves their robustness. Compact convolution ensures feature compactness at every layer, so that features are bounded and close to each other. Extensive experiments show that Compact Convolutional Networks (CCNs) neutralize multiple types of attacks and perform better than existing methods in defending against adversarial attacks, without incurring any additional training overhead compared to CNNs.

Item: Machine Learning of Facial Attributes Using Explainable, Secure and Generative Adversarial Networks (2018)
Samangouei, Pouya; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
"Attributes" are abstractions that humans use to group entities and phenomena that share a common characteristic. In machine learning (ML), attributes are fundamental because they bridge the semantic gap between humans and ML systems. Thus, researchers have been using this concept to transform complicated ML systems into interactive ones. However, training the attribute detectors that are central to attribute-based ML systems can still be challenging. It might be infeasible to gather attribute labels for rare combinations to cover all the corner cases, which can result in weak detectors. Also, it is not clear how to fill in the semantic gap with attribute detectors themselves. Finally, it is not obvious how to interpret the detectors' outputs in the presence of adversarial noise. First, we investigate the effectiveness of attributes for bridging the semantic gap in complicated ML systems. We turn a system that performs continuous authentication of human faces on mobile phones into an interactive attribute-based one. We employ deep multi-task learning in conjunction with multi-view classification using facial parts to tackle this problem. We show how the proposed system decomposition enables efficient deployment of deep networks for authentication on mobile phones with limited resources. Next, we seek to improve the attribute detectors by using conditional image synthesis. We take a generative modeling approach to manipulating the semantics of a given image to provide novel examples. Previous works condition the generation process on binary attribute existence values. We take this type of approach one step further by modeling each attribute as a distributed representation in a vector space. These representations allow us not only to toggle the presence of attributes but also to transfer an attribute style from one image to another.
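A toy sketch of the distributed attribute representation idea described above: an image latent is decoded together with a scaled attribute code, so sweeping the scale toggles the attribute, while swapping in a code extracted from another image would transfer its style. All module sizes and names here are illustrative assumptions, not the dissertation's model.

```python
import torch
import torch.nn as nn

# Toy stand-ins for a learned image encoder/decoder and an attribute embedding table.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))
decoder = nn.Sequential(nn.Linear(128 + 16, 3 * 64 * 64), nn.Tanh())
attribute_codes = nn.Embedding(40, 16)   # one 16-D distributed code per attribute

def edit_attribute(image, attr_id, strength):
    """Decode an image latent together with a scaled attribute code."""
    z = encoder(image)                                          # (1, 128) image latent
    a = strength * attribute_codes(torch.tensor([attr_id]))     # scaled attribute code
    out = decoder(torch.cat([z, a], dim=1))
    return out.view(1, 3, 64, 64)

image = torch.randn(1, 3, 64, 64)
# Sweeping the strength toggles the attribute on the reconstructed image.
edits = [edit_attribute(image, attr_id=5, strength=s) for s in (0.0, 0.5, 1.0)]
```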
Furthermore, we show diverse image generation from the same set of conditions, which was not possible using existing methods with a single dimension per attribute. We then investigate filling in the semantic gap between humans and attribute classifiers by proposing a new way to explain pre-trained attribute detectors. We use adversarial training in conjunction with an encoder-decoder model to learn the behavior of binary attribute classifiers. We show that after our proposed model is trained, one can see which areas of the image contribute to the presence or absence of the target attribute, and also how to change image pixels in those areas so that the attribute classifier's decision changes in a way that is consistent with human perception. Finally, we focus on protecting the attribute models from uninterpretable behaviors provoked by adversarial perturbations. These behaviors create an inexplicable semantic gap because the perturbations are visually unnoticeable. We propose a method based on generative adversarial networks to alleviate this issue. We learn the distribution of the training data used to train the core classifier and use it to detect and denoise test samples. We show that the method is effective for defending facial attribute detectors.
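A generic sketch of the detect-and-denoise idea described above, in the spirit of projecting a test sample onto the range of a generator trained on the classifier's training distribution; the optimizer, step count, and thresholding are illustrative assumptions, not the dissertation's exact procedure.

```python
import torch

def purify(x, generator, steps=200, lr=0.05, z_dim=100):
    """Project an input image batch onto the range of a pre-trained generator.

    Returns the reconstruction G(z*) and the residual ||G(z*) - x||^2. A large
    residual suggests the input lies off the learned data manifold and can be
    flagged as adversarial; the reconstruction can be classified in place of x.
    """
    z = torch.randn(x.size(0), z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.mean((generator(z) - x) ** 2)
        loss.backward()
        opt.step()
    with torch.no_grad():
        recon = generator(z)
        residual = torch.mean((recon - x) ** 2, dim=(1, 2, 3))
    return recon, residual
```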