Electrical & Computer Engineering Theses and Dissertations
Permanent URI for this collection: http://hdl.handle.net/1903/2765
Item FROM PARTS TO WHOLE IN ACTION AND OBJECT UNDERSTANDING (2024) Devaraj, Chinmaya; Aloimonos, Yiannis; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
The traditional paradigm of supervised learning in action or object recognition often relies on a top-down approach, ignoring explicit modeling of what activities or objects consist of. Recent approaches in generative AI research have demonstrated the ability to generate images and videos from text, indirectly indicating that we have control over the constituents of images and videos. In this dissertation, we explore ways to use the constituents of actions to develop methods that improve action understanding. We devise different approaches to utilize the parts of actions, namely object motion, object state changes, and motion descriptions obtained from LLMs, in tasks such as next active object segmentation, zero-shot action recognition, and video-text retrieval. We show promising benefits in action anticipation, zero-shot action recognition, and text-video retrieval tasks, demonstrating the practical applications of our methods.
In the first part of the dissertation, we explore the idea of using the constituents of actions in GCNs for zero-shot human-object action recognition. The main idea is that semantically similar actions (those with similar constituents) are closer in feature space. Thus, in our graph, we encode edges connecting such actions with higher similarity. We introduce a method to visually ground the external knowledge graph using the concept of shared similarity between similar actions. We evaluate the method on the EPIC Kitchens dataset and the Charades dataset, showing impressive results over baseline methods. We further show that visually grounding the knowledge graph enhances the performance of GCNs when an adversarial attack corrupts the input graph.
In the second part of the thesis, we extend our ideas on human-object interactions to first-person videos. Human actions involving hand manipulations are structured according to the making and breaking of hand-object contact, and human visual understanding of action relies on anticipation of contact, as demonstrated by pioneering work in cognitive science. Taking inspiration from this, we introduce representations and models centered on contact, which we then use in action prediction and anticipation. We train the Anticipation Module, a module producing Contact Anticipation Maps and Next Active Object Segmentations, novel low-level representations providing temporal and spatial characteristics of anticipated near-future action. On top of the Anticipation Module, we apply Egocentric Object Manipulation Graphs (Ego-OMG), a framework for action anticipation and prediction. Using the Anticipation Module to aid Ego-OMG produces state-of-the-art results, achieving first and second places on the unseen and seen test sets of the EPIC Kitchens Action Anticipation Challenge and achieving state-of-the-art results on action anticipation and action prediction over EPIC Kitchens.
In the same line of thinking about the constituents of action, we next investigate how motion understanding can be modeled in current video-text models. We introduce motion descriptions generated by GPT-4 for three action datasets, capturing fine-grained descriptions of the motion in activities. We evaluate several video-text models on the task of motion description retrieval and find that they fall well short of human expert performance.
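As a rough illustration of how such motion-description retrieval is typically scored, the sketch below ranks stand-in text embeddings against stand-in video embeddings by cosine similarity and reports recall@k. The encoders here are random placeholders, not the models evaluated in the dissertation.

```python
# Sketch of embedding-based motion-description retrieval scoring (recall@k).
# The "embeddings" below are random stand-ins for real video/text model outputs;
# only the ranking and evaluation logic is illustrated.
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Pretend embeddings: one video embedding and one motion-description embedding
# per clip, both mapped into a shared 256-d space by some video-text model.
num_clips, dim = 100, 256
video_emb = l2_normalize(rng.normal(size=(num_clips, dim)))
text_emb = l2_normalize(video_emb + 0.5 * rng.normal(size=(num_clips, dim)))

# Cosine similarity between every description and every video.
similarity = text_emb @ video_emb.T  # shape: (num_texts, num_videos)

def recall_at_k(sim, k):
    """Fraction of queries whose matching item is ranked in the top k."""
    ranks = (-sim).argsort(axis=1)
    hits = [i in ranks[i, :k] for i in range(sim.shape[0])]
    return float(np.mean(hits))

for k in (1, 5, 10):
    print(f"R@{k}: {recall_at_k(similarity, k):.3f}")
```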
We introduce a method of improving motion understanding in video-text models by utilizing motion descriptions. This method is demonstrated on two action datasets for the motion description retrieval task. The results draw attention to the need for quality captions involving fine-grained motion information in existing datasets and demonstrate the effectiveness of the proposed pipeline in understanding fine-grained motion during video-text retrieval.

Item Understanding and Improving Reliability of Predictive and Generative Deep Learning Models (2024) Kattakinda, Priyatham; Feizi, Soheil; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Deep learning models are prone to acquiring spurious correlations and biases during training and to adversarial attacks during inference. In the context of predictive models, this results in inaccurate predictions that rely on spurious features. Our research delves into this phenomenon specifically for objects placed in uncommon settings, where they are not conventionally found in the real world (e.g., a plane on water or a television in a cave). We introduce the "FOCUS: Familiar Objects in Common and Uncommon Settings" dataset, which aims to stress-test the generalization capabilities of deep image classifiers. By leveraging the power of modern search engines, we deliberately gather data containing objects in common and uncommon settings across a wide range of locations, weather conditions, and times of day. Our comprehensive analysis of popular image classifiers on the FOCUS dataset reveals a noticeable decline in performance when classifying images in atypical scenarios. FOCUS consists only of natural images, which are extremely challenging to collect since, by definition, objects are rarely found in unusual settings. To address this challenge, we introduce an alternative dataset named Diffusion Dreamed Distribution Shifts (D3S). D3S comprises synthetic images generated with Stable Diffusion, utilizing text prompts and image guides derived from placing a sample foreground image onto a background template image. This scalable approach allows us to create 120,000 images featuring objects from all 1000 ImageNet classes set against 10 diverse backgrounds. Due to the incredible photorealism of the diffusion model, our images are much closer to natural images than those in previous synthetic datasets. To alleviate the reliance on spurious features, we propose two methods of learning richer and more robust image representations. In the first approach, we harness the foreground and background labels within D3S to learn a foreground (background) representation resistant to changes in background (foreground). This is achieved by penalizing the mutual information between the foreground (background) features and the background (foreground) labels. We demonstrate the efficacy of these representations by training classifiers on a task with strong spurious correlations. Thus far, our focus has centered on predictive models, scrutinizing the robustness of the learned object representations, particularly when the contextual surroundings are unconventional. In the second approach, we propose to use embeddings of objects and their relationships, extracted using off-the-shelf image segmentation models and text encoders respectively, as input tokens to a transformer. This leads to remarkably richer features that improve performance on downstream tasks such as image retrieval. Large language models are also prone to failures during inference.
Given the widespread use of LLMs, understanding the propensity of these models to fail given adversarial inputs is crucial. To that end, we propose a series of fast adversarial attacks called BEAST that uses beam search to add adversarial tokens to a given input prompt. These attacks induce hallucination, cause the models to jailbreak, and facilitate unintended membership inference from model outputs. Our attacks are fast and are executable in relatively compute-constrained environments.

Item DEEP LEARNING ENSEMBLES FOR LIGHTWEIGHT OBJECT DETECTION (2023) Mattingly, Alexander Singfei; Bhattacharyya, Shuvra S.; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Object detection, the task of identifying and localizing important objects within an image frame, is a critical task in automation, surveillance, and safety applications. Further, developments in lightweight sensor technologies, improved small-scale computing, and the widespread accessibility of well-labeled data have enabled numerous applications for object detection on inexpensive or low-power hardware. Many applications, such as self-driving and unmanned aerial vehicles, must process sensor data as it arrives (in real time) using onboard hardware (at the edge) in order to continually inform systems such as navigation. Additionally, detection must often be achieved on platforms with limited Size, Weight, and Power (SWaP), since it may not be possible to place advanced computing hardware near the sensor. This presents a unique challenge: how can we best provide accurate real-time object detection on limited-SWaP systems while maintaining low power and computational cost?
A widespread approach for detection is deep learning. An object detection network is trained on a labeled dataset of images containing known objects and their locations. After training, the network may be used to infer on new data, providing both bounding boxes and class identifiers for each box. Popular single-shot detectors have been demonstrated to achieve real-time performance on some systems while having acceptable detection accuracy.
An ensemble is a system comprised of several detectors. In theory, detectors with architectural differences, detectors trained on different data, or detectors given differently augmented data at inference time will discover and detect different features of an image. Unifying the results of several different detectors has been demonstrated to improve the detection performance of the ensemble compared to the performance of any component network, at the expense of additional computational cost. Further, systems using an ensemble of detectors have been shown to be good solutions to object detection problems in limited-SWaP applications such as surveillance and search-and-rescue.
Unlike tasks such as classification, where the output of a network describes the entire input, object detection is concerned with both localization and classification of one or multiple objects in an image. Two different bounding boxes for partially occluded objects may overlap, or highly similar bounding boxes may describe the same object. As a result, unifying the results of object detector networks is far more difficult than unifying classifier networks. Current works typically accomplish this by applying strategies that iteratively combine bounding boxes by overlap. However, little comparative study has been done to determine the effectiveness of these approaches.
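For concreteness, the sketch below shows a plain greedy, IoU-based merging step of the kind these overlap-combining strategies build on; it is a generic baseline, not the ensembling method contributed by the thesis.

```python
# Generic IoU-based box merging of the kind used to unify ensemble detections.
# This is a plain greedy NMS sketch, not the thesis's contributed method;
# boxes are (x1, y1, x2, y2, score) tuples from any number of detectors.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def merge_detections(boxes, iou_threshold=0.5):
    """Greedily keep the highest-scoring box and drop overlapping duplicates."""
    boxes = sorted(boxes, key=lambda b: b[4], reverse=True)
    kept = []
    for box in boxes:
        if all(iou(box, k) < iou_threshold for k in kept):
            kept.append(box)
    return kept

# Two detectors reporting slightly different boxes for the same object.
detector_a = [(10, 10, 50, 50, 0.90)]
detector_b = [(12, 11, 52, 49, 0.85), (200, 200, 240, 260, 0.70)]
print(merge_detections(detector_a + detector_b))
```

Alternatives in the literature replace the hard drop with score decay or coordinate averaging; the comparative effectiveness of such strategies is exactly what the text above notes is under-studied.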
This thesis builds on current methods of ensembling object detector networks using novel approaches to combine bounding boxes. We first introduce current methods for ensembling and a dataflow-based framework for efficient, scalable computation of ensembles of detectors. We then contribute a novel method for ensembling and implement a practical system for scalable detection using an elastic neural network.

Item Generalizable Depression Detection and Severity Prediction Using Articulatory Representations of Speech (2022) Seneviratne, Nadee; Espy-Wilson, Carol; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Major Depressive Disorder (MDD) is a mental health disorder that has taken a massive toll on society both socially and financially. Timely diagnosis of MDD is crucial to minimizing serious consequences such as suicide. Hence, automated solutions that can reliably detect and predict the severity of MDD can play a pivotal role in assisting healthcare professionals in providing timely treatments. MDD is known to affect speech. Leveraging the changes in speech characteristics that occur due to depression, many vocal biomarkers are being developed to detect depression. However, changes in articulatory coordination associated with depression remain under-explored. Speech articulation is a complex activity that requires finely timed coordination across articulators. In a depressed state involving psychomotor slowing, this coordination changes and in turn modifies the perceived speech signal. In this work, we use a direct representation of articulation known as vocal tract variables (TVs) to capture the coordination between articulatory gestures. TVs define the constriction degree and location of articulators (tongue, jaw, lips, velum, and glottis). Previously, the correlation structure of formants or mel-frequency cepstral coefficients (MFCCs) was used as a proxy for the underlying articulatory coordination. We compute articulatory coordination features (ACFs), which capture the correlation among time-series data at different time delays and are therefore rich in information about the underlying coordination level of speech production. Using the rank-ordered eigenspectra obtained from TV-based ACFs, we show that depressed speech exhibits simpler coordination relative to the speech of the same subjects when in remission, which is in line with previous findings. By conducting a preliminary study using a small subset of speech from subjects who transitioned from being severely depressed to being in remission, we show that TV-based ACFs outperform formant-based ACFs in binary depression classification. We show that depressed speech has reduced variability in terms of reduced coarticulation and undershoot. To validate this, we present a comprehensive acoustic analysis and the results of a speech-in-noise perception study comparing the intelligibility of depressed speech relative to not-depressed speech. Our results indicate that depressed speech is at least as intelligible as not-depressed speech. The next stage of our work focuses on developing deep learning based models using TV-based ACFs to detect depression and attempts to overcome the limitations of existing work. We combine two speech depression databases with different characteristics, which helps to increase generalizability, a key objective of this research.
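As a simplified illustration of the delay-and-correlate construction behind such ACFs, the sketch below delay-embeds a multichannel signal, forms its correlation matrix, and reads off the rank-ordered eigenspectrum. The signals are random stand-ins for TV trajectories, and the delays and dimensions are illustrative rather than the dissertation's exact configuration.

```python
# Simplified sketch of delay-embedded correlation features and their rank-ordered
# eigenspectrum, in the spirit of the articulatory coordination features (ACFs)
# described above. The signals are random stand-ins for tract variable (TV)
# trajectories, not real speech data, and this is not the exact ACF pipeline.
import numpy as np

rng = np.random.default_rng(1)
num_channels, num_frames = 6, 500            # e.g., six TVs sampled over time
tv = rng.normal(size=(num_channels, num_frames))

def delay_embed(x, delays):
    """Stack delayed copies of every channel: (channels * len(delays), frames)."""
    t = x.shape[1]
    max_d = max(delays)
    return np.vstack([x[:, d : t - max_d + d] for d in delays])

embedded = delay_embed(tv, delays=[0, 3, 7, 15])
corr = np.corrcoef(embedded)                       # channel-delay correlation matrix
eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]  # rank-ordered eigenspectrum
print(eigvals[:5])   # the eigenspectrum serves as a summary of coordination structure
```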
Moreover, we segment audio recordings prior to feature extraction to obtain the data volumes required to train deep neural networks. We reduce the dimensionality of conventional stacked ACFs of multiple delay scales by using refined ACFs, which are carefully curated to remove redundancies, and by exploiting the strengths of dilated Convolutional Neural Networks. We show that models trained on TV-based ACFs are more generalizable than their proxy counterparts. Then we develop a multi-stage convolutional recurrent neural network that performs classification at the session level. We derive the constraints under which this segment-to-session level approach can be used to boost classification performance. We extend our models to perform depression severity level classification. The TV-based ACFs outperform other feature sets in this task as well. Language patterns and semantics can reveal vital information regarding a person's mental state. We develop a multimodal depression classifier that incorporates TV-based ACFs and hierarchical attention-based text embeddings. The fusion strategy of the proposed architecture enables segmenting data from different modalities independently (overlapping segments for audio and sentences for text), in the way best suited to each modality, when performing segment-to-session level classification. The multimodal classifier clearly performs better than the unimodal classifiers. Finally, we develop a multimodal system to predict the depression severity score, which is a more challenging regression problem due to the quasi-numerical nature of the scores. The multimodal regressor achieves the lowest root mean squared error, showing the synergy of combining modalities such as audio and text. We perform an exhaustive error analysis that reveals potential improvements to be made in the future. The work in this dissertation takes a step towards the betterment of humanity by exploring the development of technologies that improve the performance of speech-based depression assessment, utilizing the strengths of ACFs derived from direct articulatory representations.

Item Wavefront Shaping in a Complex Reverberant Environment with a Binary Tunable Metasurface (2021) Frazier, Benjamin West; Antonsen, Thomas M.; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Electromagnetic environments are becoming increasingly complex and congested, creating a growing challenge for systems that rely on electromagnetic waves for communication, sensing, or imaging. The use of intelligent, reconfigurable metasurfaces provides a potential means for achieving a radio environment that is capable of directing propagating waves to optimize wireless channels on demand, ensuring reliable operation and protecting sensitive electronic components. The capability to isolate or reject unwanted signals in order to mitigate vulnerabilities is critical for any practical application. In the first part of this dissertation, I describe the use of a binary programmable metasurface to (i) control the spatial degrees of freedom for waves propagating inside an electromagnetic cavity and demonstrate the ability to create nulls in the transmission coefficient between selected ports; and (ii) create the conditions for coherent perfect absorption. Both objectives are achieved at arbitrary frequencies.
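Before the specific algorithm is described below, the following sketch conveys the basic search problem: toggling binary metasurface elements to drive a transmission measurement toward a null. The measurement function is a synthetic stand-in, and the random-flip hill climb is only a generic baseline, not the stochastic optimization algorithm presented in the dissertation.

```python
# Generic hill-climbing sketch of searching over binary metasurface states to
# minimize a transmission coefficient |S21| at a target frequency. The measurement
# is a synthetic stand-in for a cavity response; the dissertation's actual
# stochastic optimization algorithm (described next) is more sophisticated.
import numpy as np

rng = np.random.default_rng(2)
num_elements = 240                      # binary (on/off) metasurface elements

def measure_s21(state):
    """Synthetic cavity response: a fixed complex linear map plus a constant."""
    idx = np.arange(num_elements)
    weights = np.sin(0.1 * idx) + 1j * np.cos(0.07 * idx)
    return abs(np.dot(weights, state) + 5.0)   # |S21|-like scalar at one frequency

state = rng.integers(0, 2, size=num_elements)
best = measure_s21(state)
for _ in range(2000):
    flip = rng.integers(num_elements)
    state[flip] ^= 1                    # toggle one element
    score = measure_s21(state)
    if score < best:
        best = score                    # keep the flip: deeper transmission null
    else:
        state[flip] ^= 1                # revert the flip
print(f"final |S21| proxy: {best:.3f}")
```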
In the first case, a novel and effective stochastic optimization algorithm is presented that selectively generates coldspots over a single frequency band or simultaneously over multiple frequency bands. I show that this algorithm is successful with multiple input port configurations and varying optimization bandwidths. In the second case, I show how this technique can be used to establish a multi-port coherent perfect absorption state for the cavity. In the second part of this dissertation, I introduce a technique that combines a deep learning network with a binary programmable metasurface to shape waves in complex electromagnetic environments, in particular ones where there is no direct line of sight. I applied this technique to wavefront reconstruction and accurately determined metasurface configurations based on measured system scattering responses in a chaotic microwave cavity. The state of the metasurface that realizes desired electromagnetic wave field distribution properties was successfully determined even in cases previously unseen by the deep learning algorithm. My technique is enabled by the reverberant nature of the cavity, and is effective with a metasurface that covers only ~1.5% of the total cavity surface area.

Item Impact Of Semantics, Physics And Adversarial Mechanisms In Deep Learning (2020) Kavalerov, Ilya; Chellappa, Rama; Czaja, Wojciech; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Deep learning has greatly advanced the performance of algorithms on tasks such as image classification, speech enhancement, sound separation, and generative image models. However, many current popular systems are driven by empirical rules that do not fully exploit the underlying physics of the data. Many speech and audio systems fix the STFT preprocessing in front of their networks. Hyperspectral image (HSI) methods often do not deliberately consider the spectral-spatial trade-off that is not present in normal images. Generative Adversarial Networks (GANs) that learn a generative distribution of images do not prioritize the semantic labels of the training data. To meet these opportunities, we propose to alter known deep learning methods to be more dependent on the semantic and physical underpinnings of the data, creating better-performing and more robust algorithms for sound separation and classification, image generation, and HSI segmentation. Our approaches take inspiration from harmonic analysis, SVMs, and classical statistical detection theory, and further the state of the art in source separation, defense against audio adversarial attacks, HSI classification, and GANs.
Recent deep learning approaches have achieved impressive performance on speech enhancement and separation tasks. However, these approaches have not been investigated for separating mixtures of arbitrary sounds of different types, a task we refer to as universal sound separation. To study this question, we develop a dataset of mixtures containing arbitrary sounds, and use it to investigate the space of mask-based separation architectures, varying both the overall network architecture and the framewise analysis-synthesis basis for signal transformations. We compare a short-time Fourier transform (STFT) against a learnable basis at variable window sizes for the feature extraction stage of our sound separation network.
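The sketch below shows the fixed STFT analysis-mask-synthesis pipeline that such a comparison treats as the baseline front-end; the mask is a trivial placeholder rather than a trained separation network, and all parameters are illustrative.

```python
# Skeleton of an STFT front-end for mask-based source separation, illustrating the
# fixed analysis-synthesis basis that is compared against learnable bases above.
# The "separator" here is a trivial placeholder mask, not a trained network.
import numpy as np
from scipy.signal import stft, istft

rng = np.random.default_rng(3)
fs = 16000
mixture = rng.normal(size=fs * 2)                 # 2 s of stand-in audio

# Analysis: fixed STFT basis (the window size is the knob being compared).
f, t, spec = stft(mixture, fs=fs, nperseg=512, noverlap=384)

# A mask-based separator would predict one mask per source from |spec|;
# here we just use a placeholder ratio-style mask in [0, 1].
mask = np.clip(np.abs(spec) / (np.abs(spec).max() + 1e-9), 0.0, 1.0)

# Synthesis: apply the mask and invert with the same fixed basis.
_, estimate = istft(spec * mask, fs=fs, nperseg=512, noverlap=384)
print(mixture.shape, estimate.shape)
```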
We also compare the robustness to adversarial examples of speech classification networks that similarly hybridize established time-frequency (TF) methods with learnable filter weights. We analyze HSIs for material classification. For hyperspectral image cubes, TF methods decompose spectra into multi-spectral bands, while neural networks (NNs) incorporate spatial information across scales and model multiple levels of dependencies between spectral features. The Fourier scattering transform is an amalgamation of time-frequency representations with neural network architectures. We propose and test a three-dimensional Fourier scattering method on hyperspectral datasets, and present results indicating that the Fourier scattering transform is highly effective at representing spectral data when compared with other state-of-the-art methods. We study the spectral-spatial trade-off that our scattering approach allows. We also use a similar multi-scale approach to develop a defense against audio adversarial attacks. We propose a unification of a computational model of speech processing in the brain with commercial wake-word networks to create a cortical network, and show that it can increase resistance to adversarial noise without a degradation in performance.
Generative Adversarial Networks are an attractive approach to constructing generative models that mimic a target distribution, and typically use conditional information (cGANs) such as class labels to guide the training of the discriminator and the generator. We propose a loss that ensures generator updates are always class-specific: rather than training a function that measures the information-theoretic distance between the generative distribution and one target distribution, we generalize the successful hinge loss that has become an essential ingredient of many GANs to the multi-class setting and use it to train a single generator-classifier pair. While the canonical hinge loss makes generator updates according to a class-agnostic margin learned by a real/fake discriminator, our multi-class hinge-loss GAN updates the generator according to many classification margins. With this modification, we are able to accelerate training and achieve state-of-the-art Inception and FID scores on ImageNet128. We study the trade-off between class fidelity and overall diversity of generated images, and show that modifications of our method can prioritize either during training. We show that there is a limit to how closely classification and discrimination can be combined while maintaining sample diversity, with some theoretical results on K+1 GANs.

Item DEEP LEARNING FOR FORENSICS (2020) Zhou, Peng; Davis, Larry; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
The advent of media sharing platforms and the easy availability of advanced photo and video editing software have resulted in a large quantity of manipulated images and videos being shared on the internet. While the intent behind such manipulations varies widely, concern about the spread of fake news and misinformation is growing. Therefore, detecting manipulation has become an emerging necessity. Different from traditional classification, semantic object detection, or segmentation, manipulation detection/classification pays more attention to low-level tampering artifacts than to semantic content.
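One classic way to surface such low-level artifacts, and one that reappears later in this abstract as an auxiliary network input, is Error Level Analysis (ELA). The sketch below is a generic ELA routine with a hypothetical file path, not the dissertation's detection pipeline.

```python
# Generic Error Level Analysis (ELA) sketch: recompress an image at a known JPEG
# quality and inspect the residual. Regions edited after the original compression
# tend to leave a different error level. The file path and quality are illustrative.
from PIL import Image, ImageChops
import io

def error_level_analysis(path, quality=90):
    original = Image.open(path).convert("RGB")
    buffer = io.BytesIO()
    original.save(buffer, format="JPEG", quality=quality)   # recompress once
    buffer.seek(0)
    recompressed = Image.open(buffer)
    residual = ImageChops.difference(original, recompressed)
    # Scale the residual so faint differences become visible.
    extrema = residual.getextrema()
    max_diff = max(channel_max for _, channel_max in extrema) or 1
    return residual.point(lambda px: min(255, px * 255 // max_diff))

# ela = error_level_analysis("suspect.jpg")   # hypothetical input file
# ela.save("suspect_ela.png")
```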
The main challenges in this problem include (a) investigating features that reveal tampering artifacts, (b) developing generic models that are robust to a wide range of post-processing methods, (c) applying algorithms to higher resolutions in real scenarios, and (d) handling newly emerging manipulation techniques. In this dissertation, we propose approaches to tackling these challenges. Manipulation detection utilizes both low-level tampering artifacts and semantic content, suggesting that richer features need to be harnessed to reveal more evidence. To learn rich features, we propose a two-stream Faster R-CNN network and train it end-to-end to detect the tampered regions in a manipulated image. Experiments on four standard image manipulation datasets demonstrate that our two-stream framework outperforms each individual stream, and also achieves state-of-the-art performance compared to alternative methods, with robustness to resizing and compression. Additionally, to extend manipulation detection from images to videos, we introduce VIDNet, the Video Inpainting Detection Network, which contains an encoder-decoder architecture with a quad-directional local attention module. To reveal artifacts encoded in compression, VIDNet additionally takes in Error Level Analysis (ELA) frames to augment RGB frames, producing multimodal features at different levels with an encoder. Moreover, to improve the generalization of the manipulation detection model, we introduce a manipulated-image generation process that creates true positives using currently available datasets. Drawing from traditional work on image blending, we propose a novel generator for creating such examples. We also propose to create examples that force the algorithm to focus on boundary artifacts during training. Extensive experimental results validate our proposal. Furthermore, to apply deep learning models to high-resolution scenarios efficiently, we treat the problem as mask refinement given a coarse low-resolution prediction. We propose to convert the regions of interest into strip images and compute a boundary prediction in the strip domain. Extensive experiments on both public datasets and a newly created high-resolution dataset strongly validate our approach. Finally, to handle newly emerging manipulation techniques while preserving performance on learned manipulations, we investigate incremental learning. We propose a multi-model and multi-level knowledge distillation strategy to preserve performance on old categories while training on new categories. Experiments on standard incremental learning benchmarks show that our method improves overall performance over standard distillation techniques.

Item FACIAL EXPRESSION RECOGNITION AND EDITING WITH LIMITED DATA (2020) Ding, Hui; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Over the past five years, methods based on deep features have taken over the computer vision field. While dramatic performance improvements have been achieved for tasks such as face detection and verification, these methods usually need large amounts of annotated data. In practice, not all computer vision tasks have access to large amounts of annotated data; facial expression analysis is one such task. In this dissertation, we focus on facial expression recognition and editing problems with small datasets.
In addition, to cope with challenging conditions like pose and occlusion, we also study unaligned facial attribute detection and occluded expression recognition. This dissertation is divided into four parts.
In the first part, we present FaceNet2ExpNet, a novel idea for training a lightweight, high-accuracy classification model for expression recognition with small datasets. We first propose a new distribution function to model the high-level neurons of the expression network. Based on this, a two-stage training algorithm is carefully designed. In the pre-training stage, we train the convolutional layers of the expression net, regularized by the face net; in the refining stage, we append fully connected layers to the pre-trained convolutional layers and train the whole network jointly. Visualization shows that the model trained with our method captures improved high-level expression semantics. Evaluations on four public expression databases demonstrate that our method achieves better results than the state of the art.
In the second part, we focus on robust facial expression recognition under occlusion and propose a landmark-guided attention branch to find and discard corrupted feature elements from recognition. An attention map is first generated to indicate whether a specific facial part is occluded and to guide our model to attend to the non-occluded regions. To further increase robustness, we propose a facial region branch that partitions the feature maps into non-overlapping facial blocks and enforces each block to predict the expression independently. Owing to the synergistic effect of the two branches, our occlusion-adaptive deep network significantly outperforms state-of-the-art methods on two challenging in-the-wild benchmark datasets and three real-world occluded expression datasets.
In the third part, we propose a cascade network that simultaneously learns to localize face regions specific to attributes and performs attribute classification without alignment. First, a weakly supervised face region localization network is designed to automatically detect regions (or parts) specific to attributes. Then multiple part-based networks and a whole-image-based network are separately constructed and combined by the region switch layer and the attribute relation layer for final attribute classification. A multi-net learning method and hint-based model compression are further proposed to obtain an effective localization model and a compact classification model, respectively. Our approach achieves significantly better performance than state-of-the-art methods on the unaligned CelebA dataset, reducing the classification error by 30.9%.
In the final part of this dissertation, we propose an Expression Generative Adversarial Network (ExprGAN) for photo-realistic facial expression editing with controllable expression intensity. An expression controller module is specially designed to learn an expressive and compact expression code in addition to the encoder-decoder network. This novel architecture enables the expression intensity to be continuously adjusted from low to high. We further show that our ExprGAN can be applied to other tasks, such as expression transfer, image retrieval, and data augmentation for training improved facial expression recognition models. To tackle the small size of the training database, an effective incremental learning scheme is proposed.
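As a toy illustration of the controllable-intensity idea (not the ExprGAN architecture itself), the sketch below concatenates an image embedding with an expression code scaled by a continuous intensity value before decoding; all layer sizes and names are arbitrary.

```python
# Minimal sketch of an expression-controllable encoder-decoder: an image embedding
# is concatenated with an expression code whose intensity can be varied continuously
# before decoding. This toy model only illustrates the concept, not ExprGAN itself.
import torch
import torch.nn as nn

class ToyExpressionEditor(nn.Module):
    def __init__(self, img_dim=64 * 64, latent_dim=128, num_expressions=6, code_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(img_dim, latent_dim), nn.ReLU())
        # One small learned code per expression class; intensity scales the code.
        self.expression_codes = nn.Embedding(num_expressions, code_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim + code_dim, img_dim), nn.Tanh())

    def forward(self, image, expression_id, intensity):
        z = self.encoder(image)
        code = self.expression_codes(expression_id) * intensity.unsqueeze(1)
        out = self.decoder(torch.cat([z, code], dim=1))
        return out.view(image.shape)

model = ToyExpressionEditor()
faces = torch.rand(4, 1, 64, 64)                       # stand-in face crops
ids = torch.tensor([0, 1, 2, 3])                       # expression labels
low = model(faces, ids, torch.full((4,), 0.2))         # subtle expression
high = model(faces, ids, torch.full((4,), 1.0))        # strong expression
print(low.shape, high.shape)
```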
Quantitative and qualitative evaluations on the widely used Oulu-CASIA dataset demonstrate the effectiveness of ExprGAN.

Item Deep Learning with Constraints and Priors for Improved Subject Clustering, Medical Imaging, and Robust Inference (2020) Lin, Wei-An; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Deep neural networks (DNNs) have achieved significant success in several fields including computer vision, natural language processing, and robot control. The common philosophy behind these successes is the use of large amounts of annotated data and end-to-end networks, with task-specific constraints and priors implicitly incorporated into the trained model without the need for careful feature engineering. However, DNNs have been shown to be vulnerable to distribution shifts and adversarial perturbations, which indicates that such implicit priors and constraints are not sufficient for real-world applications. In this dissertation, we target three applications and design task-specific constraints and priors for improved performance of deep neural networks.
We first study the problem of subject clustering, the task of grouping face images of the same person together. We propose to utilize the prior structure in the feature space of DNNs trained for face identification to design a novel clustering algorithm. Specifically, the clustering algorithm exploits the local neighborhood structure of deep representations via exemplar-based learning based on k-nearest neighbors (k-NN). Extensive experiments show promising results for grouping face images according to subject identity. As an example, we apply the proposed clustering algorithm to automatically curate a large-scale face dataset with noisy labels and show that the performance of face recognition DNNs can be significantly improved by training on the curated dataset. Furthermore, we empirically find that the k-NN rule does not capture proper local structures for deep representations when each subject has very few face images. We then propose to improve upon the exemplar-based approach with a density-aware similarity measure and theoretically show its asymptotic convergence to a density estimator. We conduct experiments on challenging face datasets that show promising results.
Second, we study the problem of metal artifact reduction in computed tomography (CT). Unlike typical image restoration tasks such as super-resolution and denoising, metal artifacts in CT images are structured and non-local. Conventional DNNs do not generalize well when metal implants with unseen shapes are present. We find that the imaging process of CT induces a data-consistency prior that can be exploited for image enhancement. Based on this observation, we propose a dual-domain learning approach to CT metal artifact reduction. We design and implement a novel Radon inversion layer that allows gradients in the image domain to be backpropagated to the projection domain. Experiments conducted on both simulated and clinical datasets show promising results. Compared to conventional DNN-based models, the proposed dual-domain approach leads to impressive metal artifact reduction and has improved generalization capability.
Finally, we study the problem of robust classification. In the past few years, the vulnerability of DNNs to small imperceptible perturbations has been widely studied, which raises concerns about the security and robustness of DNNs against possible threat models.
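To make the threat concrete, the sketch below applies the standard fast gradient sign method (FGSM), a single signed-gradient step bounded by a small epsilon, to an untrained stand-in classifier; it illustrates the generic notion of an imperceptible perturbation rather than any specific attack studied in the dissertation.

```python
# Standard FGSM sketch illustrating a small, imperceptible adversarial perturbation.
# The classifier is an untrained stand-in; only the attack mechanics are shown.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))   # stand-in classifier
loss_fn = nn.CrossEntropyLoss()

image = torch.rand(1, 1, 28, 28, requires_grad=True)
label = torch.tensor([3])

loss = loss_fn(model(image), label)
loss.backward()

epsilon = 0.03                                    # perturbation budget (L-infinity)
adversarial = (image + epsilon * image.grad.sign()).clamp(0, 1).detach()
print((adversarial - image).abs().max().item())   # bounded by epsilon: visually imperceptible
```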
To defend against such threat models, Samangouei et al. proposed DefenseGAN, a preprocessing approach that removes adversarial perturbations by projecting the input images onto the learned data prior. However, the projection operation in DefenseGAN is time-consuming and may not yield proper reconstructions when images have complicated textures. We propose an inversion network to constrain the initial estimates of the latent code for input images. With the proposed constraint, the number of optimization steps in DefenseGAN can be reduced while achieving improved accuracy and robustness. Furthermore, we conduct empirical studies on attack methods that have claimed to break DefenseGAN, which show that on-manifold robustness might be the key factor for ensuring adversarial robustness.

Item Augmented Deep Representations for Unconstrained Still/Video-based Face Recognition (2019) Zheng, Jingxiao; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Face recognition is one of the active areas of research in computer vision and biometrics. Many approaches have been proposed in the literature that demonstrate impressive performance, especially those based on deep learning. However, unconstrained face recognition with large pose, illumination, occlusion, and other variations is still an unsolved problem. Unconstrained video-based face recognition is even more challenging due to the large volume of data to be processed, the lack of labeled training data, and significant intra- and inter-video variations in scene, blur, video quality, etc. Although Deep Convolutional Neural Networks (DCNNs) have provided discriminative representations for faces and achieved performance surpassing humans in controlled scenarios, modifications are necessary for face recognition in unconstrained conditions. In this dissertation, we propose several methods that improve unconstrained face recognition performance by augmenting the representation provided by deep networks using correlation or contextual information in the data.
For unconstrained still face recognition, we present an encoding approach that combines Fisher vector (FV) encoding and DCNN representations, called FV-DCNN. The feature maps from the last convolutional layer in the deep network are encoded by FV into a robust representation, which utilizes the correlation between facial parts within each face. A VLAD-based encoding method called VLAD-DCNN is also proposed as an extension. Extensive evaluations on three challenging face recognition datasets show that the proposed FV-DCNN and VLAD-DCNN perform comparably to or better than many state-of-the-art face verification methods.
For the more challenging video-based face recognition task, we first propose an automatic system and model the video-to-video similarity as a subspace-to-subspace similarity, where the subspaces characterize the correlation between deep representations of faces in videos. In the system, a quality-aware subspace-to-subspace similarity is introduced, where subspaces are learned using quality-aware principal component analysis. Subspaces, along with quality-aware exemplars of templates, are used to produce similarity scores between video pairs via a quality-aware, principal-angle-based subspace-to-subspace similarity metric. The method is evaluated on four video datasets, and the experimental results demonstrate its superior performance.
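The sketch below computes the cosines of the principal angles between two feature subspaces, the basic quantity underlying such subspace-to-subspace similarity; the deep features are random stand-ins and the quality weighting described above is omitted.

```python
# Sketch of principal angles between two subspaces, the quantity underlying
# subspace-to-subspace similarity for video face templates. The "deep features"
# are random stand-ins; the quality-aware weighting is not modeled here.
import numpy as np

rng = np.random.default_rng(4)

def subspace_basis(features, rank=5):
    """Orthonormal basis spanning the row-wise deep features (frames x dim)."""
    u, _, _ = np.linalg.svd(features.T, full_matrices=False)
    return u[:, :rank]

video_a = rng.normal(size=(40, 256))      # 40 frames of 256-d face features
video_b = video_a[:30] + 0.1 * rng.normal(size=(30, 256))

qa, qb = subspace_basis(video_a), subspace_basis(video_b)
cosines = np.linalg.svd(qa.T @ qb, compute_uv=False)   # cosines of principal angles
similarity = float(np.mean(np.clip(cosines, 0.0, 1.0)))
print(f"subspace-to-subspace similarity: {similarity:.3f}")
```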
To utilize the temporal information in videos, a hybrid dictionary learning method is also proposed for video-based face recognition. This unsupervised approach effectively models the temporal correlation between deep representations of video faces using dynamical dictionaries. A practical iterative optimization algorithm is introduced to learn the dynamical dictionary. Experiments on three video-based face recognition datasets demonstrate that the proposed method can effectively learn robust and discriminative representations for videos and improve face recognition performance. Finally, to leverage contextual information in videos, we present the Uncertainty-Gated Graph (UGG) for unconstrained video-based face recognition. It utilizes contextual information between faces by conducting graph-based identity propagation between sample tracklets, where identity information is initialized by the deep representations of video faces. UGG explicitly models the uncertainty of the contextual connections between tracklets by adaptively updating the weights of the edge gates according to the identity distributions of the nodes during inference. UGG is a generic graphical model that can be applied at inference time only or with end-to-end training. We demonstrate the effectiveness of UGG with state-of-the-art results on the recently released and challenging Cast Search in Movies and IARPA Janus Surveillance Video Benchmark datasets.
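For intuition about the propagation step such a graph performs, the toy sketch below spreads identity distributions from confidently labeled tracklets to their neighbors through fixed edge weights; UGG's adaptive uncertainty gating and end-to-end training are not modeled here.

```python
# Toy sketch of graph-based identity propagation between tracklets: labeled nodes
# spread their identity distributions to neighbors through weighted edges. This
# only illustrates the propagation step, not UGG's adaptive edge gating.
import numpy as np

# 5 tracklets, 3 identities; rows are per-tracklet identity distributions.
# Tracklets 0 and 4 are confidently labeled, the rest start uncertain.
beliefs = np.full((5, 3), 1.0 / 3)
beliefs[0] = [1.0, 0.0, 0.0]
beliefs[4] = [0.0, 0.0, 1.0]

# Symmetric edge weights, e.g. from co-occurrence or feature similarity.
adjacency = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

labeled = {0, 4}
for _ in range(20):
    spread = adjacency @ beliefs
    spread /= spread.sum(axis=1, keepdims=True)
    for node in range(len(beliefs)):
        if node not in labeled:          # clamp labeled tracklets to their identity
            beliefs[node] = spread[node]
print(np.round(beliefs, 2))
```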