Theses and Dissertations from UMD

Permanent URI for this community: http://hdl.handle.net/1903/2

New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a four-month delay before a given thesis/dissertation appears in DRUM.

More information is available at Theses and Dissertations at University of Maryland Libraries.

Search Results

Now showing 1 - 5 of 5
  • Leveraging Deep Generative Models for Estimation and Recognition
    (2023) PNVR, Koutilya; Jacobs, David W.; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Generative models are a class of statistical models that estimate the joint probability distribution over an observed variable and a target variable. In computer vision, generative models are typically used to model the joint probability distribution of a set of real image samples assumed to lie on a complex, high-dimensional image manifold. Recently proposed deep generative architectures such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and diffusion models (DMs) have been shown to generate photo-realistic images of human faces and other objects. These generative models have also become popular for other generative tasks such as image editing and text-to-image synthesis. As appealing as the perceptual quality of the generated images has become, the use of generative models for discriminative tasks such as visual recognition or geometry estimation has not been well studied. Moreover, with different kinds of powerful generative models gaining popularity, it is important to study their significance in other areas of computer vision. In this dissertation, we demonstrate the advantages of using generative models for applications that go beyond photo-realistic image generation: unsupervised domain adaptation (UDA) between synthetic and real datasets for geometry estimation, and text-based image segmentation for recognition. In the first half of the dissertation, we propose a novel generative UDA method for combining synthetic and real images when training networks to determine geometric information from a single image. Specifically, we use a GAN model to map both the synthetic and real domains into a shared image space by translating only the domain-specific, task-related information from the respective domains. This is connected to a primary network for end-to-end training. Ideally, this results in images from the two domains that present shared information to the primary network. Compared to previous approaches, we demonstrate improved domain-gap reduction and much better generalization between synthetic and real data for geometry estimation tasks such as monocular depth estimation and face normal estimation. In the second half of the dissertation, we showcase the power of a recent class of generative models for improving an important recognition task: text-based image segmentation. Specifically, large-scale pre-training tasks like image classification, captioning, or self-supervised techniques do not incentivize learning the semantic boundaries of objects. However, recent generative foundation models built using text-based latent diffusion techniques may learn semantic boundaries, because they must synthesize intricate details about all objects in an image from a text description. Therefore, we present a technique for segmenting real and AI-generated images using latent diffusion models (LDMs) trained on internet-scale datasets. First, we show that the latent space of LDMs (z-space) is a better input representation than other feature representations, such as RGB images or CLIP encodings, for text-based image segmentation. By training the segmentation models on the latent z-space, which provides a compressed representation shared across several domains such as different forms of art, cartoons, illustrations, and photographs, we are also able to bridge the domain gap between real and AI-generated images.
We show that the internal features of LDMs contain rich semantic information and present a technique in the form of LD-ZNet to further boost the performance of text-based segmentation. Overall, we show up to 6% improvement over standard baselines for text-to-image segmentation on natural images. For AI-generated imagery, we show close to 20% improvement compared to state-of-the-art techniques.
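The segmentation approach described above operates on the compressed latent (z-space) of a latent diffusion model rather than on RGB pixels. As a rough illustration of that idea only (not the dissertation's actual LD-ZNet architecture), the following minimal PyTorch sketch shows a hypothetical segmentation head that consumes a 4-channel LDM latent together with a pooled text embedding; all module names, channel counts, and shapes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextConditionedSegHead(nn.Module):
    """Hypothetical segmentation head operating on LDM latents (z-space).

    Assumes a 4-channel latent (typical for latent diffusion models) and a
    pooled text embedding; names and sizes are illustrative only.
    """
    def __init__(self, latent_ch=4, text_dim=512, hidden=64):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.net = nn.Sequential(
            nn.Conv2d(latent_ch + hidden, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, 1),  # per-pixel logit for the referred object
        )

    def forward(self, z, text_emb):
        # Broadcast the projected text embedding over the spatial grid and
        # concatenate it channel-wise with the latent features.
        b, _, h, w = z.shape
        t = self.text_proj(text_emb).view(b, -1, 1, 1).expand(b, -1, h, w)
        return self.net(torch.cat([z, t], dim=1))

# Toy usage: in practice z would come from a pretrained LDM's VAE encoder and
# text_emb from its text encoder (e.g., CLIP); random tensors stand in here.
z = torch.randn(2, 4, 64, 64)      # latent for a 512x512 image (8x downsampled)
text_emb = torch.randn(2, 512)     # pooled embedding of the query phrase
mask_logits = TextConditionedSegHead()(z, text_emb)  # (2, 1, 64, 64)
```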
  • Impact Of Semantics, Physics And Adversarial Mechanisms In Deep Learning
    (2020) Kavalerov, Ilya; Chellappa, Rama; Czaja, Wojciech; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Deep learning has greatly advanced the performance of algorithms on tasks such as image classification, speech enhancement, sound separation, and generative image modeling. However, many current popular systems are driven by empirical rules that do not fully exploit the underlying physics of the data. Many speech and audio systems fix an STFT preprocessing stage before their networks. Hyperspectral Image (HSI) methods often do not deliberately consider the spectral-spatial trade-off that is not present in ordinary images. Generative Adversarial Networks (GANs) that learn a generative distribution of images do not prioritize the semantic labels of the training data. To address these opportunities, we propose to alter known deep learning methods to depend more on the semantic and physical underpinnings of the data, creating better-performing and more robust algorithms for sound separation and classification, image generation, and HSI segmentation. Our approaches take inspiration from harmonic analysis, SVMs, and classical statistical detection theory, and advance the state of the art in source separation, defense against audio adversarial attacks, HSI classification, and GANs. Recent deep learning approaches have achieved impressive performance on speech enhancement and separation tasks. However, these approaches have not been investigated for separating mixtures of arbitrary sounds of different types, a task we refer to as universal sound separation. To study this question, we develop a dataset of mixtures containing arbitrary sounds and use it to investigate the space of mask-based separation architectures, varying both the overall network architecture and the framewise analysis-synthesis basis for signal transformations. We compare using a short-time Fourier transform (STFT) versus a learnable basis at variable window sizes for the feature extraction stage of our sound separation network. We also compare the robustness to adversarial examples of speech classification networks that similarly hybridize established time-frequency (TF) methods with learnable filter weights. We analyze HSIs for material classification. For hyperspectral image cubes, TF methods decompose spectra into multi-spectral bands, while Neural Networks (NNs) incorporate spatial information across scales and model multiple levels of dependencies between spectral features. The Fourier scattering transform is an amalgamation of time-frequency representations with neural network architectures. We propose and test a three-dimensional Fourier scattering method on hyperspectral datasets, and present results indicating that the Fourier scattering transform is highly effective at representing spectral data compared with other state-of-the-art methods. We study the spectral-spatial trade-off that our scattering approach allows. We also use a similar multi-scale approach to develop a defense against audio adversarial attacks. We propose a unification of a computational model of speech processing in the brain with commercial wake-word networks to create a cortical network, and show that it can increase resistance to adversarial noise without a degradation in performance. Generative Adversarial Networks are an attractive approach to constructing generative models that mimic a target distribution, and they typically use conditional information (cGANs) such as class labels to guide the training of the discriminator and the generator.
We propose a loss that ensures generator updates are always class-specific. Rather than training a function that measures the information-theoretic distance between the generative distribution and a single target distribution, we generalize the successful hinge loss that has become an essential ingredient of many GANs to the multi-class setting and use it to train a single generator-classifier pair. While the canonical hinge loss updates the generator according to a class-agnostic margin learned by a real/fake discriminator, our multi-class hinge-loss GAN updates the generator according to many classification margins. With this modification, we are able to accelerate training and achieve state-of-the-art Inception and FID scores on Imagenet128. We study the trade-off between class fidelity and overall diversity of generated images, and show that modifications of our method can prioritize either one during training. We show that there is a limit to how closely classification and discrimination can be combined while maintaining sample diversity, with some theoretical results on K+1 GANs.
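To make the hinge-loss generalization concrete, here is one plausible PyTorch sketch of a multi-class hinge loss for a discriminator with K real classes plus one "fake" class; it follows the Crammer-Singer form of the multi-class hinge and is an illustrative reading of the abstract, not necessarily the exact loss used in the dissertation.

```python
import torch
import torch.nn.functional as F

def multiclass_hinge(logits, target, margin=1.0):
    """Crammer-Singer style multi-class hinge loss (a sketch).

    Encourages the target-class score to exceed every other score by `margin`.
    """
    target_score = logits.gather(1, target.unsqueeze(1))        # (B, 1)
    margins = F.relu(margin + logits - target_score)            # (B, K+1)
    mask = F.one_hot(target, logits.size(1)).bool()
    margins = margins.masked_fill(mask, 0.0)                    # ignore the target column
    return margins.max(dim=1).values.mean()

def d_loss(d_logits_real, y_real, d_logits_fake, num_classes):
    """Discriminator sees K real classes plus one extra 'fake' class (index K)."""
    fake_label = torch.full((d_logits_fake.size(0),), num_classes,
                            dtype=torch.long, device=d_logits_fake.device)
    return multiclass_hinge(d_logits_real, y_real) + multiclass_hinge(d_logits_fake, fake_label)

def g_loss(d_logits_fake, y_fake):
    """Generator updates follow per-class margins rather than a single
    real/fake margin: push each generated sample toward its intended class."""
    return multiclass_hinge(d_logits_fake, y_fake)
```

In this reading, the canonical real/fake hinge GAN is recovered when K = 1; the multi-class version simply replaces the single discrimination margin with one classification margin per class.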
  • DEEP INFERENCE ON MULTI-SENSOR DATA
    (2019) Ghosh, Arthita; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Computer vision-based intelligent autonomous systems engage various types of sensors to perceive the world they navigate. Vision systems perceive their environments through inferences on entities (structures, humans) and their attributes (pose, shape, materials) that are sensed using RGB and Near-InfraRed (NIR) cameras, LAser Detection And Ranging (LADAR), radar, and so on. This leads to challenging and interesting problems in efficient data capture, feature extraction, and attribute estimation, not only for RGB but for various other sensors. In some cases, we encounter very limited amounts of labeled training data. In certain other scenarios we have sufficient data, but annotations are unavailable for supervised learning. This dissertation explores two approaches to learning under conditions of minimal to no ground truth. The first approach applies projections on training data that make learning efficient by improving training dynamics. The first and second topics in this dissertation belong to this category. The second approach makes learning without ground truth possible via knowledge transfer from a labeled source domain to an unlabeled target domain through projections to domain-invariant shared latent spaces. The third and fourth topics in this dissertation belong to this category. For the first topic, we study the feasibility and efficacy of identifying shapes in LADAR data in several measurement modes. We present results on efficient parameter learning with less data (for both traditional machine learning and deep models) on LADAR images. We use a LADAR apparatus to obtain range information from a 3-D scene by emitting laser beams and collecting the rays reflected from target objects in the region of interest. The Agile Beam LADAR concept makes the measurement and interpretation process more efficient using a software-defined architecture that leverages computational imaging principles. Using these techniques, we show that object identification and scene understanding can be accurately performed in the LADAR measurement domain, thereby rendering pixel-based scene reconstruction superfluous. Next, we explore the effectiveness of deep features extracted by Convolutional Neural Networks (CNNs) in the Discrete Cosine Transform (DCT) domain for various image classification tasks such as pedestrian and face detection, material identification, and object recognition. We perform the DCT operation on the feature maps generated by convolutional layers in CNNs. We compare the performance of the same network, with the same hyper-parameters, with and without the DCT step. Our results indicate that a DCT operation incorporated into the network after the first convolution layer can have certain advantages, such as convergence over fewer training epochs and sparser weight matrices that are more conducive to pruning and hashing techniques. Next, we present an adversarial deep domain adaptation (ADA)-based approach for training deep neural networks that fit 3D meshes on humans in monocular RGB input images. Estimating a 3D mesh from a 2D image is helpful in harvesting complete 3D information about body pose and shape. However, learning such an estimation task in a supervised way is challenging because ground-truth 3D mesh parameters for real humans do not exist. We propose a domain adaptation-based, single-shot (no re-projection, no iterative refinement), end-to-end training approach with joint optimization on real and synthetic images on a shared common task.
Through joint inference on real and synthetic data, the network extracts domain-invariant features that are further used to estimate the 3D mesh parameters in a single shot with no supervision on real samples. While we compute a regression loss on synthetic samples with ground-truth mesh parameters, knowledge is transferred from synthetic to real data through ADA without direct ground truth for supervision. Finally, we propose a partially supervised method for satellite image super-resolution by learning a unified representation of samples from different domains (captured by different sensors) in a shared latent space. The training samples are drawn from two datasets, which we refer to as the source and target domains. The source domain consists of fewer samples, which are of higher resolution and contain very detailed and accurate annotations. In contrast, samples from the target domain are low-resolution, and the available ground truth is sparse. The pipeline consists of a feature extractor and a super-resolving module which are trained end-to-end. Using a deep feature extractor, we jointly learn (on the two datasets) a common embedding space for all samples. Partial supervision is available for the samples in the source domain, which have high-resolution ground truth. Adversarial supervision is used to successfully super-resolve low-resolution RGB satellite imagery from the target domain without direct paired supervision from high-resolution counterparts.
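Adversarial domain adaptation of the kind described in this abstract (task loss on labeled synthetic data, adversarial alignment of features across domains) is commonly realized with a domain discriminator and a gradient-reversal layer. The minimal PyTorch sketch below illustrates that general pattern with stand-in networks and a hypothetical 10-parameter mesh regressor; it is not the dissertation's actual architecture or loss weighting.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity in the forward pass, negated gradient in backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Illustrative stand-in modules; the dissertation's actual networks differ.
feature_net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                            nn.AdaptiveAvgPool2d(1), nn.Flatten())
mesh_head   = nn.Linear(16, 10)   # regresses (hypothetical) mesh parameters
domain_head = nn.Linear(16, 1)    # predicts synthetic vs. real

def training_step(x_syn, mesh_gt, x_real):
    f_syn, f_real = feature_net(x_syn), feature_net(x_real)
    # Task loss only on synthetic samples, where ground-truth mesh parameters exist.
    task_loss = nn.functional.mse_loss(mesh_head(f_syn), mesh_gt)
    # Domain loss on both; gradient reversal pushes features to be domain-invariant.
    feats = torch.cat([grad_reverse(f_syn), grad_reverse(f_real)])
    dom_labels = torch.cat([torch.zeros(len(f_syn), 1), torch.ones(len(f_real), 1)])
    dom_loss = nn.functional.binary_cross_entropy_with_logits(domain_head(feats), dom_labels)
    return task_loss + dom_loss
```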
  • TOWARDS BUILDING GENERALIZABLE SPEECH EMOTION RECOGNITION MODELS
    (2019) Sahu, Saurabh; Espy-Wilson, Carol; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Detecting the mental state of a person has implications for psychiatry, medicine, psychology, and human-computer interaction systems, among others. It includes (but is not limited to) a wide variety of problems such as emotion detection, valence-affect-dominance state prediction, mood detection, and detection of clinical depression. In this thesis we focus primarily on emotion recognition. Like any recognition system, building an emotion recognition model consists of two steps: (1) extraction of meaningful features that aid classification, and (2) development of an appropriate classifier. Because speech data is non-invasive and easy to collect, it has become a popular candidate for feature extraction. However, an ideal system should be agnostic to speaker and channel effects. While feature normalization schemes can counter these problems to some extent, we still see a drastic drop in performance when the training and test datasets are mismatched. In this dissertation we explore some novel ways of building models that are more robust to speaker and domain differences. Training discriminative classifiers involves learning a conditional distribution p(y_i|x_i), given a set of feature vectors x_i and the corresponding labels y_i, i = 1, ..., N. For a classifier to be generalizable and not overfit the training data, the resulting conditional distribution p(y_i|x_i) should vary smoothly over the inputs x_i. Adversarial training procedures enforce this smoothness using manifold regularization techniques. Manifold regularization makes the model's output distribution more robust to local perturbations added to a datapoint x_i. In the first part of the dissertation, we investigate two training procedures: (i) adversarial training, where we determine the perturbation direction based on the given labels for the training data, and (ii) virtual adversarial training, where we determine the perturbation direction based only on the output distribution of the training data. We demonstrate the efficacy of adversarial training procedures by performing a k-fold cross-validation experiment on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset and a cross-corpus performance analysis on three separate corpora. We compare their performance to that of a model utilizing other regularization schemes, such as L1/L2 and a graph-based manifold regularization scheme. Results show an improvement over a purely supervised approach, as well as better generalization to cross-corpus settings. Our second approach to better discriminating between emotions leverages multi-modal learning and automatic speech recognition (ASR) systems toward improving the generalizability of an emotion recognition model that requires only speech as input. Previous studies have shown that emotion recognition models using only acoustic features do not perform satisfactorily in detecting valence level. Text analysis has been shown to be helpful for sentiment classification. We compared classification accuracies obtained from an audio-only model, a text-only model, and a multi-modal system leveraging both, by performing a cross-validation analysis on the IEMOCAP dataset. Confusion matrices show that it is valence-level detection that improves when textual information is incorporated. In the second stage of experiments, we used three ASR application programming interfaces (APIs) to obtain the transcriptions.
We compare the performance of the multi-modal systems using the ASR transcriptions with each other and with that of a system using ground-truth transcriptions. This is followed by a cross-corpus study. In the third part of the study we investigate the generalizability of generative adversarial network (GAN)-based models. GANs have gained a lot of attention from the machine learning community due to their ability to learn and mimic an input data distribution. GANs consist of a discriminator and a generator working in tandem, playing a min-max game to learn a target underlying data distribution when fed data points sampled from a simpler distribution (such as a uniform or Gaussian distribution). Once trained, they allow synthetic generation of examples sampled from the target distribution. We investigate the applicability of GANs for obtaining lower-dimensional representations from the higher-dimensional feature vectors pertinent to emotion recognition. We also investigate their ability to generate synthetic higher-dimensional feature vectors using points sampled from a lower-dimensional prior. Specifically, we investigate two setups: (i) when the lower-dimensional prior from which synthetic feature vectors are generated is pre-defined, and (ii) when the distribution of the lower-dimensional prior is learned from training data. We define the metrics used to measure and analyze the performance of these generative models in different train/test conditions. We perform a cross-validation analysis followed by a cross-corpus study. Finally, we make an attempt towards understanding the relation between two different sub-problems encompassed under mental state detection, namely depression detection and emotion recognition. We propose approaches that can be investigated to build better depression detection models by leveraging our ability to recognize emotions accurately.
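The two perturbation schemes contrasted in this abstract, adversarial training (label-driven) and virtual adversarial training (output-distribution-driven), can be sketched as follows in PyTorch. The functions assume a generic classifier `model` operating on feature vectors of shape (batch, dim); they are a simplified illustration of the standard techniques, not the dissertation's exact training procedure.

```python
import torch
import torch.nn.functional as F

def adversarial_perturbation(model, x, y, eps=0.01):
    """(i) Adversarial training: perturbation direction from the labeled loss gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return eps * grad / (grad.norm(p=2, dim=-1, keepdim=True) + 1e-12)

def virtual_adversarial_perturbation(model, x, eps=0.01, xi=1e-6):
    """(ii) Virtual adversarial training: direction that most changes the output
    distribution, estimated with one power-iteration step; no labels needed."""
    with torch.no_grad():
        p = F.softmax(model(x), dim=-1)
    d = torch.randn_like(x)
    d = xi * d / (d.norm(p=2, dim=-1, keepdim=True) + 1e-12)
    d.requires_grad_(True)
    kl = F.kl_div(F.log_softmax(model(x + d), dim=-1), p, reduction="batchmean")
    grad = torch.autograd.grad(kl, d)[0]
    return eps * grad / (grad.norm(p=2, dim=-1, keepdim=True) + 1e-12)
```

In both cases the returned perturbation is added to the input and a smoothness (consistency) term on the perturbed input is included in the training loss, which is the manifold-regularization effect the abstract refers to.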
  • Machine Learning of Facial Attributes Using Explainable, Secure and Generative Adversarial Networks
    (2018) Samangouei, Pouya; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    "Attributes" are referred to abstractions that humans use to group entities and phenomena that have a common characteristic. In machine learning (ML), attributes are fundamental because they bridge the semantic gap between humans and ML systems. Thus, researchers have been using this concept to transform complicated ML systems into interactive ones. However, training the attribute detectors which are central to attribute-based ML systems can still be challenging. It might be infeasible to gather attribute labels for rare combinations to cover all the corner cases, which can result in weak detectors. Also, it is not clear how to fill in the semantic gap with attribute detectors themselves. Finally, it is not obvious how to interpret the detectors' outputs in the presence of adversarial noise. First, we investigate the effectiveness of attributes for bridging the semantic gap in complicated ML systems. We turn a system that does continuous authentication of human faces on mobile phones into an interactive attribute-based one. We employ deep multi-task learning in conjunction with multi-view classification using facial parts to tackle this problem. We show how the proposed system decomposition enables efficient deployment of deep networks for authentication on mobile phones with limited resources. Next, we seek to improve the attribute detectors by using conditional image synthesis. We take a generative modeling approach for manipulating the semantics of a given image to provide novel examples. Previous works condition the generation process on binary attribute existence values. We take this type of approaches one step further by modeling each attribute as a distributed representation in a vector space. These representations allow us to not only toggle the presence of attributes but to transfer an attribute style from one image to the other. Furthermore, we show diverse image generation from the same set of conditions, which was not possible using existing methods with a single dimension per attribute. We then investigate filling in the semantic gap between humans and attribute classifiers by proposing a new way to explain the pre-trained attribute detectors. We use adversarial training in conjunction with an encoder-decoder model to learn the behavior of binary attribute classifiers. We show that after our proposed model is trained, one can see which areas of the image contribute to the presence/absence of the target attribute, and also how to change image pixels in those areas so that the attribute classifier decision changes in a consistent way with human perception. Finally, we focus on protecting the attribute models from un-interpretable behaviors provoked by adversarial perturbations. These behaviors create an inexplainable semantic gap since they are visually unnoticeable. We propose a method based on generative adversarial networks to alleviate this issue. We learn the training data distribution that is used to train the core classifier and use it to detect and denoise test samples. We show that the method is effective for defending facial attribute detectors.