Computer Science Theses and Dissertations
Permanent URI for this collection: http://hdl.handle.net/1903/2756
Item: FOUNDATIONS OF TRUSTWORTHY DEEP LEARNING: FAIRNESS, ROBUSTNESS, AND EXPLAINABILITY (2024)
Nanda, Vedant; Dickerson, John; Gummadi, Krishna; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Deep Learning (DL) models, especially with the rise of so-called foundation models, are increasingly used in real-world applications as autonomous systems (e.g., facial recognition), as decision aids (e.g., medical imaging, writing assistants), and even to generate novel content (e.g., chatbots, image generators). This naturally raises concerns about the trustworthiness of these systems: do the models systematically perform worse for certain subgroups? Are their outputs reliable under perturbations to the inputs? This thesis aims to strengthen the foundations of DL models so they can be trusted in deployment. I will cover three important aspects of trust: fairness, robustness, and explainability. I will argue that we need to expand the scope of each of these aspects when applying them to DL models and carefully consider possible tradeoffs between these desirable but sometimes conflicting notions of trust. Traditionally, the fairness community has worked on mitigating biases in classical models such as Support Vector Machines (SVMs) and logistic regression. However, many of the real-world applications in which bias appears involve much more complicated DL models. In the first part, I will present two works that show how thinking about fairness for DL introduces new challenges, especially due to the overparametrized nature of these models and their susceptibility to adversarial attacks. The robustness literature has focused largely on measuring the invariance of models to carefully constructed noise (adversarial attacks) or natural noise (distribution shifts). In the second part, I will argue that to get truly robust models, we must focus on a more general notion of robustness: measuring how well the invariances of DL models align with those of other models of perception, such as humans. I will present two works that measure shared invariances (1) between DL models and humans, and (2) between DL models. Such measurements provide a notion of "relative robustness," through which we can better understand the failure modes of DL models and work towards building truly robust systems. Finally, in the third part, I will show how even a small, randomly chosen subset of neurons from a pre-trained representation can transfer very well to downstream tasks. We call this phenomenon "diffused redundancy" and observe it in a variety of pre-trained representations. This finding challenges the belief in the explainability literature that individual neurons learn disjoint, semantically meaningful concepts.
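To make the "diffused redundancy" protocol concrete, here is a minimal Python sketch of the evaluation it implies: keep a random fraction of the dimensions of a frozen pre-trained representation and fit a linear probe on a downstream task. The array names and the choice of logistic regression as the probe are illustrative assumptions, not details taken from the thesis.

```python
# Minimal sketch of a "diffused redundancy" style probe: keep a random
# subset of dimensions from frozen pre-trained features and train a
# linear probe on them. `features_*` / `labels_*` are hypothetical
# pre-extracted arrays, not artifacts from the thesis.
import numpy as np
from sklearn.linear_model import LogisticRegression

def random_subset_probe(features_train, labels_train,
                        features_test, labels_test,
                        keep_fraction=0.1, seed=0):
    rng = np.random.default_rng(seed)
    dim = features_train.shape[1]
    keep = rng.choice(dim, size=max(1, int(keep_fraction * dim)),
                      replace=False)                    # random neurons
    probe = LogisticRegression(max_iter=1000)
    probe.fit(features_train[:, keep], labels_train)    # linear probe
    return probe.score(features_test[:, keep], labels_test)
```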
Item: IMPROVING MODEL AND DATA EFFICIENCY FOR DEEP LEARNING (2023)
Ni, Renkun; Goldstein, Tom; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Deep learning has achieved or even surpassed human-level performance on a wide range of challenging tasks spanning computer vision, natural language processing, and speech recognition. Nevertheless, such achievements are predominantly derived from training huge models (i.e., billions of parameters) on numerous labeled examples, which requires considerable computational resources and incurs expensive data collection costs. Various studies have sought to improve efficiency in these domains. In terms of model efficiency, remarkable advances have been made in accelerating training and inference through methods such as quantization and pruning. Regarding data efficiency, few-shot learning, semi-supervised learning, and self-supervised learning have gained attention for their ability to learn feature representations from few labeled examples or even without human supervision. This dissertation introduces several improvements to, and an in-depth analysis of, these methodologies, aiming to address the computational challenges and improve the efficiency of deep learning models, especially in computer vision. In addressing model efficiency, we explore potential improvements in both the training and inference phases. For inference acceleration, we investigate the challenges of using extremely low-resolution arithmetic in quantization methods, where integer overflows happen frequently and models are sensitive to them. To address this issue, we introduce a novel module designed to emulate the "wrap-around" property of integer overflow, which maintains comparable performance with 8-bit low-resolution accumulators. In addition, to scale inference of Vision Transformers to mobile devices, we propose an efficient and flexible local self-attention mechanism, optimized directly on mobile devices, that achieves performance comparable to global attention while significantly reducing on-device latency, especially for high-resolution tasks. Beyond computational cost, training deep neural networks consumes a large amount of memory, which is another bottleneck for training on edge devices. To improve the memory efficiency of training deep networks on resource-limited devices, we propose a quantization-aware training framework for federated learning in which only the quantized model is distributed to and trained on the client devices. In the realm of label efficiency, we first develop a better understanding of models trained by meta-learning, which has a unique training pipeline, for few-shot classification tasks. We also conduct a comprehensive analysis of integrating data augmentation strategies into the meta-learning pipeline, leading to Meta-MaxUp, a novel data augmentation technique for meta-learning that improves few-shot performance across various benchmarks. Beyond few-shot learning, we explore the application of meta-learning methods in the context of self-supervised learning, discussing the close relationship, under a certain task distribution, between meta-learning and contrastive learning, a method that achieves excellent results in self-supervised learning.
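As a concrete illustration of the overflow behavior the wrap-around module is said to emulate, here is a minimal sketch of two's-complement wrap-around in a low-bit accumulator. This is textbook modular arithmetic, not the thesis's module; the bit width and example values are illustrative.

```python
# Minimal sketch of two's-complement "wrap-around" arithmetic: the
# integer-overflow behavior that the module described above emulates.
def wrap_around(x, bits=8):
    """Map x onto the signed b-bit range [-2^(b-1), 2^(b-1) - 1]."""
    m = 1 << bits
    half = m >> 1
    return (x + half) % m - half

def overflowing_dot(w, a, bits=8):
    """Dot product whose running sum wraps like a b-bit accumulator."""
    acc = 0
    for wi, ai in zip(w, a):
        acc = wrap_around(acc + wi * ai, bits)
    return acc

print(wrap_around(130))                     # -126: overflow past 127 wraps
print(overflowing_dot([100, 100], [1, 1]))  # -56 with an 8-bit accumulator
```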
Item: ROBUSTNESS AND UNDERSTANDABILITY OF DEEP MODELS (2022)
Ghiasi, Mohammad Amin; Goldstein, Thomas; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Deep learning has made a considerable leap in the past few decades, from a promising approach for solving various problems to the state of the art. However, unlike classical machine learning models, deep learning models are sometimes difficult to explain: why and how do they make their decisions? Strikingly, their performance can also drop under small amounts of noise. In short, deep learning models are well-performing, easily corrupted, hard-to-understand models that beat human beings at many tasks. Consequently, improving these deep models requires a deep understanding. While deep learning models usually generalize well to unseen data, adding negligible amounts of noise to their input can flip their decision. This phenomenon is known as "adversarial attacks." In this thesis, we study several defense methods against such attacks. More specifically, we focus on defense methods that, unlike traditional ones, use less computation or fewer training examples. We also show that, despite the improvements in adversarial defenses, even provable certified defenses can be broken. Moreover, we revisit regularization as a means of improving adversarial robustness. Over the past years, many techniques have been developed for understanding and explaining how deep neural networks make decisions. This thesis introduces a new method for studying the building blocks of those decisions. First, we introduce Plug-In Inversion, a new method for inverting and visualizing deep neural network architectures, including Vision Transformers (ViTs). We then study the features a ViT learns in order to make a decision, comparing the features learned when the network is trained on labeled data with those learned under a language model's supervision, as in CLIP. Last, we introduce feature sonification, which borrows feature visualization techniques to study models trained for speech recognition (non-vision) tasks.
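For readers unfamiliar with inversion-style visualization, the following is a generic sketch of the underlying idea: optimize an input, starting from noise, to maximize one class logit of a frozen network. This is a standard baseline formulation, not Plug-In Inversion itself; the regularization weight, step count, and input shape are arbitrary assumptions.

```python
# Generic input-inversion loop in the spirit of the visualization work
# above: gradient-ascend a noise image toward one class logit of a
# frozen classifier. A textbook sketch, not Plug-In Inversion.
import torch

def invert_class(model, target_class, steps=256, lr=0.05,
                 shape=(1, 3, 224, 224), device="cpu"):
    model.eval().to(device)
    x = torch.randn(shape, device=device, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = model(x)
        loss = -logits[0, target_class]   # ascend the target logit
        loss = loss + 1e-4 * x.norm()     # mild L2 prior on the image
        loss.backward()
        opt.step()
    return x.detach()
```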
Item: Enhancing Visual and Gestural Fidelity for Effective Virtual Environments (2020)
Meng, Xiaoxu; Varshney, Amitabh; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

A challenge the virtual reality (VR) industry faces is that VR is not yet immersive enough to give people a genuine sense of presence: low frame rates cause dizziness, and the lack of human body visualization limits human-computer interaction. In this dissertation, I present our research on enhancing visual and gestural fidelity in virtual environments. First, I present a new foveated rendering technique, Kernel Foveated Rendering (KFR), which parameterizes foveated rendering by embedding polynomial kernel functions in log-polar space. This GPU-driven technique uses parameterized foveation that mimics the distribution of photoreceptors in the human retina. I present a two-pass kernel foveated rendering pipeline that maps well onto modern GPUs, and I have carried out user studies to empirically identify the KFR parameters, observing a 2.8x-3.2x rendering speedup on 4K displays. Second, I explore rendering acceleration through foveation for 4D light fields, which capture both spatial and angular rays, enabling free-viewpoint rendering and custom selection of the focal plane. I optimize the KFR algorithm by adjusting the weight of each slice in the light field so that it automatically selects the optimal foveation parameters for different images according to the gaze position. I have validated this approach on light-field rendering through both quantitative experiments and user studies, achieving speedups of 3.47x-7.28x for different levels of foveation and rendering resolutions. Third, I present a simple yet effective technique for further reducing the cost of foveated rendering by leveraging ocular dominance, the tendency of the human visual system to prefer scene perception from one eye over the other. Our new approach, eye-dominance-guided foveated rendering (EFR), renders the scene at a lower foveation level (with higher detail) for the dominant eye than for the non-dominant eye. Compared with traditional foveated rendering, EFR can be expected to provide superior rendering performance while preserving the same level of perceived visual quality. Finally, I present an approach that uses an end-to-end convolutional neural network, consisting of an encoder concatenated with a decoder, to reconstruct a 3D model of a human hand from a single RGB image. Previous work on hand mesh reconstruction suffers from a lack of training data; to train the networks with full supervision, we fit a parametric hand model to the 3D annotations and train on RGB images using the fitted parametric model as supervision. Our approach yields significantly improved quality compared to state-of-the-art hand mesh reconstruction techniques.

Item: Modeling Deep Context in Spatial and Temporal Domain (2018)
Dai, Xiyang; Davis, Larry S.; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Context has been one of the most important aspects of computer vision research because it provides useful guidance for solving a variety of tasks in both the spatial and temporal domains. With the recent rise of deep learning methods, deep networks have shown impressive performance on many computer vision tasks, and modeling deep context explicitly and implicitly in these networks can further boost their effectiveness and efficiency. In the spatial domain, implicitly modeling context is useful for learning discriminative texture representations: we present an effective deep fusion architecture that captures both the second-order and first-order statistics of texture features. Meanwhile, explicitly modeling context is important for challenging tasks such as fine-grained classification: we present a deep multi-task network that explicitly captures geometry constraints by simultaneously conducting fine-grained classification and key-point localization. In the temporal domain, explicitly modeling context is crucial to activity recognition and localization: we present a temporal context network that explicitly captures the relative context around a proposal, sampling pairs of two temporal scales for precise temporal localization of human activities. Meanwhile, implicitly modeling context can lead to better network architectures for video applications: we present a temporal aggregation network that learns a deep hierarchical representation for capturing temporal consistency. Finally, we conduct research on jointly modeling context in both the spatial and temporal domains for human action understanding, which requires predicting where, when, and what a human action happens in a crowded scene. We present a decoupled framework with dedicated branches for spatial localization and temporal recognition, in which contexts in the spatial and temporal branches are modeled explicitly and later fused to generate the final predictions.
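The spatial-domain fusion idea above (capturing first- and second-order statistics of texture features) can be sketched with generic bilinear-pooling-style code. The signed square root and L2 normalization are common practice in this literature, assumed here rather than taken from the thesis.

```python
# Generic sketch of fusing first- and second-order statistics of CNN
# texture features, in the spirit of the deep fusion architecture
# described above (not the thesis architecture itself).
import torch

def texture_descriptor(feats):
    """feats: (B, C, H, W) feature maps from any CNN backbone."""
    b, c, h, w = feats.shape
    x = feats.reshape(b, c, h * w)
    first = x.mean(dim=2)                               # first-order: mean pooling
    second = torch.bmm(x, x.transpose(1, 2)) / (h * w)  # second-order: C x C Gram
    second = second.reshape(b, c * c)
    second = torch.sign(second) * torch.sqrt(second.abs() + 1e-12)  # signed sqrt
    desc = torch.cat([first, second], dim=1)            # fuse both orders
    return torch.nn.functional.normalize(desc, dim=1)   # L2 normalize
```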
Item: Discourse-Level Language Understanding with Deep Learning (2017)
Iyyer, Mohit Nagaraja; Boyd-Graber, Jordan; Daumé, Hal; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Designing computational models that can understand language at a human level is a foundational goal in the field of natural language processing (NLP). Given a sentence, machines are capable of translating it into many different languages, generating a corresponding syntactic parse tree, marking words that refer to people or places, and much more. These tasks are solved by statistical machine learning algorithms that leverage patterns in large datasets to build predictive models. Many recent advances in NLP are due to deep learning models (parameterized as neural networks), which bypass user-specified features in favor of building representations of language directly from the text. Despite many deep learning-fueled advances at the word and sentence level, however, computers still struggle to understand high-level discourse structure in language: the way in which authors combine and order different units of text (e.g., sentences, paragraphs, chapters) to express a coherent message or narrative. Part of the reason is data-related: there are no existing datasets for many contextual language-based problems, and some tasks are too complex to be framed as supervised learning problems; for the latter type, we must either resort to unsupervised learning or devise training objectives that simulate the supervised setting. Another reason is architectural: neural networks designed for sentence-level tasks require additional functionality, interpretability, and efficiency to operate at the discourse level. In this thesis, I design deep learning architectures for three NLP tasks that require integrating information across high-level linguistic context: question answering, fictional relationship understanding, and comic book narrative modeling. While these tasks look very different on the surface, I show that similar neural network modules can be used in each case to form contextual representations.

Item: ROBUST REPRESENTATIONS FOR UNCONSTRAINED FACE RECOGNITION AND ITS APPLICATIONS (2016)
Chen, Jun-Cheng; Chellappa, Rama; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Face identification and verification are important problems in computer vision and have been actively researched for over two decades, with applications including mobile authentication, visual surveillance, social network analysis, and video content analysis. Many algorithms have been shown to work well on images collected in controlled settings. However, their performance often degrades significantly on images with large variations in pose, illumination, and expression, as well as under aging, cosmetics, and occlusion. Extracting robust and discriminative feature representations from face images and videos is therefore central to achieving good performance in uncontrolled settings. In this dissertation, we present several approaches for extracting robust feature representations from sets of images or video frames for face identification and verification. We first present a dictionary approach with dense facial landmark features: each face video is segmented into K partitions, and multi-scale features are extracted from patches centered at detected facial landmarks. Compact and representative dictionaries are then learned from the dense features of each partition and concatenated into a video dictionary representation. Experiments show that this representation is effective for the unconstrained video-based face identification task.
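Below is a minimal sketch of the per-partition video dictionary idea just described, using K-means centroids as a stand-in for whatever dictionary learning method the thesis actually uses; feature extraction and partitioning are assumed to happen upstream.

```python
# Sketch of a per-partition video dictionary: learn a small K-means
# "dictionary" from each partition's local features and concatenate
# the centroids into one video-level representation. K-means is a
# stand-in for the thesis's dictionary learning method.
import numpy as np
from sklearn.cluster import KMeans

def video_dictionary(partition_features, atoms=64, seed=0):
    """partition_features: list of (n_i, d) arrays, one per partition."""
    dictionaries = []
    for feats in partition_features:
        km = KMeans(n_clusters=min(atoms, len(feats)),
                    n_init=10, random_state=seed).fit(feats)
        dictionaries.append(km.cluster_centers_)   # one dictionary per partition
    return np.concatenate(dictionaries, axis=0)    # concatenated video dictionary
```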
Second, we present a landmark-based Fisher vector approach for video-based face verification. This approach encodes over-complete local features into a high-dimensional feature representation, then applies a learned joint Bayesian metric that projects the feature vector into a low-dimensional space and computes a similarity score. We then present an automated face verification system that exploits features from deep convolutional neural networks (DCNNs) trained on the CASIA-WebFace dataset. Our experimental results show that the DCNN model is able to characterize face variations from the large-scale source face dataset and generalizes well to another, smaller one. Finally, we demonstrate that a model pre-trained for face identification and verification encodes rich face information that benefits other face-related tasks with scarce annotated training data. Using apparent age estimation as an example, we develop a cascade convolutional neural network framework consisting of age group classification followed by age regression, in which a deep network is fine-tuned using the target data.
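As a hedged sketch of the verification step, the following scores two face descriptors after a learned low-dimensional projection, using cosine similarity as a simple stand-in for the joint Bayesian score described above. The projection matrix W and the threshold are hypothetical.

```python
# Sketch of face verification with projected descriptors: project two
# descriptors through a learned matrix W and compare them with cosine
# similarity (a stand-in for the joint Bayesian metric above).
import numpy as np

def verify(feat_a, feat_b, W, threshold=0.5):
    """feat_*: (d,) face descriptors; W: (d, k) learned projection."""
    a, b = feat_a @ W, feat_b @ W
    score = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return score, score > threshold   # same identity if above threshold
```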