UMD Theses and Dissertations
Permanent URI for this collection: http://hdl.handle.net/1903/3
New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a four-month delay before a given thesis or dissertation appears in DRUM.
More information is available at Theses and Dissertations at University of Maryland Libraries.
Search Results
140 results
Item Developing and Measuring Latent Constructs in Text (2024)
Hoyle, Alexander Miserlis; Resnik, Philip; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Constructs---like inflation, populism, or paranoia---are of fundamental concern to social science. Constructs are the vocabulary over which theory operates, and so a central activity is the development and measurement of latent constructs from observable data. Although the social sciences comprise fields with different epistemological norms, they share a concern for valid operationalizations that transparently map between data and measure. Economists at the US Bureau of Labor Statistics, for example, follow a hundred-page handbook to sample the egg prices that constitute the Consumer Price Index; clinical psychologists rely on suites of psychometric tests to diagnose schizophrenia. In many fields, this observable data takes the form of language: as a social phenomenon, language data can encode many of the latent social constructs that people care about. Commensurate with both the increasing sophistication of language technologies and the amount of available data, there has thus emerged a "text-as-data" paradigm aimed at "amplifying and augmenting" the analyses that compose research. At the same time, Natural Language Processing (NLP), the field from which the analysis tools originate, has often remained separate from real-world problems and guiding theories---at least when it comes to social science. Instead, it focuses on atomized tasks under the assumption that progress on low-level language aspects will generalize to higher-level problems that involve overlapping elements. This dissertation focuses on NLP methods and evaluations that facilitate the development and measurement of latent constructs from natural language, while remaining sensitive to the social sciences' need for interpretability and validity.

Item Representation Learning for Reinforcement Learning: Modeling Non-Gaussian Transition Probabilities with a Wasserstein Critic (2024)
Tse, Ryan; Zhang, Kaiqing; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Reinforcement learning algorithms depend on effective state representations when solving complex, high-dimensional environments. Recent methods learn state representations using auxiliary objectives that aim to capture relationships between states that are behaviorally similar, meaning states that lead to similar future outcomes under optimal policies. These methods learn explicit probabilistic state transition models and compute distributional distances between state transition probabilities as part of their measure of behavioral similarity. This thesis presents a novel extension to several of these methods that directly learns the 1-Wasserstein distance between state transition distributions by exploiting the Kantorovich-Rubinstein duality. This method eliminates parametric assumptions about the state transition probabilities while providing a smoother estimator of distributional distances. Empirical evaluation demonstrates improved sample efficiency over some of the original methods, at a modest increase in computational cost per sample. The results establish that relaxing theoretical assumptions about state transition modeling leads to more flexible and robust representation learning while maintaining strong performance characteristics.
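To make the Kantorovich-Rubinstein duality concrete: it rewrites the 1-Wasserstein distance as W1(P, Q) = sup over 1-Lipschitz functions f of E_P[f(x)] - E_Q[f(x)], so a neural "critic" trained to maximize that gap estimates the distance directly from samples, with no parametric form assumed for either distribution. Below is a minimal sketch of that general recipe in PyTorch; it is not the thesis's implementation, and the critic architecture and weight-clipping Lipschitz constraint are illustrative choices.

    import torch
    import torch.nn as nn

    class Critic(nn.Module):
        """Scalar-valued critic f used in the Kantorovich-Rubinstein dual."""
        def __init__(self, dim, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1))

        def forward(self, x):
            return self.net(x)

    def w1_estimate(p_samples, q_samples, steps=500, clip=0.05, lr=1e-3):
        """Maximize E_P[f] - E_Q[f] over (approximately) 1-Lipschitz critics."""
        critic = Critic(p_samples.shape[1])
        opt = torch.optim.Adam(critic.parameters(), lr=lr)
        for _ in range(steps):
            loss = -(critic(p_samples).mean() - critic(q_samples).mean())
            opt.zero_grad(); loss.backward(); opt.step()
            with torch.no_grad():  # crude Lipschitz control via weight clipping
                for p in critic.parameters():
                    p.clamp_(-clip, clip)
        with torch.no_grad():
            return (critic(p_samples).mean() - critic(q_samples).mean()).item()

    # Toy check: two 2D Gaussians with shifted means.
    p, q = torch.randn(512, 2), torch.randn(512, 2) + 2.0
    print(w1_estimate(p, q))

Weight clipping bounds the Lipschitz constant only loosely, so the returned value is best read as proportional to W1; a gradient penalty is the usual refinement.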
Item Everything Efficient All at Once - Compressing Data and Deep Networks (2024)
Girish, Sharath; Shrivastava, Abhinav; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

In this thesis, we examine the efficiency of deep networks and data, both of which are widely used in various computer vision/AI applications and are ubiquitous in today's information age. As deep networks continue to grow exponentially in size, improving their efficiency in terms of size and computation becomes necessary for deployment across various mobile/small devices with hardware constraints. Data efficiency is equally important due to the memory and network-speed bottlenecks in storing and transmitting data, which is also being created at an exponential rate. In this work, we explore in detail various approaches to improve the efficiency of deep networks, as well as perform compression of various forms of data content. Efficiency of deep networks involves two major aspects: size, or the memory required to store deep networks on disk, and computation, or the number of operations/time taken to execute the network. The first work analyzes sparsity for computation reduction in the context of vision tasks that involve a large pretraining stage followed by downstream task finetuning. We show that task-specific sparse subnetworks are more efficient than generalized sparse subnetworks, which are denser and do not transfer very well. We analyze several behaviors of training sparse networks for various vision tasks. While efficient, this sparsity reduces only computation, and only in theory: it requires dedicated hardware for practical deployment. We therefore develop a framework for simultaneously reducing size and computation by utilizing a latent quantization framework along with regularization losses. We compress convolutional networks by more than an order of magnitude in size while maintaining accuracy and speeding up inference without dedicated hardware. Data can take different forms such as audio, language, image, or video. We develop approaches for improving the compression and efficiency of various forms of visual data, which take up the bulk of global network traffic as well as storage. This consists of 2D images or videos and, more recently, their 3D equivalents of static/dynamic scenes, which are becoming popular for immersive AR/VR applications, scene understanding, 3D-aware generative modeling, and so on. To achieve data compression, we utilize Implicit Neural Representations (INRs), which represent data signals in terms of deep network weights. We transform the problem of data compression into network compression, thereby learning efficient data representations. We first develop an algorithm for compression of 2D videos via autoregressive INRs whose weights are compressed by utilizing the latent-quantization framework. We then focus on learning a general-purpose INR that can compress different forms of data such as 2D images/videos and can potentially be extended to the audio or language domain as well, and further to compression of 3D objects and scenes. Finally, while INRs can represent 3D information, they are slow to train and render, both of which matter for various real-time 3D applications. We utilize 3D Gaussian Splatting (3D-GS), a form of explicit representation for 3D scenes or objects. 3D-GS is quite fast to train and render, but consumes large amounts of memory and is especially inefficient for modeling dynamic scenes or 3D videos. We first develop a framework for efficiently training and compressing 3D-GS for static scenes. We achieve large reductions in storage memory, runtime memory, and training and rendering time costs while maintaining high reconstruction quality. Next, we extend to dynamic scenes or 3D videos, developing an online streamable framework for 3D-GS. We learn per-frame 3D-GS and learn/transmit only the residuals for the 3D-GS attributes, achieving large reductions in per-frame storage memory for online streamable 3D-GS while also reducing training time costs and maintaining high rendering speeds and reconstruction quality.
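The INR reframing above reduces data compression to network compression: fit a small network mapping (x, y) coordinates to intensity, then store quantized weights instead of pixels. A toy sketch of that idea, assuming a synthetic stand-in image (the thesis's autoregressive modeling, latent-quantization framework, and regularization losses are not shown):

    import torch
    import torch.nn as nn

    # Stand-in "image": a smooth 64x64 pattern defined on a coordinate grid.
    H = W = 64
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
    target = (0.5 + 0.5 * torch.sin(6 * xs) * torch.cos(6 * ys)).reshape(-1, 1)

    # The INR: a small MLP whose weights *are* the compressed representation.
    inr = nn.Sequential(nn.Linear(2, 64), nn.ReLU(),
                        nn.Linear(64, 64), nn.ReLU(),
                        nn.Linear(64, 1))
    opt = torch.optim.Adam(inr.parameters(), lr=1e-3)
    for step in range(2000):
        loss = ((inr(coords) - target) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

    n_params = sum(p.numel() for p in inr.parameters())
    print(f"MSE {loss.item():.5f}; {n_params} weights vs {H * W} pixels")

Once the signal lives in the weights, quantizing and entropy-coding those weights produces the actual bitstream.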
Item Interpreting Visual Representations and Mitigating their Failures (2024)
Kalibhat, Neha; Feizi, Soheil; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Deep learning has become the cornerstone of artificial intelligence (AI), particularly in language and computer vision domains. The progression in this field is reflected in numerous applications accessible to the general public, such as information retrieval via virtual assistants, content generation, autonomous vehicles, drug discovery, and medical imaging. This unprecedented rate of AI adoption raises the critical need for research on the fundamental underpinnings of deep neural networks: what leads to their decisions, and why do they fail? This thesis concentrates on self-supervised representation learning, a prevalent unsupervised method employed by foundation models to extract patterns from extensive visual data. Specifically, our focus lies in examining the low-dimensional representations generated by these models and dissecting their failure modes. In our initial investigation, we discover that self-supervised representations lack robustness to domain shifts, as they are not explicitly trained to distinguish image content from its domain. We remedy this issue by proposing a module that can be plugged into existing self-supervised baselines to disentangle their representation spaces and promote domain invariance and generalization. Our subsequent analysis delves into the patterns within representations that influence downstream classification. We scrutinize the discriminative capacity of individual features and their activations. We then propose an unsupervised quality metric that can preemptively determine, with high precision, whether a given representation will be correctly or incorrectly classified. In the next segment of this thesis, we leverage our findings to further demystify the representation space by uncovering interpretable subspaces that have unique concepts associated with them. We design a novel explainability framework that uses a vision-language model (such as CLIP) to provide natural-language explanations for neural features (or groups thereof) of a given pre-trained model. We next investigate the role of augmentations and format transformations in learning generalizable visual representations. Drawing inspiration from advancements in audio and speech modalities, we examine how presenting visual data in multiple formats affects learning, separating this from the impact of augmentations. In the final segment, we reveal compositionality as a notable failure mode of current state-of-the-art representation methods. We critique the use of fixed-size patches in vision transformers and demonstrate the benefits of employing semantically meaningful patches based on visual priors. This design adjustment leads to significant improvements in image-text retrieval tasks and, more importantly, enhances performance on compositionality benchmarks.
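The CLIP-based explanation idea admits a compact illustration: collect the images that most strongly activate a feature, then rank candidate text concepts by how well CLIP matches them to that image set. The sketch below uses the Hugging Face CLIP API with a hypothetical concept bank; it shows the general recipe, not the thesis's framework.

    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def explain_feature(top_images, candidate_concepts):
        """top_images: PIL images that maximally activate one neural feature."""
        inputs = processor(text=candidate_concepts, images=top_images,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = model(**inputs).logits_per_image   # (n_images, n_concepts)
        scores = logits.softmax(dim=-1).mean(dim=0)     # average over the image set
        best = scores.argmax().item()
        return candidate_concepts[best], scores[best].item()

    # Usage (hypothetical): images that fire a given unit, plus a concept bank.
    # concept, score = explain_feature(images, ["striped texture", "dog face", "sky"])

The concept with the highest average agreement becomes the natural-language label for the feature.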
Item Advance Video Modeling Techniques for Video Generation and Enhancement Tasks (2024)
Shrivastava, Gaurav; Shrivastava, Abhinav; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

This thesis investigates advanced techniques that are useful in video modeling for generation and enhancement tasks. In the first part of the thesis, we explore generative modeling that exploits an external corpus for learning priors. The task here is video prediction, i.e., extrapolating future sequences given a few context frames. In follow-up work, we also demonstrate how we can further reduce inference time and make the video prediction model more efficient. Additionally, we demonstrate that we can extrapolate not just one future sequence from the given context frames, but multiple. In the second part, we explore methods that exploit the internal statistics of videos to perform various restoration and enhancement tasks. Here, we show how robustly these methods perform restoration tasks such as denoising, super-resolution, frame interpolation, and object removal. Furthermore, in follow-up work, we utilize the inherent compositionality of videos and their internal statistics to perform a wider variety of enhancement tasks such as relighting, dehazing, and foreground/background manipulation. Lastly, we provide insight into future work on improving data-free enhancement techniques, along with further insights on improving multi-step video prediction.
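Video prediction in the style the abstract describes can be pictured as a next-frame model rolled out autoregressively on its own outputs. A toy sketch under that framing (a stand-in convolutional predictor, not the thesis's architecture):

    import torch
    import torch.nn as nn

    class NextFrame(nn.Module):
        """Map a stack of context frames to the next grayscale frame."""
        def __init__(self, context=4):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(context, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 1, 3, padding=1))

        def forward(self, frames):            # frames: (B, context, H, W)
            return self.net(frames)

    def rollout(model, context_frames, horizon):
        """Extrapolate `horizon` future frames from the context window."""
        frames = context_frames
        preds = []
        for _ in range(horizon):
            nxt = model(frames[:, -4:])       # predict from the last 4 frames
            preds.append(nxt)
            frames = torch.cat([frames, nxt], dim=1)
        return torch.cat(preds, dim=1)

    model = NextFrame()
    ctx = torch.rand(1, 4, 32, 32)            # toy grayscale context clip
    future = rollout(model, ctx, horizon=6)   # (1, 6, 32, 32)
    print(future.shape)

Sampling multiple plausible futures, as the abstract mentions, then amounts to injecting stochasticity (e.g., latent noise) into each rollout step.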
Item Dynamical Memory in Deep Neural Networks (2024)
Evanusa, Matthew S; Aloimonos, Yiannis; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

In this work, I begin to lay out a roadmap, or framework, that I believe will serve the scientific communities of artificial intelligence and cognitive neuroscience in the future development and design of a thinking, intelligent machine, based on the knowledge I have accumulated across many sources: from my advisors, peers and colleagues, collaborators, talks, symposia and conferences, and long paper dives, over the almost decade that I have spent at my new home in College Park, Maryland. It is my hope and intent that this thesis serves its stated goal of advancing the science of memory integration in neural networks and, in addition, furthers the distant dream of discovering the mystery of what it means to be alive. It is important to note that while this thesis is focused on the critical integration of memory mechanisms into artificial neural networks, the author's larger goal is the creation of an overarching cognitive architecture that takes advantage of the right amount of advances from deep learning with the right amount of insights from cognitive science and neuroscience: a "Goldilocks" of sorts for AI. It is my hope that through understanding mechanisms of memory and how they interact with our stimuli, we move one step closer to understanding our place in the cosmos.

Item Efficient learning-based sound propagation for virtual and real-world audio processing applications (2024)
Ratnarajah, Anton Jeran; Manocha, Dinesh; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Sound propagation is the process by which sound energy travels through a medium, such as air, to the surrounding environment as sound waves. The room impulse response (RIR) describes this process and is influenced by the positions of the source and listener, the room's geometry, and its materials. Physics-based acoustic simulators have been used for decades to compute accurate RIRs for specific acoustic environments. However, existing acoustic simulators have limitations: for example, they require a 3D representation and detailed material knowledge of the environment. To address these limitations, we propose three novel solutions. First, we introduce a learning-based RIR generator that is two orders of magnitude faster than an interactive ray-tracing simulator. Our approach can be trained to take both statistical and traditional acoustic parameters directly as input, and it can generate both monaural and binaural RIRs for both reconstructed and synthetic 3D scenes. Our generated RIRs outperform those from interactive ray-tracing simulators in speech-processing applications, including Automatic Speech Recognition (ASR), speech enhancement, and speech separation, by 2.5%, 12%, and 48%, respectively. Second, we propose estimating RIRs from reverberant speech signals and visual cues in the absence of a 3D representation of the environment. By estimating RIRs from reverberant speech, we can augment training data to match test data, improving the word error rate of the ASR system. Our estimated RIRs achieve a 6.9% improvement over previous learning-based RIR estimators in real-world far-field ASR tasks. We demonstrate that our audio-visual RIR estimator aids tasks like visual acoustic matching, novel-view acoustic synthesis, and voice dubbing, validated through perceptual evaluation. Finally, we introduce IR-GAN, which augments accurate RIRs using real RIRs. IR-GAN parametrically controls acoustic parameters learned from real RIRs to generate new RIRs that imitate different acoustic environments, outperforming ray-tracing simulators on the Kaldi far-field ASR benchmark by 8.95%.
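For intuition about what an RIR generator must produce: reverberant audio is just the dry signal convolved with the impulse response, and a crude statistical RIR is exponentially decaying noise shaped by a target RT60 (the time for energy to decay by 60 dB), one of the statistical parameters such generators can accept. A toy sketch, in no way a substitute for a learned or physics-based generator:

    import numpy as np

    def toy_rir(rt60=0.5, sr=16000, length=1.0):
        """Exponentially decaying noise reaching -60 dB at t = rt60 seconds."""
        t = np.arange(int(sr * length)) / sr
        decay = np.exp(-6.908 * t / rt60)   # ln(10**3) ~= 6.908 gives -60 dB at rt60
        rir = np.random.randn(t.size) * decay
        return rir / np.abs(rir).max()

    def reverberate(dry, rir):
        """Reverberant signal = dry signal convolved with the RIR."""
        wet = np.convolve(dry, rir)[: dry.size]
        return wet / np.abs(wet).max()

    sr = 16000
    dry = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)   # stand-in for dry speech
    wet = reverberate(dry, toy_rir(rt60=0.7, sr=sr))
    print(wet.shape)

Convolving clean training speech with generated RIRs in exactly this way is how such generators augment ASR training data.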
Item Algorithmic Decision-making and Model Evaluation in Socially Consequential Domains (2024)
Herlihy, Christine Robie; Dickerson, John P.; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Algorithms are increasingly used to create markets, discover and disseminate information, incentivize behaviors, and inform real-world decision-making in a variety of socially consequential domains. In such settings, algorithms have the potential to improve aggregate utility by leveraging previously acquired knowledge, reducing transaction costs, and facilitating the efficient allocation of resources, broadly construed. However, ensuring that the distribution over outcomes induced by algorithmic decision-making renders the broader system sustainable---i.e., by preserving rationality of participation for a diverse set of stakeholders, and by identifying and mitigating the costs associated with unevenly distributed harms---remains challenging. One set of challenges arises during algorithm or model development: here, we must decide how to operationalize sociotechnical constructs of interest, induce prosocial behavior, balance uncertainty-reducing exploration against reward-maximizing exploitation, and incorporate domain-specific preferences and constraints. Common desiderata such as individual or subgroup fairness, cooperation, or risk mitigation often resist uncontested analytic expression, induce combinatorial relations, or are at odds with unconstrained optimization objectives, and must be carefully incorporated or approximated so as to preserve utility and tractability. Another set of challenges arises during model evaluation: here, we must contend with small sample sizes and high variance when estimating performance for intersectional subgroups of interest, and determine whether observed performance on domain-specific reasoning tasks may be upwardly biased due to annotation artifacts or data contamination. In this thesis, we propose algorithms and evaluation methods to address these challenges and show how our methods can be applied to improve algorithmic acceptability and decision-making in the face of uncertainty in public health and conversational recommendation systems. Our core contributions include: (1) novel resource allocation algorithms that incorporate prosocial constraints while preserving utility in the restless bandit setting; (2) model evaluation techniques that inform harms identification and mitigation efforts; and (3) prompt-based interventions and meta-policy learning strategies that improve expected utility by encouraging context-aware uncertainty reduction in large language model (LLM)-based recommendation systems.
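One evaluation challenge named above, high variance for small intersectional subgroups, is easy to make concrete: bootstrap confidence intervals around subgroup accuracy widen sharply as the subgroup shrinks. A small illustrative sketch with synthetic outcomes (not the thesis's estimators):

    import numpy as np

    rng = np.random.default_rng(0)

    def bootstrap_accuracy_ci(correct, n_boot=10000, alpha=0.05):
        """correct: 0/1 array of per-example outcomes for one subgroup."""
        correct = np.asarray(correct)
        boots = rng.choice(correct, size=(n_boot, correct.size)).mean(axis=1)
        lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
        return correct.mean(), (lo, hi)

    # A broad subgroup with 200 examples vs an intersectional one with 12:
    big = rng.random(200) < 0.8
    small = rng.random(12) < 0.8
    for name, grp in [("subgroup A (n=200)", big), ("subgroup A-and-B (n=12)", small)]:
        acc, (lo, hi) = bootstrap_accuracy_ci(grp)
        print(f"{name}: acc={acc:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")

The point estimate for the tiny subgroup can look fine while its interval spans a large range, which is exactly why naive per-subgroup comparisons mislead.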
Item Supervision and Data Dynamics in Vision Across Recognition and Generation Landscapes (2024)
Suri, Saksham; Shrivastava, Abhinav; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

This thesis looks at visual perception through the lens of supervision and data dynamics across recognition and generation landscapes. Generative and discriminative modeling form important pillars of computer vision, and depending on the task, the techniques to better learn from and utilize the data and labels can change. Through this work we investigate different tasks along this landscape, focusing on different supervision strategies, highlighting pitfalls in current approaches, and proposing modified architectures and losses to better utilize the data under different settings. On the recognition side, we start with a comprehensive analysis of Vision Transformers (ViTs) under varied supervision paradigms. We look at a mix of explicit supervision, contrastive self-supervision, and reconstructive self-supervision by delving into attention mechanisms and learned representations. We then look at a more specific case of supervision geared towards object detection, called sparse supervision, where annotations are missing. We propose to utilize self- and semi-supervised techniques to solve this task. Finally, we also explore a discovery-style framework with applications to GAN-generated image detection. Unlike the sparse supervision discussed earlier, this scenario handles the case where, at test time, we have an unknown number of new classes. Ours was the first work to propose this problem, in which, instead of just identifying synthetic images, we also try to group them based on their generation source. This exploration of Generative Adversarial Networks (GANs) in an open-world scenario uncovers the intricacies of learning with limited supervision for discovery-style problems. On the generation side, we delve into different supervision strategies involving decomposing and decoupling representations. In the first work we tackle the problem of paired image-to-image (I2I) translation by decomposing supervision into reconstruction and residuals, and highlight issues with traditional training approaches. We then look at generating talking-head videos through two different kinds of supervision, video and audio. For driving the generation using video, we look at decoupling representations for the task of few-shot talking-head synthesis, where supervision is provided using only a few samples (shots). For this task we factorize the representation into spatial and style components, which helps the learning. To additionally supervise the generation through audio, we look at multimodal supervision for lip-synchronized talking-head generation. For this we incorporate audio and video modalities to synthesize lifelike talking heads that work even in in-the-wild scenarios. In the last part we showcase two works that link our experiences from generation and recognition, exploring generative modeling to improve recognition models. The first utilizes the advancements in diffusion-based image generation models: given the high fidelity and control of generation that diffusion models have brought, we use synthetic data from these models and create a suitable pipeline to utilize this data effectively to improve detection and segmentation performance. As a follow-up to our ViT analysis, we also propose a new technique that uses off-the-shelf pretrained ViTs to generate high-resolution features via a learned lightweight feature transform. These high-resolution features are especially effective for dense tasks like correspondence, segmentation, detection, and object discovery.
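The high-resolution ViT feature idea can be sketched as a frozen backbone's patch-token grid passed through a small learned upsampling transform. The module below is a hypothetical stand-in for the thesis's feature transform, shown only to fix the shapes and overall dataflow:

    import torch
    import torch.nn as nn

    class FeatureUpsampler(nn.Module):
        """Lightweight learned transform: 2x spatial upsampling of patch features."""
        def __init__(self, dim=768):
            super().__init__()
            self.net = nn.Sequential(
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
                nn.Conv2d(dim, dim, 1))

        def forward(self, feats):  # feats: (B, dim, h, w) patch-feature grid
            return self.net(feats)

    # A frozen ViT-B/16 on a 224x224 image yields a 14x14 grid of 768-d patch
    # tokens; reshaped to (B, 768, 14, 14), the transform maps it to 28x28.
    tokens = torch.randn(1, 14 * 14, 768)                 # stand-in ViT patch tokens
    grid = tokens.transpose(1, 2).reshape(1, 768, 14, 14)
    dense = FeatureUpsampler()(grid)
    print(dense.shape)  # torch.Size([1, 768, 28, 28])

Because the backbone stays frozen, only the small transform needs training, which is what makes the approach cheap to bolt onto off-the-shelf ViTs.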
Item Enhanced Robot Planning and Perception Through Environment Prediction (2024)
Sharma, Vishnu Dutt; Tokekar, Pratap; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Mobile robots rely on maps to navigate through an environment. In the absence of any map, the robots must build one online from partial observations as they move through the environment. Traditional methods build a map using only direct observations. In contrast, humans identify patterns in the observed environment and make informed guesses about what to expect ahead. Modeling these patterns explicitly is difficult due to the complexity of environments. However, these complex models can be approximated well using learning-based methods in conjunction with large training data. By extracting patterns, robots can use not only direct observations but also predictions of what lies ahead to better navigate through an unknown environment. In this dissertation, we present several learning-based methods to equip mobile robots with prediction capabilities for efficient and safer operation. In the first part of the dissertation, we learn to predict using geometric and structural patterns in the environment. Partially observed maps provide invaluable cues for accurately predicting the unobserved areas. We first demonstrate the capability of general learning-based approaches to model these patterns for a variety of overhead map modalities. Then we employ task-specific learning for faster navigation in indoor environments by predicting 2D occupancy in nearby regions. This idea is further extended to 3D point cloud representations for object reconstruction. By predicting the shape of the full object from only partial views, our approach paves the way for efficient next-best-view planning, a crucial requirement for energy-constrained aerial robots. Deploying a team of robots can also accelerate mapping. Our algorithms benefit from this setup, as more observations result in more accurate predictions, further improving efficiency in the aforementioned tasks. In the second part of the dissertation, we learn to predict using spatiotemporal patterns in the environment. We focus on dynamic tasks such as target tracking and coverage, where we seek decentralized coordination between robots. We first show how graph neural networks can be used for more scalable and faster inference while achieving coverage performance comparable to classical approaches. We find that differentiable design is instrumental here for end-to-end task-oriented learning. Building on this, we present a differentiable decision-making framework that consists of a differentiable decentralized planner and a differentiable perception module for dynamic tracking. In the third part of the dissertation, we show how to harness semantic patterns in the environment. Adding semantic context to the observations can help the robots decipher the relations between objects and infer what may happen next based on the activity around them. We present a pipeline that uses vision-language models to capture a wider scene through an overhead camera and provide assistance to humans and robots in the scene. We use this setup to implement an assistive robot that helps humans with daily tasks, and then present a semantic-communication-based collaborative setup of overhead and ground agents, highlighting the embodiment-specific challenges they may encounter and how they can be overcome. The first three parts employ learning-based methods for predicting the environment; however, if the predictions are incorrect, this could pose a risk to the robot and its surroundings. The final part of the dissertation therefore presents risk-management methods that meta-reason over the predictions. We study two such methods: one extracting uncertainty from the prediction model for risk-aware planning, and another using a heuristic to adaptively switch between classical and prediction-based planning, resulting in safe and efficient robot navigation.
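The first risk-management method, extracting uncertainty from the prediction model, can be illustrated with Monte Carlo dropout: keep dropout active at inference, sample several predicted occupancy maps, and treat the per-cell spread as uncertainty for the planner to penalize. A toy sketch (the predictor is a stand-in for the dissertation's occupancy models, and MC dropout is one illustrative choice of uncertainty estimator):

    import torch
    import torch.nn as nn

    class OccupancyPredictor(nn.Module):
        """Toy network: complete a partially observed 2D occupancy grid."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.Dropout2d(0.2),
                nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(), nn.Dropout2d(0.2),
                nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())

        def forward(self, partial_map):
            return self.net(partial_map)

    @torch.no_grad()
    def predict_with_uncertainty(model, partial_map, n_samples=20):
        model.train()  # keep dropout active at inference (MC dropout)
        samples = torch.stack([model(partial_map) for _ in range(n_samples)])
        return samples.mean(0), samples.std(0)   # occupancy estimate, uncertainty

    model = OccupancyPredictor()
    partial = torch.rand(1, 1, 32, 32)           # partially observed occupancy grid
    occ, unc = predict_with_uncertainty(model, partial)
    risky = unc > unc.mean()                     # cells a risk-aware planner might avoid
    print(occ.shape, risky.float().mean().item())

A planner can then inflate traversal costs in high-variance cells, trading a longer path for a lower chance of acting on a wrong prediction.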