Computer Science Theses and Dissertations

Permanent URI for this collection: http://hdl.handle.net/1903/2756

Search Results

Now showing 1 - 10 of 79
  • Item
    Developing and Measuring Latent Constructs in Text
    (2024) Hoyle, Alexander Miserlis; Resnik, Philip; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Constructs---like inflation, populism, or paranoia---are of fundamental concern to social science. Constructs are the vocabulary over which theory operates, and so a central activity is the development and measurement of latent constructs from observable data. Although the social sciences comprise fields with different epistemological norms, they share a concern for valid operationalizations that transparently map between data and measure. Economists at the US Bureau of Labor Statistics, for example, follow a hundred-page handbook to sample the egg prices that constitute the Consumer Price Index; clinical psychologists rely on suites of psychometric tests to diagnose schizophrenia. In many fields, this observable data takes the form of language: as a social phenomenon, language data can encode many of the latent social constructs that people care about. Commensurate with the increasing sophistication of language technologies and the growing amounts of available data, there has thus emerged a "text-as-data" paradigm aimed at "amplifying and augmenting" the analyses that compose research. At the same time, Natural Language Processing (NLP), the field from which the analysis tools originate, has often remained separate from real-world problems and guiding theories---at least when it comes to social science. Instead, it focuses on atomized tasks under the assumption that progress on low-level language aspects will generalize to higher-level problems that involve overlapping elements. This dissertation focuses on NLP methods and evaluations that facilitate the development and measurement of latent constructs from natural language, while remaining sensitive to the social sciences' need for interpretability and validity.
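    The "text-as-data" pipeline this abstract describes can be illustrated with a minimal sketch: a topic model turns raw documents into low-dimensional latent structure, and one topic's proportion is treated as a per-document construct measure. The tiny corpus, the scikit-learn pipeline, and the construct-topic choice below are illustrative assumptions, not the dissertation's data or method.

    ```python
    # Hedged sketch: measuring a latent construct from text via a topic model.
    # The corpus and the construct are hypothetical stand-ins.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "the elites have betrayed ordinary people",
        "the central bank raised interest rates to fight inflation",
        "ordinary people must take back power from corrupt elites",
        "egg prices drove last month's consumer price index higher",
    ]

    # Observable data -> document-term matrix
    X = CountVectorizer(stop_words="english").fit_transform(docs)

    # Latent structure: per-document topic proportions
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    theta = lda.fit_transform(X)

    # "Measurement": one topic's proportion becomes the construct score.
    # Validity requires verifying the topic transparently maps to the construct.
    construct_topic = 0  # chosen by inspecting top words (hypothetical)
    for doc, score in zip(docs, theta[:, construct_topic]):
        print(f"{score:.2f}  {doc}")
    ```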
  • Item
    Everything Efficient All at Once - Compressing Data and Deep Networks
    (2024) Girish, Sharath; Shrivastava, Abhinav; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    In this thesis, we examine the efficiency of deep networks and data, both of which are widely used in various computer vision/AI applications and are ubiquitous in today's information age. As deep networks continue to grow exponentially in size, improving their efficiency in terms of size and computation becomes necessary for deployment across mobile and other small devices with hardware constraints. Data efficiency is equally important: memory and network bandwidth become bottlenecks when storing and transmitting data that is itself being created at an exponential rate. In this work, we explore in detail various approaches to improve the efficiency of deep networks, as well as perform compression of various forms of data content. Efficiency of deep networks involves two major aspects: size, or the memory required to store deep networks on disk, and computation, or the number of operations/time taken to execute the network. The first work analyzes sparsity for computation reduction in the context of vision tasks that involve a large pretraining stage followed by downstream task finetuning. We show that task-specific sparse subnetworks are more efficient than generalized sparse subnetworks, which are denser and do not transfer well, and we analyze several behaviors of training sparse networks for various vision tasks. While efficient, this sparsity reduces computation only in theory and requires dedicated hardware for practical deployment. We therefore develop a framework for simultaneously reducing size and computation by utilizing a latent quantization framework along with regularization losses. We compress convolutional networks by more than an order of magnitude in size while maintaining accuracy and speeding up inference without dedicated hardware. Data can take different forms such as audio, language, image, or video. We develop approaches for improving the compression and efficiency of various forms of visual data, which take up the bulk of global network traffic as well as storage. This consists of 2D images or videos and, more recently, their 3D equivalents of static/dynamic scenes, which are becoming popular for immersive AR/VR applications, scene understanding, 3D-aware generative modeling, and so on. To achieve data compression, we utilize Implicit Neural Representations (INRs), which represent data signals in terms of deep network weights. We transform the problem of data compression into network compression, thereby learning efficient data representations. We first develop an algorithm for compression of 2D videos via autoregressive INRs whose weights are compressed with the latent-quantization framework. We then focus on learning a general-purpose INR that can compress different forms of data such as 2D images/videos and can potentially be extended to the audio or language domain, as well as to compression of 3D objects and scenes. Finally, while INRs can represent 3D information, they are slow to train and render, both of which matter for real-time 3D applications. We therefore utilize 3D Gaussian Splatting (3D-GS), a form of explicit representation for 3D scenes or objects. 3D-GS is quite fast to train and render, but consumes large amounts of memory and is especially inefficient for modeling dynamic scenes or 3D videos. We first develop a framework for efficiently training and compressing 3D-GS for static scenes, achieving large reductions in storage memory, runtime memory, and training and rendering time costs while maintaining high reconstruction quality. Next, we extend to dynamic scenes or 3D videos, developing an online streamable framework for 3D-GS: we learn per-frame 3D-GS and transmit only the residuals of the 3D-GS attributes, achieving large reductions in per-frame storage memory for online streamable 3D-GS while also reducing training time costs and maintaining high rendering speeds and reconstruction quality.
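    The core INR idea in this abstract admits a compact sketch: a small MLP f(x, y, t) -> RGB is fit to a video, so compressing the video reduces to compressing the network's weights (e.g., via quantization). The architecture, sizes, and training loop below are illustrative, not the thesis's actual method.

    ```python
    # Hedged sketch: an implicit neural representation (INR) fit to a tiny video.
    import torch
    import torch.nn as nn

    T, H, W = 4, 16, 16
    video = torch.rand(T, H, W, 3)  # stand-in for real frames

    # Coordinate grid in [-1, 1]^3, one (x, y, t) triple per pixel
    ts, ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, T),
        torch.linspace(-1, 1, H),
        torch.linspace(-1, 1, W),
        indexing="ij",
    )
    coords = torch.stack([xs, ys, ts], dim=-1).reshape(-1, 3)
    targets = video.reshape(-1, 3)

    inr = nn.Sequential(
        nn.Linear(3, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, 3), nn.Sigmoid(),
    )
    opt = torch.optim.Adam(inr.parameters(), lr=1e-3)

    for _ in range(200):  # overfit the network to this one signal
        opt.zero_grad()
        loss = nn.functional.mse_loss(inr(coords), targets)
        loss.backward()
        opt.step()

    # "Data compression" now means compressing these weights (quantization etc.).
    n_params = sum(p.numel() for p in inr.parameters())
    print(f"pixel values: {targets.numel()}, weights: {n_params}, mse: {loss.item():.4f}")
    ```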
  • Item
    Interpreting Visual Representations and Mitigating their Failures
    (2024) Kalibhat, Neha; Feizi, Soheil; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Deep learning has become the cornerstone of artificial intelligence (AI), particularly in language and computer vision domains. The progression in this field is reflected in numerous applications accessible to the general public, such as information retrieval via virtual assistants, content generation, autonomous vehicles, drug discovery, and medical imaging. This unprecedented rate of AI adoption raises the critical need for research on the fundamental underpinnings of deep neural networks to understand what leads to their decisions and why they fail. This thesis concentrates on self-supervised representation learning, a prevalent unsupervised method employed by foundational models to extract patterns from extensive visual data. Specifically, our focus lies in examining the low-dimensional representations generated by these models and dissecting their failure modes. In our initial investigation, we discover that self-supervised representations lack robustness to domain shifts, as they are not explicitly trained to distinguish image content from its domain. We remedy this issue by proposing a module that can be plugged into existing self-supervised baselines to disentangle their representation spaces and promote domain invariance and generalization. Our subsequent analysis delves into the patterns within representations that influence downstream classification. We scrutinize the discriminative capacity of individual features and their activations. We then propose an unsupervised quality metric that can preemptively determine whether a given representation will be correctly or incorrectly classified, with high precision. In the next segment of this thesis, we leverage our findings to further demystify the representation space, by uncovering interpretable subspaces which have unique concepts associated with them. We design a novel explainability framework that uses a vision-language model (such as CLIP) to provide natural language explanations for neural features (or groups) of a given pre-trained model. We next investigate the role of augmentations and format transformations in learning generalizable visual representations. Drawing inspiration from advancements in audio and speech modalities, we examine how presenting visual data in multiple formats affects learning, separating this from the impact of augmentations. In the final segment, we reveal compositionality as a notable failure mode in current state-of-the-art representation methods. We critique the use of fixed-size patches in vision transformers and demonstrate the benefits of employing semantically meaningful patches based on visual priors. This design adjustment leads to significant improvements in image-text retrieval tasks and, more importantly, enhances performance on compositionality benchmarks.
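    The explainability framework described above can be sketched at a high level: collect the images that maximally activate a neural feature, then ask a vision-language model which candidate caption best matches them. The model name, placeholder images, and candidate captions below are illustrative assumptions, not the dissertation's exact framework.

    ```python
    # Hedged sketch: naming a neural feature with CLIP-scored captions.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # Placeholders for the top-activating images of one probed feature
    top_images = [Image.new("RGB", (224, 224), c) for c in ("red", "darkred")]
    candidates = ["a red object", "a dog", "a body of water", "text on a page"]

    inputs = processor(text=candidates, images=top_images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)

    # Average image-text similarity over the feature's top images; the best
    # caption serves as a natural-language explanation of the feature.
    scores = out.logits_per_text.mean(dim=1)
    print(candidates[int(scores.argmax())])
    ```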
  • Item
    Advanced Video Modeling Techniques for Video Generation and Enhancement Tasks
    (2024) Shrivastava, Gaurav; Shrivastava, Abhinav; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    This thesis investigates advanced techniques for video modeling in generation and enhancement tasks. In the first part of the thesis, we explore generative modeling that exploits an external corpus for learning priors. The task here is video prediction, i.e., extrapolating future sequences given a few context frames. In follow-up work, we demonstrate how we can further reduce inference time and make the video prediction model more efficient. Additionally, we demonstrate that we can extrapolate not just one but multiple future sequences from the given context frames. In the second part, we explore methods that exploit the internal statistics of videos to perform various restoration and enhancement tasks. Here, we show how robustly these methods perform restoration tasks such as denoising, super-resolution, frame interpolation, and object removal. Furthermore, in follow-up work, we utilize the inherent compositionality of videos, together with their internal statistics, to perform a wider variety of enhancement tasks such as relighting, dehazing, and foreground/background manipulation. Lastly, we provide insight into future work on how data-free enhancement techniques and multi-step video prediction techniques can be improved.
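    The video-prediction task in this abstract has a simple skeleton: a model maps K context frames to the next frame, and rolling it forward on its own outputs extrapolates a future sequence. The tiny conv net and random frames below are placeholders, not the thesis's generative models.

    ```python
    # Hedged sketch: autoregressive next-frame prediction and rollout.
    import torch
    import torch.nn as nn

    K = 3  # number of context frames
    predictor = nn.Sequential(  # K stacked RGB frames -> next RGB frame
        nn.Conv2d(3 * K, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),
    )

    def rollout(context, n_future):
        """Extrapolate n_future frames from a list of (3, H, W) context frames."""
        frames = list(context)
        with torch.no_grad():
            for _ in range(n_future):
                inp = torch.cat(frames[-K:], dim=0).unsqueeze(0)  # (1, 3K, H, W)
                frames.append(predictor(inp).squeeze(0))
        return frames[len(context):]

    context = [torch.rand(3, 32, 32) for _ in range(K)]
    future = rollout(context, n_future=5)
    print(len(future), future[0].shape)  # 5 predicted frames of shape (3, 32, 32)
    ```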
  • Item
    Dynamical Memory in Deep Neural Networks
    (2024) Evanusa, Matthew S; Aloimonos, Yiannis; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    In this work, I will begin to lay out a roadmap, or framework, that I believe will serve the scientific communities of artificial intelligence and cognitive neuroscience in the future development and design of a thinking intelligent machine, based on the accumulated knowledge I have gathered across many sources: from my advisors, peers and colleagues, collaborators, talks, symposia and conferences, and long paper dives, over the almost decade that I have spent at my new home in College Park, Maryland. It is my hope and intent that this thesis serves its stated goal of advancing the science of memory integration in neural networks and, in addition, furthers the distant dream of discovering the mystery of what it means to be alive. It is important to note that while this thesis is focused on the critical integration of memory mechanisms into artificial neural networks, the author's larger goal is the creation of an overarching cognitive architecture that takes advantage of the right amount of advances from deep learning, with the right amount of insights from cognitive science and neuroscience - a "Goldilocks" of sorts for AI. It is my hope that through understanding mechanisms of memory and how they interact with our stimuli, we move one step closer to understanding our place in the cosmos.
  • Item
    Algorithmic Decision-making and Model Evaluation in Socially Consequential Domains
    (2024) Herlihy, Christine Robie; Dickerson, John P.; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Algorithms are increasingly used to create markets, discover and disseminate information, incentivize behaviors, and inform real-world decision-making in a variety of socially consequential domains. In such settings, algorithms have the potential to improve aggregate utility by leveraging previously acquired knowledge, reducing transaction costs, and facilitating the efficient allocation of resources, broadly construed. However, ensuring that the distribution over outcomes induced by algorithmic decision-making renders the broader system sustainable---i.e., by preserving rationality of participation for a diverse set of stakeholders, and identifying and mitigating the costs associated with unevenly distributed harms---remains challenging. One set of challenges arises during algorithm or model development: here, we must decide how to operationalize sociotechnical constructs of interest, induce prosocial behavior, balance uncertainty-reducing exploration and reward-maximizing exploitation, and incorporate domain-specific preferences and constraints. Common desiderata such as individual or subgroup fairness, cooperation, or risk mitigation often resist uncontested analytic expression, induce combinatorial relations, or are at odds with unconstrained optimization objectives and must be carefully incorporated or approximated so as to preserve utility and tractability. Another set of challenges arises during model evaluation: here, we must contend with small sample sizes and high variance when estimating performance for intersectional subgroups of interest, and determine whether observed performance on domain-specific reasoning tasks may be upwardly biased due to annotation artifacts or data contamination. In this thesis, we propose algorithms and evaluation methods to address these challenges and show how our methods can be applied to improve algorithmic acceptability and decision-making in the face of uncertainty in public health and conversational recommendation systems. Our core contributions include: (1) novel resource allocation algorithms to incorporate prosocial constraints while preserving utility in the restless bandit setting; (2) model evaluation techniques to inform harms identification and mitigation efforts; and (3) prompt-based interventions and meta-policy learning strategies to improve expected utility by encouraging context-aware uncertainty reduction in large language model (LLM)-based recommendation systems.
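    The restless-bandit setting in contribution (1) can be sketched concretely: each round a planner acts on at most B of N arms, and a prosocial window constraint keeps any arm from being ignored too long. The two-state dynamics, myopic index, and window rule below are simplifications for illustration, not the dissertation's algorithms.

    ```python
    # Hedged sketch: budgeted restless-bandit allocation with a fairness window.
    import random

    random.seed(0)
    N, B, W, ROUNDS = 6, 2, 4, 12

    # P[i][s][a]: prob. arm i is "good" next round given state s (0=bad, 1=good)
    # and action a (0=passive, 1=active)
    P = [[[random.uniform(0.1, 0.4), random.uniform(0.4, 0.7)],
          [random.uniform(0.5, 0.8), random.uniform(0.8, 0.95)]]
         for _ in range(N)]
    state = [1] * N
    last_acted = [0] * N

    for t in range(1, ROUNDS + 1):
        # Myopic index: expected gain in P(good) from acting vs. staying passive
        index = [P[i][state[i]][1] - P[i][state[i]][0] for i in range(N)]
        # Prosocial constraint: arms starved past window W get priority
        forced = [i for i in range(N) if t - last_acted[i] >= W]
        rest = sorted((i for i in range(N) if i not in forced),
                      key=lambda i: index[i], reverse=True)
        acted = set((forced + rest)[:B])  # budget B caps total actions
        for i in range(N):
            if i in acted:
                last_acted[i] = t
            state[i] = int(random.random() < P[i][state[i]][int(i in acted)])
        print(f"round {t:2d}: acted on {sorted(acted)}, arms in good state: {sum(state)}")
    ```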
  • Item
    Supervision and Data Dynamics in Vision Across Recognition and Generation Landscapes
    (2024) Suri, Saksham; Shrivastava, Abhinav; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    This thesis looks at visual perception through the lens of supervision and data dynamics across recognition and generation landscapes. Generative and discriminative modeling form important pillars of computer vision, and depending on the task, the techniques used to better learn from and utilize data and labels change. Through this work we investigate different tasks along this landscape, focusing on different supervision strategies, highlighting pitfalls in current approaches, and proposing modified architectures and losses to better utilize the data under different settings. On the recognition side, we start with a comprehensive analysis of Vision Transformers (ViTs) under varied supervision paradigms: a mix of explicit supervision, contrastive self-supervision, and reconstructive self-supervision, delving into attention mechanisms and learned representations. We then look at a more specific supervision setting geared towards object detection, called sparse supervision, where annotations are missing, and propose self- and semi-supervised techniques to solve this task. Finally, we explore a discovery-style framework with applications to GAN-generated image detection. Unlike the sparse supervision discussed earlier, this scenario handles the case where, at test time, we face an unknown number of new classes. Ours was the first work to propose this problem: instead of just identifying synthetic images, we also try to group them based on their generation source. The exploration of Generative Adversarial Networks (GANs) in an open-world scenario uncovers the intricacies of learning with limited supervision for discovery-style problems. On the generation side, we delve into different supervision strategies involving decomposing and decoupling representations. In the first work we tackle the problem of paired Image-to-Image (I2I) translation by decomposing supervision into reconstruction and residuals, highlighting issues with traditional training approaches. We then look at generating talking-head videos through two different kinds of supervision, video and audio. For driving the generation using video, we decouple representations for the task of few-shot talking-head synthesis, where supervision is provided using only a few samples (shots); for this task we factorize the representation into spatial and style components, which helps the learning. To additionally supervise the generation through audio, we look at multimodal supervision for lip-synchronized talking-head generation, incorporating audio and video modalities to synthesize lifelike talking heads that work even in in-the-wild scenarios. In the last part we showcase two works that link our experiences from generation and recognition, exploring generative modeling to improve recognition models. The first utilizes advancements in diffusion-based image generation models: given the high fidelity and control these models have brought, we use their synthetic data in a suitable pipeline to improve detection and segmentation performance. As a follow-up to our ViT analysis, we also propose a new technique that takes off-the-shelf pretrained ViTs and generates high-resolution features using a learned lightweight feature transform. These high-resolution features are especially effective for dense tasks like correspondence, segmentation, detection, and object discovery.
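    The final recognition idea above, high-resolution features from an off-the-shelf ViT, can be sketched as follows: extract the frozen ViT's coarse patch tokens, then apply a lightweight learnable transform that upsamples them into a denser feature map. The use of timm's forward_features and the particular upsampler are assumptions for illustration, not the thesis's architecture.

    ```python
    # Hedged sketch: lightweight upsampling of frozen ViT patch features.
    import timm
    import torch
    import torch.nn as nn

    vit = timm.create_model("vit_base_patch16_224", pretrained=False).eval()

    x = torch.rand(1, 3, 224, 224)
    with torch.no_grad():
        tokens = vit.forward_features(x)  # (1, 1 + 14*14, 768), CLS + patches
    patches = tokens[:, 1:, :]            # drop the CLS token
    feat = patches.transpose(1, 2).reshape(1, 768, 14, 14)  # coarse feature map

    # Learnable lightweight transform: 14x14 -> 56x56 features for dense tasks
    upsampler = nn.Sequential(
        nn.ConvTranspose2d(768, 256, kernel_size=4, stride=4), nn.ReLU(),
        nn.Conv2d(256, 768, kernel_size=3, padding=1),
    )
    hires = upsampler(feat)
    print(feat.shape, "->", hires.shape)
    ```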
  • Item
    Enhanced Robot Planning and Perception Through Environment Prediction
    (2024) Sharma, Vishnu Dutt; Tokekar, Pratap; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Mobile robots rely on maps to navigate through an environment. In the absence of any map, the robots must build one online from partial observations as they move through the environment. Traditional methods build a map using only direct observations. In contrast, humans identify patterns in the observed environment and make informed guesses about what to expect ahead. Modeling these patterns explicitly is difficult due to the complexity of the environments. However, these complex models can be approximated well using learning-based methods in conjunction with large amounts of training data. By extracting patterns, robots can use not only direct observations but also predictions of what lies ahead to better navigate through an unknown environment. In this dissertation, we present several learning-based methods to equip mobile robots with prediction capabilities for efficient and safer operation. In the first part of the dissertation, we learn to predict using geometric and structural patterns in the environment. Partially observed maps provide invaluable cues for accurately predicting the unobserved areas. We first demonstrate the capability of general learning-based approaches to model these patterns for a variety of overhead map modalities. Then we employ task-specific learning for faster navigation in indoor environments by predicting 2D occupancy in nearby regions. This idea is further extended to 3D point cloud representations for object reconstruction. By predicting the shape of the full object from only partial views, our approach paves the way for efficient next-best-view planning, a crucial requirement for energy-constrained aerial robots. Deploying a team of robots can also accelerate mapping. Our algorithms benefit from this setup, as more observations result in more accurate predictions, further improving efficiency in the aforementioned tasks. In the second part of the dissertation, we learn to predict using spatiotemporal patterns in the environment. We focus on dynamic tasks such as target tracking and coverage, where we seek decentralized coordination between robots. We first show how graph neural networks can be used for more scalable and faster inference while achieving coverage performance comparable to classical approaches. We find that differentiable design is instrumental here for end-to-end task-oriented learning. Building on this, we present a differentiable decision-making framework that consists of a differentiable decentralized planner and a differentiable perception module for dynamic tracking. In the third part of the dissertation, we show how to harness semantic patterns in the environment. Adding semantic context to the observations can help the robots decipher the relations between objects and infer what may happen next based on the activity around them. We present a pipeline using vision-language models to capture a wider scene with an overhead camera and provide assistance to humans and robots in the scene. We use this setup to implement an assistive robot that helps humans with daily tasks, and then present a semantic communication-based collaborative setup of overhead-ground agents, highlighting the embodiment-specific challenges they may encounter and how they can be overcome. The first three parts employ learning-based methods for predicting the environment. However, if the predictions are incorrect, this could pose a risk to the robot and its surroundings. The final part of the dissertation therefore presents risk-management methods with meta-reasoning over the predictions. We study two such methods: one extracting uncertainty from the prediction model for risk-aware planning, and another using a heuristic to adaptively switch between classical and prediction-based planning, resulting in safe and efficient robot navigation.
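    The 2D occupancy prediction described in the first part admits a minimal sketch: an encoder-decoder network takes a partially observed occupancy grid (with an explicit "unknown" channel) and predicts occupancy for the unobserved cells. The architecture and random tensors below are placeholders, not the dissertation's models.

    ```python
    # Hedged sketch: predicting unobserved occupancy from a partial grid map.
    import torch
    import torch.nn as nn

    # Input channels: [observed-occupied, observed-free, unknown-mask]
    net = nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(64, 1, 3, padding=1),  # logit of P(occupied) per cell
    )

    partial_map = torch.rand(1, 3, 64, 64)  # stand-in for a lidar-built grid
    full_map_gt = (torch.rand(1, 1, 64, 64) > 0.7).float()

    pred_logits = net(partial_map)
    loss = nn.functional.binary_cross_entropy_with_logits(pred_logits, full_map_gt)
    loss.backward()
    print(pred_logits.shape, float(loss))  # occupancy prediction for the whole grid
    ```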
  • Item
    Planning and Perception for Unmanned Aerial Vehicles in Object and Environmental Monitoring
    (2024) Dhami, Harnaik; Tokekar, Pratap; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Unmanned Aerial Vehicles (UAVs) equipped with high-resolution sensors are enabling data collection from previously inaccessible locations on a remarkable spatio-temporal scale. These systems hold immense promise for revolutionizing various fields, such as precision agriculture and infrastructure inspection, where access to data is important. To fully exploit their potential, the development of autonomy algorithms geared toward planning and perception is critical. In this dissertation, we develop planning and perception algorithms for settings where UAVs are used for data collection in monitoring applications. In the first part of this dissertation, we study problems of object monitoring and the planning challenges that arise with them. Object monitoring refers to the continuous observation, tracking, and analysis of specific objects within an environment. We start with the problem of visual reconstruction, where the planner must maximize visual coverage of a specific object in an unknown environment while minimizing time and cost. Our goal is to gain as much information about the object as quickly as possible. By utilizing shape-prediction deep learning models, we leverage predicted geometry for efficient planning, and we further extend this approach to a multi-UAV system. With a reconstructed 3D digital model, efficient paths around an object can be created for close-up inspection. However, the purpose of inspection is to detect changes in the object. The second problem we study is therefore inspecting an object when it has changed or when no prior information about it is known. We study this in the context of infrastructure inspection, validate our planning algorithm through real-world experiments and high-fidelity simulations, and further integrate defect detection into the process. In the second part, we study planning for monitoring entire environments rather than specific objects. Unlike object monitoring, here we are interested in environmental monitoring of spatio-temporal processes. The goal of a planner for environmental monitoring is to maximize coverage of an area in order to understand the spatio-temporal changes in the environment. We study this problem in slow-changing and fast-changing environments, specifically in the contexts of vegetative growth estimation and wildfire management. For fast-changing wildfire environments, we utilize informative path planning for wildfire validation and localization. Our work also leverages long short-term memory (LSTM) networks for early fire detection.
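    The planning flavor described here, maximizing coverage of an object while minimizing time and cost, can be sketched with a greedy next-best-view loop: repeatedly fly to the candidate viewpoint that reveals the most unseen surface per unit travel cost until the budget runs out. The abstract point-set geometry and viewpoints below are illustrative, not the dissertation's planner.

    ```python
    # Hedged sketch: greedy next-best-view selection under a travel budget.
    import math

    # Each candidate view covers a set of surface-point ids from an (x, y) pose.
    views = {
        "front": ({1, 2, 3, 4}, (0.0, -5.0)),
        "left":  ({3, 4, 5, 6}, (-5.0, 0.0)),
        "back":  ({6, 7, 8},    (0.0, 5.0)),
        "top":   ({2, 5, 8, 9}, (0.0, 0.0)),
    }

    def plan(budget, start=(0.0, 0.0)):
        covered, pose, remaining, path = set(), start, budget, []
        while True:
            best, best_score = None, 0.0
            for name, (pts, p) in views.items():
                if name in path:
                    continue
                cost = math.dist(pose, p) + 1e-6  # avoid division by zero
                gain = len(pts - covered)         # newly seen surface points
                if cost <= remaining and gain / cost > best_score:
                    best, best_score = name, gain / cost
            if best is None:
                return path, covered
            pts, p = views[best]
            remaining -= math.dist(pose, p)
            covered |= pts
            pose = p
            path.append(best)

    path, covered = plan(budget=20.0)
    print(path, f"covered {len(covered)} of 9 surface points")
    ```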
  • Item
    AI Empowered Music Education
    (2024) Shrestha, Snehesh; Aloimonos, Yiannis; Fermüller, Cornelia; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Learning a musical instrument is a complex process involving years of practice and feedback. However, dropout rates in music programs, particularly among violin students, remain high due to socio-economic barriers and the challenge of mastering the instrument. This work explores the feasibility of accelerating learning and leveraging technology in music education, with a focus on bowed string instruments, specifically the violin. My research identifies workflow gaps and challenges for the stakeholders, aiming not only to improve learning outcomes but also to provide opportunities for socioeconomically challenged students. Three key areas are emphasized: designing user studies and creating a comprehensive violin dataset, developing tools and deep learning algorithms for accurate performance assessment, and crafting a practice platform for student feedback. Three fundamental perspectives were essential: a) understanding the stakeholders and their specific challenges, b) understanding how the instrument operates and what actions the player must master to control its functions, and c) addressing the technical challenges associated with constructing and implementing detection and feedback systems. The existing datasets were inadequate for analyzing violin playing, primarily due to their lack of diversity in body types and skill levels, as well as the absence of well-synchronized and calibrated video data with corresponding ground-truth 3D poses and musical events. Our experiment design ensured that the collected data would be suitable for subsequent downstream tasks; these considerations played a significant role in determining the metrics used to evaluate the accuracy of the data and the success metrics for the subsequent tasks. At the foundation of movement analysis lies 3D human pose estimation. Unfortunately, current state-of-the-art algorithms face challenges in accurately estimating monocular 3D poses during instrument playing. These challenges arise from factors such as occlusions, partial views, human-object interactions, limited viewing angles, pixel density, and camera sampling rates. To address these issues, we developed a novel 3D pose estimation algorithm based on the insight that the music produced by the violin is a direct result of the corresponding motions. Our algorithm integrates visual observations with audio inputs to generate precise, high-resolution 3D pose estimates that are temporally consistent and conducive to downstream tasks. Providing effective feedback to learners is a nuanced process that requires balancing encouragement with challenge. Without a user-friendly interface and a motivational strategy, feedback runs the risk of being counterproductive. While current systems excel at detecting pitch and temporal misalignments and visually displaying them for analysis, they often overwhelm players. In this dissertation, we introduce two novel feedback systems. The first is a visual-haptic feedback system that overlays simple augmented cues on the user's body, gently guiding them back to the correct posture. The second is a haptic band synchronized with the music, enhancing students' perception of rhythmic timing and bowing intensities. Additionally, we developed an intuitive user interface for real-time feedback during practice sessions and performance reviews. This data can be shared with teachers for deeper insights into students' struggles and to track progress. This research aims to empower both students and teachers. By providing students with feedback during individual practice sessions and equipping teachers with tools to monitor and tailor AI interventions according to their preferences, this work serves as a valuable teaching assistant. By taking on tasks that teachers may not prefer or be able to perform, such as personalized feedback and progress tracking, this research endeavors to democratize access to high-quality music education and mitigate dropout rates in music programs.
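    The audio-visual pose estimation idea above rests on one insight: the violin's sound is a direct result of the player's motion, so audio can disambiguate what vision alone cannot. A hedged sketch of such a fusion model follows; the feature sizes, concatenation-based fusion, and temporal convolution are illustrative assumptions, not the dissertation's network.

    ```python
    # Hedged sketch: fusing visual and audio features for 3D pose estimation.
    import torch
    import torch.nn as nn

    T, N_JOINTS = 30, 17           # frames, body joints
    VIS_DIM, AUD_DIM = 64, 32      # per-frame visual / audio feature sizes

    class AudioVisualPose(nn.Module):
        def __init__(self):
            super().__init__()
            self.fuse = nn.Linear(VIS_DIM + AUD_DIM, 128)
            # Temporal convolution encourages temporally consistent estimates
            self.temporal = nn.Conv1d(128, 128, kernel_size=5, padding=2)
            self.head = nn.Linear(128, N_JOINTS * 3)

        def forward(self, vis, aud):  # vis: (B,T,VIS_DIM), aud: (B,T,AUD_DIM)
            h = torch.relu(self.fuse(torch.cat([vis, aud], dim=-1)))
            h = torch.relu(self.temporal(h.transpose(1, 2))).transpose(1, 2)
            return self.head(h).view(-1, T, N_JOINTS, 3)  # 3D joints per frame

    model = AudioVisualPose()
    pose = model(torch.rand(2, T, VIS_DIM), torch.rand(2, T, AUD_DIM))
    print(pose.shape)  # torch.Size([2, 30, 17, 3])
    ```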