Computer Science Theses and Dissertations
Permanent URI for this collection: http://hdl.handle.net/1903/2756
Recent Submissions
Item Object-Attribute Compositionality for Visual Understanding (2024) Saini, Nirat; Shrivastava, Abhinav; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Object appearances evolve over time, which results in visually discernible changes in their colors, shapes, sizes, and materials. Humans are innately good at recognizing and understanding the evolution of object states, which is also crucial for visual understanding across images and videos. However, current vision models still struggle to capture and account for these subtle changes when recognizing objects and the underlying actions causing the changes. This thesis focuses on using compositional learning for recognition and generation of attribute-object pairs. In the first part, we propose to disentangle visual features for objects and attributes, to generalize recognition to novel object-attribute pairs. Next, we extend this approach to learn entirely unseen attribute-object pairs, using semantic language priors, label smoothing, and propagation techniques. Further, we use object states for action recognition in videos, where subtle changes in object attributes and affordances help in identifying state-modifying and context-transforming actions. All of these methods for decomposing and composing objects and states generalize to unseen pairs and out-of-domain datasets for various compositional zero-shot learning and action recognition tasks. In the second part, we propose a new benchmark suite, Chop & Learn, for the novel task of Compositional Image Generation, and discuss the implications of these approaches for other compositional tasks in images, videos, and beyond. We further extend insertion and editing of object attributes consistently across frames of videos, using an off-the-shelf, training-free architecture, and discuss the future challenges and opportunities of compositionality for visual understanding.

Item Feedback for Vision (2024) Maynord, Michael; Aloimonos, Yiannis; Fermüller, Cornelia; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Feedback plays a prominent role in biological vision, where perception is modulated based on agents' evolving expectations and world model. This is the case both in visually understanding the static structure of the world and in modeling the dynamic structure of action. In this thesis we present, first, an approach to incorporating controlled feedback into image understanding; second, an adaptation of this approach to action understanding; and lastly, a notion of feedback in video monitoring. First, we introduce a novel mechanism which modulates perception based on high-level categorical expectations: Mid-Vision Feedback (MVF). MVF associates high-level contexts with linear transformations. When a context is "expected," its associated linear transformation is applied over feature vectors in a mid-level of a network. The result is that mid-level network representations are biased towards conformance with high-level expectations, improving overall accuracy and contextual consistency. Additionally, during training, mid-level feature vectors are biased through the introduction of a loss term which increases the distance between feature vectors associated with different contexts. MVF is agnostic as to the source of contextual expectations, and can serve as a mechanism for top-down integration of symbolic systems with deep vision architectures.
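The mechanism just described, per-context linear maps applied to mid-level features plus a training loss that pushes contexts apart, can be sketched as follows. This is a minimal PyTorch sketch with hypothetical names and shapes, not the thesis's implementation:

```python
import torch
import torch.nn as nn

class MidVisionFeedback(nn.Module):
    """Associates each high-level context with a linear transformation that is
    applied to mid-level feature vectors when that context is expected."""
    def __init__(self, num_contexts: int, feat_dim: int):
        super().__init__()
        # One learned linear map per context (hypothetical design).
        self.context_transforms = nn.ModuleList(
            [nn.Linear(feat_dim, feat_dim) for _ in range(num_contexts)]
        )

    def forward(self, feats: torch.Tensor, context_id: int) -> torch.Tensor:
        # feats: (batch, num_vectors, feat_dim) mid-level feature vectors.
        return self.context_transforms[context_id](feats)

def context_separation_loss(feats: torch.Tensor, context_ids: torch.Tensor) -> torch.Tensor:
    """Training-time term pushing apart mean features of different contexts.
    feats: (N, feat_dim); context_ids: (N,). Assumes the batch spans >= 2 contexts."""
    means = torch.stack([feats[context_ids == c].mean(dim=0)
                         for c in context_ids.unique()])
    dists = torch.cdist(means, means)                      # pairwise context distances
    off_diag = dists[~torch.eye(len(means), dtype=torch.bool)]
    return -off_diag.mean()                                # minimizing maximizes separation
```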
We demonstrate the utility of MVF for object classification across three popular datasets and multiple architectures, including both Convolutional Neural Network architectures and a Transformer architecture. We adapt MVF for action understanding with Sub-Action Modulation (SAM) for video networks. When humans interpret action, they bring high-level expectations of the context in which those actions are being performed. Following this reasoning, we develop an approach to incorporating context into action understanding. Video segments are classified uniquely into a small set of action primitives (called Therbligs), which are grouped hierarchically into "Meta-Therbligs" as a context representation. SAM is an approach to first modeling Meta-Therbligs, and then incorporating expectation of Meta-Therbligs into mid-level processes through feedback. This allows the modulation of mid-level features in accordance with a temporally compositional representation of context. We show the superior performance of MVF compared to post-hoc filtering for incorporation of contextual knowledge, and show superior performance of configurations using predicted context (when no context is known a priori) over configurations with no context awareness. We demonstrate the utility of SAM over four popular video understanding architectures - I3D, MoViNet, TimeSFormer, and ViViT. Experiments over EPIC Kitchens and 50 Salads on the tasks of action recognition and anticipation demonstrate that SAM produces superior accuracies across all models, tasks, and datasets with minimal architectural alterations. Lastly, we consider a notion of "feedback" where high-level expectations, or specifications, are provided by human operators, allowing integration of humans into the perceptual loop. This is important for interfacing with humans, as perceptual tasks which are conventionally left entirely to human labor are increasingly, though imperfectly, automated. We consider the task of surveillance. Security watchstanders who monitor multiple videos over long periods of time can be susceptible to information overload and fatigue. To address this, we present a configurable perception pipeline architecture, called the Image Surveillance Assistant (ISA), for assisting watchstanders with video surveillance tasks. We also present ISA-1, an initial implementation that can be configured with a set of context specifications which watchstanders can select or provide to indicate what imagery should generate notifications. ISA-1's inputs include (1) an image and (2) context specifications, which contain English sentences and a decision boundary defined over object detection vectors. ISA-1 assesses the match of the image with the contexts by comparing (1) detected versus specified objects and (2) automatically generated versus specified captions. Finally, we present a study to assess the utility of using captions in ISA-1, and find that they substantially improve the performance of image context detection. Beyond feedback, notions of context, and the contrast used to separate contexts for better manipulation in the above feedback work, can benefit not only feedback architectures but feed-forward architectures as well. We apply this intuition to the task of action understanding in video, where input is separated into motion and "context."
Motivated by Goldman's Theory of Human Action - a framework in which action decomposes into 1) base physical movements and 2) the context in which they occur - we propose a novel learning formulation for motion and context, where context is derived as the complement of motion. More specifically, we model physical movement through the adoption of Therbligs, a set of elemental physical motions centered around object manipulation. Context is modeled through the use of a contrastive mutual information loss that formulates context information as the action information not contained within movement information. We empirically demonstrate the utility of this separation of representations, showing sizable improvements in action recognition and action anticipation accuracies for a variety of models. We present results over two object manipulation datasets: EPIC Kitchens 100 and 50 Salads.

Item Machine Learning with Differentiable Physics Priors (2024) Qiao, Yiling; Lin, Ming; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Differentiable physics priors enable gradient-based learning systems to adhere to physical dynamics. By making physics simulations differentiable, we can backpropagate through the physical consequences of actions. This pipeline allows agents to quickly learn to achieve desired effects in the physical world and is an effective technique for solving inverse problems in physical or dynamical systems. This new programming paradigm bridges model-based and data-driven methods, mitigating data scarcity and model bias simultaneously. My research focuses on developing scalable, powerful, and efficient differentiable physics simulators. We have created state-of-the-art differentiable physics simulators for rigid bodies, cloth, fluids, articulated bodies, and deformable solids, achieving performance orders of magnitude better than existing alternatives. These differentiable simulators are applied to solve inverse problems, train control policies, and enhance reinforcement learning algorithms.
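To illustrate the paradigm (a toy sketch, not the thesis's simulators): because every step of a simulation rollout is differentiable, gradient descent can recover inputs that achieve a desired physical outcome. Here, an initial velocity is recovered by backpropagating a landing error through a ballistic rollout:

```python
import torch

def simulate(v0: torch.Tensor, steps: int = 100, dt: float = 0.01) -> torch.Tensor:
    """Differentiable ballistic rollout; returns the final 2D position."""
    pos = torch.zeros(2)
    vel = v0.clone()
    g = torch.tensor([0.0, -9.8])
    for _ in range(steps):       # every op is differentiable, so gradients
        vel = vel + g * dt       # flow back through the entire rollout
        pos = pos + vel * dt
    return pos

target = torch.tensor([1.5, 0.5])                 # desired landing point
v0 = torch.tensor([1.0, 1.0], requires_grad=True) # unknown initial velocity
opt = torch.optim.Adam([v0], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss = ((simulate(v0) - target) ** 2).sum()
    loss.backward()              # gradient of landing error w.r.t. v0
    opt.step()
print(v0.detach(), simulate(v0).detach())
```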
Item Enhancing Modern Query Federation Systems: Execution Optimization, Performance Prediction, and Systems Assessment (2024) Song, Chujun; Abadi, Daniel; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Modern applications distribute operational data across various storage systems in different locations, simplifying application development but complicating data analytics. The prevalent solution has been to use an ETL (Extract, Transform, Load) process to consolidate data from different locations into a centralized data warehouse for further analytical processing. However, this method is computationally intensive and error-prone, compromises data freshness, and creates a scalability bottleneck for analytical tasks within such a centralized data warehouse. In contrast, query federation offers a promising alternative by allowing direct analysis of data in its original location, thereby bypassing the ETL process and avoiding its drawbacks. However, current query federation systems are still far from optimal. We describe our work on enhancing modern query federation systems in three aspects: systems assessment, execution optimization, and performance prediction. Systems Assessment: The concept of query federation is not new, with early implementations such as Mariposa paving the way. Modern systems, including Presto, Trino, and Spark, have further developed and refined this feature, significantly enhancing its functionality and efficiency. Despite these advancements, best practices for designing and implementing these systems remain largely unexplored. To address this gap, we introduce a benchmark specifically designed to evaluate the effects of various design strategies on desirable features of query federation systems, and we assess five representative systems, each employing different design strategies, against it. This part of the work identifies key bottlenecks in different designs, examines the impact of various query optimization and execution strategies, and explores optimal practices for designing the interface between the execution engine and data sources in query federation systems. Additionally, mitigation strategies for the identified bottlenecks are proposed. Execution Optimization: Among the many design choices in query federation systems, we delve deeper into efficient workload assignment between the query federation system and the data sources, considering that data sources may also possess query processing capabilities. In response to a query, current approaches typically follow one of two paradigms: they either push as much computation as possible down to the underlying system, or they treat the underlying system solely as a storage engine and execute the majority of the query processing work within the query federation system itself. We have observed that these approaches tend to result in CPU underutilization on one side, either within the query federation engine or at the data sources. To tackle this inefficiency, we have developed algorithms capable of adjusting the workload distribution, either statically or dynamically, on both sides to optimize CPU usage and reduce end-to-end query execution latency. Performance Prediction: Accurate and expedient performance estimation before query execution is vital for tasks such as index recommendation, query optimization, query tuning, and cluster management. However, this task continues to pose significant challenges for query federation systems, which typically integrate computation engines and delegate storage management to the underlying system. This architecture often results in unavailable statistics for some tables, rendering impractical many traditional cost estimation methods that rely heavily on detailed statistics. Furthermore, traditional cost estimation methods frequently incur substantial errors, particularly for complex queries involving multiple joins. In contrast, machine learning-based approaches offer an alternative strategy by leveraging learned models for cost prediction, and these have demonstrated superior performance over traditional methods. However, such models are generally evaluated on synthetic workloads, such as TPC-H, within experimental clusters; real industrial workloads introduce numerous challenges to the application of such methods. In this segment, we assess these models on actual industrial workloads from the query federation system deployed at LinkedIn. We also introduce a new multi-task learning approach that better utilizes operator-level statistics to improve the accuracy of model prediction.
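One way to realize a multi-task cost model of this flavor is a shared encoder over per-operator features with one head predicting operator-level latencies and another predicting end-to-end query latency. This is a hedged PyTorch sketch with hypothetical feature names and sizes, not the production model:

```python
import torch
import torch.nn as nn

class MultiTaskCostModel(nn.Module):
    def __init__(self, op_feat_dim: int = 16, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(op_feat_dim, hidden), nn.ReLU())
        self.op_head = nn.Linear(hidden, 1)     # per-operator latency
        self.query_head = nn.Linear(hidden, 1)  # query latency from pooled encoding

    def forward(self, op_feats: torch.Tensor):
        # op_feats: (num_operators, op_feat_dim) for one query plan.
        h = self.encoder(op_feats)
        op_latency = self.op_head(h).squeeze(-1)
        query_latency = self.query_head(h.mean(dim=0)).squeeze(-1)
        return op_latency, query_latency

model = MultiTaskCostModel()
op_feats = torch.randn(5, 16)                   # a hypothetical 5-operator plan
op_lat, q_lat = model(op_feats)
# Operator-level statistics act as auxiliary supervision for the query-level task.
loss = nn.functional.mse_loss(op_lat, torch.rand(5)) + \
       nn.functional.mse_loss(q_lat, torch.rand(()))
loss.backward()
```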
Additionally, we empirically investigate the upper bounds of accuracy achievable by these models.

Item Efficient Optimization Algorithms for Nonconvex Machine Learning Problems (2024) Xian, Wenhan; Huang, Heng; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
In recent years, the success of the AI revolution has led to the training of larger neural networks on vast amounts of data to achieve superior performance. These powerful machine learning models have enabled the creation of remarkable AI products. Optimization, as the core of machine learning, becomes especially crucial because most machine learning problems can ultimately be formulated as optimization problems that require minimizing a loss function with respect to model parameters based on training samples. To enhance the efficiency of optimization algorithms, distributed learning has emerged as a popular solution for addressing large-scale machine learning tasks. In distributed learning, multiple worker nodes collaborate to train a global model. However, a key challenge in distributed learning is the communication cost. This thesis introduces a novel adaptive gradient algorithm with gradient sparsification to address this issue. Another significant challenge in distributed learning is the communication overhead on the central parameter server. To mitigate this bottleneck, decentralized distributed (serverless) learning has been proposed, where each worker node only needs to communicate with its neighbors. This thesis investigates core nonconvex optimization problems in decentralized settings, including constrained optimization, minimax optimization, and second-order optimality, and efficient optimization algorithms are proposed to solve these problems. Additionally, the convergence analysis of minimax optimization under the generalized smoothness condition is explored, and a generalized algorithm is proposed that can be applied to a broader range of applications.
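Gradient sparsification of the kind combined here with adaptive methods typically communicates only the largest-magnitude coordinates and keeps the untransmitted remainder as a local residual (error feedback). A minimal sketch of that primitive, illustrative only and not the proposed algorithm:

```python
import torch

def sparsify_topk(grad: torch.Tensor, residual: torch.Tensor, k: int):
    """Keep the k largest-magnitude entries of (grad + residual); feed the
    untransmitted remainder back into the residual (error feedback)."""
    corrected = grad + residual
    idx = corrected.abs().topk(k).indices
    sparse = torch.zeros_like(corrected)
    sparse[idx] = corrected[idx]
    new_residual = corrected - sparse           # what we did not communicate
    return sparse, new_residual

grad = torch.randn(1000)
residual = torch.zeros(1000)
sparse, residual = sparsify_topk(grad, residual, k=10)  # ~1% of coordinates sent
```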
Item Hybrid-PGAS Memory Hierarchy for Next Generation HPC Systems (2024) Johnson, Richard Bradford; Hollingsworth, Jeffrey K; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Demands on computational performance, power efficiency, data transfer, resource capacity, and resilience for next-generation high performance computing (HPC) systems present a host of new challenges. There is a growing disparity between computational performance and network and storage device throughput, and among the energy costs of computational, memory, and communication operations. Chapel is a powerful, high-level, parallel PGAS language designed to streamline development by addressing code complexity, and it uses a shared-memory model for handling large, distributed-memory systems. I extended the capabilities of Chapel by providing support for persistent memory, with intrinsic and programmatic features, for HPC systems. In my approach I explored the efficacy of persistent memory in a hybrid-PGAS environment through latency-hiding analysis via cache monitoring, identification and mitigation of performance bottlenecks via data-centric analysis, and hardware profiling to assess performance costs versus benefits and energy footprint. To manage persistence and ensure resilience, I developed a transaction system with ACID properties that supports hybrid-PGAS virtual addressing, along with a distributed checkpoint and recovery system.
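As a conceptual miniature of such transaction machinery (the thesis's system targets Chapel and persistent memory; this is a hypothetical pure-Python toy), write-ahead logging shows the atomicity/durability idea: log the intent durably first, apply second, replay on recovery:

```python
import json, os

class ToyWAL:
    """Miniature write-ahead log: updates are logged and flushed to disk before
    being applied, so a crash mid-transaction can be replayed or discarded."""
    def __init__(self, path: str = "wal.log"):
        self.path, self.store = path, {}

    def commit(self, updates: dict):
        with open(self.path, "a") as log:       # 1. durably log the intent
            log.write(json.dumps(updates) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.store.update(updates)              # 2. apply only after logging

    def recover(self):
        """Replay the log after a crash; committed records are re-applied."""
        if os.path.exists(self.path):
            with open(self.path) as log:
                for line in log:
                    self.store.update(json.loads(line))

db = ToyWAL()
db.commit({"x": 1, "y": 2})
```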
Item Supervision and Data Dynamics in Vision Across Recognition and Generation Landscapes (2024) Suri, Saksham; Shrivastava, Abhinav; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
This thesis looks at visual perception through the lens of supervision and data dynamics across recognition and generation landscapes. Generative and discriminative modeling form important pillars of computer vision, and depending on the task, the techniques for learning from and utilizing data and labels change. Through this work we investigate different tasks along this landscape, focusing on different supervision strategies, highlighting pitfalls in current approaches, and proposing modified architectures and losses to utilize the data better under different settings. On the recognition side, we start with a comprehensive analysis of Vision Transformers (ViTs) under varied supervision paradigms: we look at a mix of explicit supervision, contrastive self-supervision, and reconstructive self-supervision by delving into attention mechanisms and learned representations. We then look at a more specific case of supervision geared towards object detection, called sparse supervision, where annotations are missing, and propose self- and semi-supervised techniques to solve this task. Finally, we explore a discovery-style framework with applications to GAN-generated image detection. Unlike the sparse supervision discussed earlier, this scenario handles the case where, at test time, we face an unknown number of new classes. Ours was the first work to propose this problem: instead of just identifying synthetic images, we also group them by their generation source. This exploration of Generative Adversarial Networks (GANs) in an open-world scenario uncovers the intricacies of learning with limited supervision for discovery-style problems. On the generation side, we delve into supervision strategies involving decomposing and decoupling representations. In the first work we tackle the problem of paired image-to-image (I2I) translation by decomposing supervision into reconstruction and residuals, and highlight issues with traditional training approaches. We then look at generating talking-head videos through two different kinds of supervision: video and audio. For driving the generation with video, we look at decoupling representations for the task of few-shot talking-head synthesis, where supervision is provided using only a few samples (shots); for this task we factorize the representation into spatial and style components, which helps learning. To additionally supervise the generation through audio, we look at multimodal supervision for lip-synchronized talking-head generation, incorporating audio and video modalities to synthesize lifelike talking heads that work even in in-the-wild scenarios. In the last part we showcase two works which link our experiences from generation and recognition, exploring generative modeling to improve recognition models. Given the high fidelity and controllability that diffusion-based image generation models have brought, we utilize their synthetic data and create a suitable pipeline to use this data effectively to improve detection and segmentation performance. As a follow-up to our ViT analysis, we also propose a new technique that utilizes off-the-shelf pretrained ViTs and generates high-resolution features using a learned, lightweight feature transform. These high-resolution features are especially effective for dense tasks like correspondence, segmentation, detection, and object discovery.

Item Advancements in Small Area Estimation Using Hierarchical Bayesian Methods and Complex Survey Data (2024) Das, Soumojit; Lahiri, Partha; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
This dissertation addresses critical gaps in the estimation of multidimensional poverty measures for small areas and proposes innovative hierarchical Bayesian estimation techniques for finite population means in small areas. It also explores specialized applications of these methods for survey response variables with multiple categories. The dissertation presents a comprehensive review of relevant literature and methodologies, highlighting the importance of accurate estimation for evidence-based policymaking. In Chapter 2, the focus is on the estimation of multidimensional poverty measures for small areas, filling an essential research gap. Using Bayesian methods, the dissertation demonstrates how multidimensional poverty rates and the relative contributions of different dimensions can be estimated for small areas. The proposed approach can be extended to various definitions of multidimensional poverty, including counting or fuzzy set methods. Chapter 3 introduces a novel hierarchical Bayesian estimation procedure for finite population means in small areas, integrating primary survey data with diverse sources, including social media data. The approach incorporates sample weights and factors influencing the outcome variable to reduce sampling informativeness. It demonstrates reduced sensitivity to model misspecification and diminished reliance on assumed models, making it versatile for various estimation challenges. In Chapter 4, the dissertation explores specialized applications for survey response variables with multiple categories, addressing the impact of biased or informative sampling on assumed models. It proposes methods for accommodating survey weights seamlessly within the modeling and estimation processes, conducting a comparative analysis with Multilevel Regression with Poststratification (MRP). The dissertation concludes by summarizing key findings and contributions from each chapter, emphasizing implications for evidence-based policymaking and outlining future research directions.
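Hierarchical Bayesian small area estimation is commonly built on area-level models of the following generic form. This is a textbook Fay-Herriot-style illustration; the dissertation's models differ and further incorporate survey weights:

```latex
% Generic area-level model: y_i is the direct survey estimate for small
% area i, with known sampling variance D_i (illustrative only).
\begin{align*}
  y_i \mid \theta_i &\sim \mathcal{N}(\theta_i,\, D_i)
      && \text{(sampling model, } D_i \text{ known)} \\
  \theta_i &= x_i^\top \beta + u_i, \qquad u_i \sim \mathcal{N}(0, \sigma_u^2)
      && \text{(linking model with covariates } x_i\text{)} \\
  \beta,\ \sigma_u^2 &\sim \pi(\beta)\,\pi(\sigma_u^2)
      && \text{(priors)}
\end{align*}
```

Posterior inference on the small-area means θ_i then borrows strength across areas through the linking model, which is what makes reliable estimates possible where direct survey samples are tiny.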
Item Understanding and Enhancing Machine Learning Models with Theoretical Foundations (2024) Hu, Zhengmian; Huang, Heng; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Machine learning has become a key driver of many contemporary technological advancements. With its empirical success, there is an urgent need for theoretical research to explain and complement these practical achievements. This includes understanding the empirical success of machine learning, especially deep learning, and aiding the design of better algorithms in terms of performance, efficiency, and security. This dissertation aims to advance the understanding and practical development of machine learning through three interrelated research directions, emphasizing reliable theoretical guarantees throughout. In the first part, we study deep learning theory under overparameterization conditions. The core objects of study are the Conjugate Kernel and the Neural Tangent Kernel, which have deep connections to the training dynamics of deep learning. Based on the analysis of these kernels, we prove several new concentration results characterizing the trainability and generalization of infinitely wide neural networks. In the second part, we focus on training algorithms. On one hand, we propose new algorithms to improve learning efficiency, including a new underdamped Langevin MCMC method called ALUM, whose complexity we prove reaches the theoretical lower bound. On the other hand, we propose new theoretical tools to analyze existing algorithms and obtain tighter convergence results. For ProxSkip, our analysis shows it can still achieve an improvement in communication complexity, from sublinear to linear convergence, under a stochastic oracle. We also generalize the concept of Lipschitz smoothness for tighter nonconvex optimization analysis. In the third part, we develop new Monte Carlo methods for large language models (LLMs) to improve their efficiency and security. We develop unbiased watermarking techniques to protect model outputs and propose an Accelerated Speculative Sampling method for faster inference. We also investigate the trade-off between watermark strength and inference sampling efficiency, pointing out the conflict between the two.
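For a finite network, the empirical Neural Tangent Kernel studied in such analyses can be computed directly as inner products of parameter gradients, K(x, x') = ⟨∇_θ f(x), ∇_θ f(x')⟩. A small PyTorch sketch on a toy MLP:

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))

def param_grad(x: torch.Tensor) -> torch.Tensor:
    """Flattened gradient of the scalar network output w.r.t. all parameters."""
    net.zero_grad()
    net(x).squeeze().backward()
    return torch.cat([p.grad.flatten() for p in net.parameters()])

# Empirical NTK entry: K(x, x') = <df(x)/dtheta, df(x')/dtheta>.
xs = [torch.randn(4) for _ in range(3)]
grads = torch.stack([param_grad(x) for x in xs])
ntk = grads @ grads.T          # 3x3 kernel matrix over the three inputs
print(ntk)
```

In the infinite-width limit this matrix concentrates around a deterministic kernel, which is what makes concentration results of the kind proved here possible.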
Item Improving and validating computational algorithms for the assembly, clustering, and taxonomic classification of microbial communities (2024) Luan, Tu; Pop, Mihai; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Recent high-throughput sequencing technologies have advanced the study of microbial communities; nonetheless, analyzing the resulting large datasets still poses challenges. This dissertation focuses on developing and validating computational algorithms to address these challenges in the assembly, clustering, and taxonomic classification of microbial communities. We first introduce a novel reference-guided metagenomic assembly approach that delivers high-quality assemblies, generally outperforming de novo assembly without a significant increase in runtime. Next, we propose SCRAPT, an iterative sampling-based algorithm designed to cluster 16S rRNA gene sequences from large datasets efficiently. In addition, we validate a comprehensive set of genome assembly pipelines using Oxford Nanopore sequencing, achieving near-perfect accuracy through the combination of long- and short-read polishing tools. Our research improves the accuracy and efficiency of analyzing complex microbial communities. This dissertation offers insights into the composition and structures of these communities, with potential implications for human, animal, and plant health.
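The sample-cluster-recruit-iterate idea behind SCRAPT can be sketched generically. This is a toy Python sketch with an illustrative similarity function and threshold, not SCRAPT's actual implementation:

```python
import random
from difflib import SequenceMatcher

def similar(a: str, b: str, t: float = 0.97) -> bool:
    # Toy stand-in for sequence identity; real tools use alignment-based identity.
    return SequenceMatcher(None, a, b).ratio() >= t

def iterative_cluster(seqs: list, sample_size: int = 100) -> dict:
    """Sample a small batch, cluster it greedily to pick centers, recruit the
    whole remaining dataset to those centers, then iterate on what is left."""
    remaining, clusters = list(seqs), {}
    while remaining:
        batch = random.sample(remaining, min(sample_size, len(remaining)))
        centers = []
        for s in batch:                     # greedy clustering of the sample
            if not any(similar(s, c) for c in centers):
                centers.append(s)
                clusters[s] = []
        leftover = []
        for s in remaining:                 # recruit everything to the new centers
            hit = next((c for c in centers if similar(s, c)), None)
            if hit is not None:
                clusters[hit].append(s)
            else:
                leftover.append(s)
        remaining = leftover                # iterate on unrecruited sequences
    return clusters
```

Because each round only clusters a small sample and recruits the rest in bulk, the expensive all-pairs comparisons are confined to the sample, which is the source of the speedup over clustering the full dataset directly.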
Item Enhanced Robot Planning and Perception Through Environment Prediction (2024) Sharma, Vishnu Dutt; Tokekar, Pratap; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Mobile robots rely on maps to navigate through an environment. In the absence of any map, the robots must build one online from partial observations as they move through the environment. Traditional methods build a map using only direct observations. In contrast, humans identify patterns in the observed environment and make informed guesses about what to expect ahead. Modeling these patterns explicitly is difficult due to the complexity of environments, but such complex models can be approximated well using learning-based methods in conjunction with large amounts of training data. By extracting patterns, robots can use not only direct observations but also predictions of what lies ahead to better navigate through an unknown environment. In this dissertation, we present several learning-based methods to equip mobile robots with prediction capabilities for efficient and safer operation. In the first part of the dissertation, we learn to predict using geometrical and structural patterns in the environment. Partially observed maps provide invaluable cues for accurately predicting the unobserved areas. We first demonstrate the capability of general learning-based approaches to model these patterns for a variety of overhead map modalities. We then employ task-specific learning for faster navigation in indoor environments by predicting 2D occupancy in nearby regions. This idea is further extended to 3D point cloud representations for object reconstruction. By predicting the shape of the full object from only partial views, our approach paves the way for efficient next-best-view planning, which is a crucial requirement for energy-constrained aerial robots. Deploying a team of robots can also accelerate mapping; our algorithms benefit from this setup, as more observations result in more accurate predictions, further improving efficiency in the aforementioned tasks. In the second part of the dissertation, we learn to predict using spatiotemporal patterns in the environment. We focus on dynamic tasks such as target tracking and coverage, where we seek decentralized coordination between robots. We first show how graph neural networks can be used for more scalable and faster inference while achieving coverage performance comparable to classical approaches. We find that differentiable design is instrumental here for end-to-end task-oriented learning. Building on this, we present a differentiable decision-making framework that consists of a differentiable decentralized planner and a differentiable perception module for dynamic tracking. In the third part of the dissertation, we show how to harness semantic patterns in the environment. Adding semantic context to the observations can help robots decipher the relations between objects and infer what may happen next based on the activity around them. We present a pipeline using vision-language models to capture a wider scene with an overhead camera and provide assistance to humans and robots in the scene. We use this setup to implement an assistive robot that helps humans with daily tasks, and then present a semantic-communication-based collaborative setup of overhead and ground agents, highlighting the embodiment-specific challenges they may encounter and how they can be overcome. The first three parts employ learning-based methods for predicting the environment; if the predictions are incorrect, however, this could pose a risk to the robot and its surroundings. The final part of the dissertation therefore presents risk-management methods that meta-reason over the predictions. We study two such methods: one extracting uncertainty from the prediction model for risk-aware planning, and another using a heuristic to adaptively switch between classical and prediction-based planning, resulting in safe and efficient robot navigation.

Item Planning and Perception for Unmanned Aerial Vehicles in Object and Environmental Monitoring (2024) Dhami, Harnaik; Tokekar, Pratap; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Unmanned aerial vehicles (UAVs) equipped with high-resolution sensors are enabling data collection from previously inaccessible locations on a remarkable spatio-temporal scale. These systems hold immense promise for revolutionizing fields such as precision agriculture and infrastructure inspection, where access to data is important. To fully exploit their potential, the development of autonomy algorithms geared toward planning and perception is critical. In this dissertation, we develop planning and perception algorithms for UAVs used for data collection in monitoring applications. In the first part of this dissertation, we study problems of object monitoring and the planning challenges that arise with them. Object monitoring refers to the continuous observation, tracking, and analysis of specific objects within an environment. We start with the problem of visual reconstruction, where the planner must maximize visual coverage of a specific object in an unknown environment while minimizing time and cost; the goal is to gain as much information about the object as quickly as possible. By utilizing shape-prediction deep learning models, we leverage predicted geometry for efficient planning, and we further extend this approach to a multi-UAV system. With a reconstructed 3D digital model, efficient paths around an object can be created for close-up inspection. However, the purpose of inspection is to detect changes in the object. The second problem we study is inspecting an object when it has changed or when no prior information about it is known, which we study in the context of infrastructure inspection. We validate our planning algorithm through real-world experiments and high-fidelity simulations, and further integrate defect detection into the process. In the second part, we study planning for monitoring entire environments rather than specific objects. Unlike object monitoring, here we are interested in environmental monitoring of spatio-temporal processes. The goal of a planner for environmental monitoring is to maximize coverage of an area to understand the spatio-temporal changes in the environment. We study this problem in slow-changing and fast-changing environments, specifically in the contexts of vegetative growth estimation and wildfire management. For fast-changing wildfire environments, we utilize informative path planning for wildfire validation and localization. Our work also leverages long short-term memory (LSTM) networks for early fire detection.
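A generic sketch of an LSTM classifier over per-frame features for early fire detection, with hypothetical feature dimensions, not the dissertation's model:

```python
import torch
import torch.nn as nn

class FireDetector(nn.Module):
    """LSTM over a sequence of per-frame features; outputs a fire probability."""
    def __init__(self, feat_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim); using the last hidden state lets the
        # model flag fire as soon as enough temporal evidence has accumulated.
        _, (h_n, _) = self.lstm(frames)
        return torch.sigmoid(self.head(h_n[-1])).squeeze(-1)

detector = FireDetector()
clip = torch.randn(2, 10, 32)          # 2 clips, 10 frames each
print(detector(clip))                  # fire probability per clip
```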
Item SIMULATION, REPRESENTATION, AND AUTOMATION: HUMAN-CENTERED ARTIFICIAL INTELLIGENCE FOR AUGMENTING VISUALIZATION DESIGN (2024) Shin, Sungbok; Elmqvist, Niklas; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Data visualization is a powerful strategy for using graphics to represent data for effective communication and analysis. Unfortunately, creating effective data visualizations is a challenge for both novice and expert designers. The task often involves an iterative process of trial and error, which, by its nature, is time-consuming. Designers frequently seek feedback to ensure their visualizations convey the intended message clearly to their target audience. However, obtaining feedback from peers can be challenging, and alternatives like user studies or crowdsourcing are costly and time-consuming. This suggests the potential for a tool that can provide design feedback for visualizations. To that end, I create a virtual, human-vision-inspired system that examines a visualization design and provides feedback on it using various AI techniques. The goal is not to replicate an exact version of the human eye; instead, my work aims to develop a practical and effective system that delivers design feedback to visualization designers, utilizing advanced AI techniques such as deep neural networks (DNNs) and large language models (LLMs). My thesis includes three distinct works, each aimed at developing such a virtual system; they focus on simulation, representation, and automation, respectively, collectively progressing toward this aim. First, I develop a methodology to simulate human perception in machines through a virtual eye tracker named A SCANNER DEEPLY, which involves gathering eye-gaze data on chart images and training a DNN on them. Second, I focus on effectively and pragmatically representing a virtual human-vision-inspired system by creating PERCEPTUAL PAT, which includes a suite of perceptually based filters. Third, I automate the feedback-generation process with VISUALIZATIONARY, leveraging large language models. I report on challenges and lessons learned about the key components and design considerations that help visualization designers. Finally, I end the dissertation by discussing future research directions for using AI to augment the visualization design process.

Item Efficient Rendering, Display, and Compression Techniques for Virtual and Augmented Reality (2024) Jabbireddy, Susmija; Varshney, Amitabh; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Virtual and augmented reality (VR and AR) are bridging the gap between the physical and virtual worlds. The ultimate goal of VR and AR technology is to present three-dimensional (3D) images at high frame rates for realistic, immersive, and interactive viewing experiences. As the demand for higher resolution in VR and AR devices increases, the computational complexity and the data requirements also increase. This puts a burden on critical underlying resources, such as memory, processing time, and energy consumption, which are essential for storing, rendering, processing, and displaying information. To address these challenges, this research explores methods that harness the inherent structure and redundancy present in the data. By focusing on three key areas - rendering, displays, and compression - this dissertation aims to enable efficient AR/VR systems that enhance resource utilization without compromising the user experience. First, we focus on developing computationally efficient rendering techniques, beginning with a discussion of various foveated rendering approaches. With the advent of real-time eye-tracking systems and increases in the resolution and field of view of modern AR and VR headsets, foveated rendering becomes crucial for achieving real-time frame rates. We review the current state of the field and provide a taxonomy of foveation techniques that can serve as a guide for developing foveated rendering methods. Then, we investigate methods to improve image quality from sparse Monte Carlo samples in volumetric rendering. Monte Carlo path tracing has the potential to create stunning visualizations of volumetric data; however, the computational cost of achieving noise-free images is extremely high due to the large number of samples required per pixel. We show how deep-learning-based denoising techniques can be integrated with Monte Carlo volumetric rendering to achieve high-quality images at interactive rates. Next, we present our research towards developing energy-efficient holographic displays. Holographic displays are considered true 3D displays, with the potential to emulate all the depth cues of human vision. The nanophotonic phased array (NPA) is an emerging technology for holographic displays with compact size and very high refresh rates; however, building a large-scale NPA is limited by significant power consumption and circuit complexity. We present algorithms to generate sparse holographic patterns and show that we can produce high-resolution images with high visual quality using as few as 10% of the display elements. Finally, we explore techniques for efficient compression of multi-view images. As the quantity of 3D data being acquired and processed continues to expand, storage demands and data transmission challenges grow with it. We present a deep-learning-based multi-view image compression framework that integrates a novel view-aware entropy model with recent advancements in single-view image compression. By achieving superior compression performance, our approach facilitates more efficient utilization of memory resources when dealing with multi-view data.
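The heart of foveated rendering, spending samples at the gaze point and progressively fewer toward the periphery, fits in a few lines. This is an illustrative NumPy falloff; production systems use perceptually calibrated functions:

```python
import numpy as np

def foveated_sample_mask(h, w, gaze, base_rate=1.0, falloff=2.5):
    """Probability of shading each pixel decays with distance (eccentricity)
    from the gaze point, so the periphery is rendered sparsely."""
    ys, xs = np.mgrid[0:h, 0:w]
    ecc = np.hypot(ys - gaze[0], xs - gaze[1]) / np.hypot(h, w)
    rate = base_rate / (1.0 + falloff * ecc) ** 2   # illustrative falloff curve
    return np.random.rand(h, w) < rate              # True = shade this pixel

mask = foveated_sample_mask(480, 640, gaze=(240, 320))
print(mask.mean())   # fraction of pixels shaded this frame
```

Unshaded pixels are then reconstructed from shaded neighbors or reprojected from previous frames, trading imperceptible peripheral quality for large savings in shading work.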
Item DATA-DRIVEN ALGORITHMS FOR CHARACTERIZING STRUCTURAL VARIATION IN METAGENOMIC DATA (2024) Muralidharan, Harihara Subrahmaniam; Pop, Mihai; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Sequence differences between the strains of bacteria comprising host-associated and environmental microbiota may play a role in community assembly and influence the resilience of microbial communities to disturbances. Tools for characterizing strain-level variation within microbial communities, however, are limited in scope, focusing on just single nucleotide polymorphisms or relying on reference-based analyses that miss complex structural variants. In this thesis, we describe data-driven methods to characterize variation in metagenomic data. In the first part of the thesis, I present our analysis of the structural variants identified from metagenomic whole-genome shotgun sequencing data. I begin by describing the power of assembly graph analysis to detect over 9 million structural variants, such as insertions/deletions, repeat copy-number changes, and mobile elements, in nearly 1,000 metagenomes generated as part of the Human Microbiome Project. Next, I describe Binnacle, a structural-variant-aware binning algorithm. To mitigate the fragmented nature of assemblies, metagenomic binning is performed to cluster contigs that are likely to have originated from the same genome. We show that binning "graph-based" scaffolds, rather than contigs, improves the quality of the bins and captures a broader set of the genes of the genomes being reconstructed. Finally, we present a case study of the microbial mats of Yellowstone National Park. The cyanobacterium Synechococcus is abundant in these mats along a stable temperature gradient from ~50°C to ~65°C and plays a key role in fixing carbon and nitrogen. Previous studies have isolated and generated good-quality reference sequences for two major Synechococcus spp., OS-A and OS-B', which share very high genomic content. Despite the high abundance of the Synechococcus spp., metagenomic assembly of these organisms is challenging due to the large number of rearrangements between them. We explore the genomic diversity of the Synechococcus spp. using reference-genome-reliant assembly and scaffolding. We also highlight that the variants we detect can be used to fingerprint the local biogeography of the hot spring. In the second part of the thesis, I present our analysis of amplicon sequencing data, specifically 16S rRNA gene sequences. I begin by describing SCRAPT (Sample, Cluster, Recruit, AdaPt and iTerate), a fast iterative algorithm for clustering large 16S rRNA gene datasets. We also show that SCRAPT produces operational taxonomic units that are less fragmented than those from popular tools (UCLUST, CD-HIT, and DNACLUST) and runs orders of magnitude faster than existing methods. Finally, we study the impact of transitive annotation on taxonomic classifiers. Taxonomic labels are assigned using machine learning algorithms trained to recognize individual taxonomic groups based on training sequences with known taxonomic labels. Ideally, the training data should rely on experimentally verified, formal taxonomic labels; however, the labels associated with sequences in biological databases are most commonly the result of computational predictions, i.e., "transitive annotation." We demonstrate that even a few computationally generated training data points can significantly skew the output of a classifier, to the point where entire regions of the taxonomic space can be disturbed. We also discuss key factors that affect the resilience of classifiers to transitively annotated training data and propose best practices to avoid the artifacts described in this thesis.

Item OUT OF DISTRIBUTION EVALUATION OF NATURAL LANGUAGE PROCESSING SYSTEMS: GENERALIZATION TO LOW-RESOURCE AND DISTANT LANGUAGES AND HUMAN-AI COLLABORATIVE WRITING (2024) Richburg, Aquia; Carpuat, Marine; Applied Mathematics and Scientific Computation; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Large language models have revolutionized natural language processing with their capabilities in text generation and understanding. Their rich contextual representations, learned from training on diverse text datasets, have led LLMs to be used across a variety of settings. However, this increases the chance of models being used in unintended ways and causing harm to users. This dissertation presents empirical studies of out-of-distribution issues in text generation (machine translation) and text classification (authorship analysis) tasks, examining how LLMs perform in settings distant from their training distributions. In our first work, the goal is to understand the characteristics of the training distribution of LLMs by visualizing the roles of samples during the training of a machine translation model. Our results indicate that sample contributions are not uniform and play complex roles throughout the training process. This highlights the difficulty of describing samples that are representative of the training distribution and motivates thorough evaluation of models in diverse settings. Our second and third works turn to the evaluation of LLMs in out-of-distribution settings to better understand their strengths and limitations for generalization to unseen tasks. We evaluate LLMs on machine translation tasks, focusing on how translation quality is affected by the presence or absence of specific language pairs in the training data. Our findings show that while finetuning improves translation for unseen languages, the impact varies across language pairs, emphasizing the need for further research to enable effective massively multilingual translation with LLMs. In text classification, we explore out-of-distribution generalization for authorship analysis in the context of human-AI collaborative writing. Our studies reveal that traditional AI-detection models underperform when distinguishing human-written from AI-cowritten text, and that simpler n-gram techniques are more robust than LLMs for authorship identification, suggesting the need for adapted authorship analysis tools. In summary, this dissertation advances our understanding of LLM generalization and provides insights for improving the robustness and adaptability of NLP systems.
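A character n-gram authorship baseline of the kind found robust here can be assembled in a few lines, shown with toy placeholder data using scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus: label 1 = AI-cowritten, 0 = human-written (placeholder data).
texts = ["the model wrote this part", "i scribbled this myself",
         "generated continuation text", "my own rough draft notes"]
labels = [1, 0, 1, 0]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),  # character 2-4 grams
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["this section was drafted by the model"]))
```

Character n-grams capture low-level stylistic habits (spacing, punctuation, morphology) that survive distribution shift better than features tuned to a particular generator.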
Item Zero-knowledge Proofs for Programmable Anonymity, Moderation, and Reputation (2024) Rosenberg, Michael Allan; Miers, Ian; Katz, Jonathan; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Anonymous credentials deal with a core tension of privacy-enhancing technologies (PETs): the desire to participate in society versus the desire to remain anonymous. Despite decades of research, however, anonymous credential schemes have not received nearly as much general uptake as other PETs such as end-to-end encryption. This is due, in part, to their high barriers of design and deployment. Many existing anonymous credential schemes are constructed by first fixing notions of identity and of what should be selectively revealed, and then designing towards that goal. This yields just-so schemes built on primitives like Pedersen commitments and blind signatures. While these schemes are often efficient, they typically require an expert redesign when the notion of identity changes, or when the statement to selectively reveal changes (e.g., adding a range proof to a system that previously only permitted equality proofs). It is possible to flip the order of operations, i.e., to design a proof system and then let users program their own notions of identity and what they want to show. Concretely, using modern general-purpose zero-knowledge proof schemes and their deep tooling, it is possible to design extensible solutions to the problems of identity, moderation, and reputation. In this dissertation, I present research which builds novel, extensible, and practical privacy-enhancing technologies from succinct noninteractive zero-knowledge proofs of knowledge (zkSNARKs). These works are: SNARKBlock, a scalable anonymous blocklisting scheme; zk-creds, a construction of anonymous credentials that can be bootstrapped from existing government-issued documents; and zk-promises, a framework for asynchronous anonymous blocklisting and reputation which supports complex notions of reputation.
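A miniature of the commitment structure such schemes build on: a Merkle tree whose membership proofs are, in the real constructions, verified inside a zkSNARK rather than in the clear. This plain-Python skeleton carries no zero knowledge and is for intuition only:

```python
import hashlib

def H(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_root(leaves):
    level = [H(x) for x in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])                   # pad odd levels
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, index):
    """Sibling hashes on the path from a leaf to the root."""
    level, proof = [H(x) for x in leaves], []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        proof.append((level[index ^ 1], index % 2))   # (sibling, am-I-right-child)
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify(leaf, proof, root) -> bool:
    h = H(leaf)
    for sib, is_right in proof:
        h = H(sib + h) if is_right else H(h + sib)
    return h == root

creds = [b"alice", b"bob", b"carol", b"dave"]
root = merkle_root(creds)
assert verify(b"bob", merkle_proof(creds, 1), root)
```

In an anonymous credential or blocklist, the verifier sees only the root and a SNARK attesting "I know a leaf and a valid path," so the prover's identity within the set stays hidden.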
Item ADVANCED TECHNIQUES FOR RECONSTRUCTING OBJECTS AND SCENES FROM VARIATIONS IN LIGHTING AND VIEWPOINT (2024) Lichy, Daniel Jesse; Jacobs, David W; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Capturing the shape and material of objects and scenes is a cornerstone of computer vision research, with significant applications across augmented reality, e-commerce, healthcare, real estate, and robotics. This thesis explores two primary capture methods: multiview stereo (MVS), which leverages varying viewpoints, and photometric stereo (PS), which utilizes changes in lighting. To address some of the limitations inherent in these techniques, we introduce several novel methods. In the first part, we present a user-friendly PS setup requiring only a camera, a flashlight, and optionally a tripod, simple enough for home assembly. To support high-resolution captures from this setup, we introduce RecNet, a novel recursive architecture trained on low-resolution synthetic data yet capable of predicting high-resolution geometry and reflectance. RecNet demonstrably outperforms state-of-the-art PS systems, even with only a few input images. Traditionally, PS assumes that lighting is distant, which is impractical for large objects or those in confined spaces. Building on RecNet, we propose a novel method that integrates per-pixel lighting estimates and recursive depth estimation to address the challenges of near-field lighting, thus broadening PS's applicability. While PS excels at capturing fine details, it often struggles with global geometry, introducing low-frequency distortions that complicate the stitching of multiple views into a complete object. Conversely, MVS captures global geometry effectively but tends to miss finer details. In the second part, we address the so-called multiview photometric stereo (MVPS) problem, which leverages variations in both lighting and viewpoint. Our feedforward architecture, inspired by both MVS and PS techniques, enables geometry reconstruction that matches or exceeds the state of the art in quality while being orders of magnitude faster. In scenarios where adjusting lighting conditions is impractical, such as in large or outdoor scenes, changing viewpoints often proves more feasible, especially when cameras are mounted on mobile platforms like drones or vehicles. Large field-of-view (FoV) cameras are preferable for these expansive scenes, as they enable faster and easier capture. However, adapting MVS models developed for small FoV to large FoV requires significant modifications and traditionally depends on scarce large-FoV training data. In the third part, we introduce novel architectures and data augmentation techniques to train networks on the abundant small-FoV data while allowing them to generalize to large-FoV scenarios. This approach demonstrates strong generalization across both indoor and outdoor datasets, effectively eliminating the need to acquire costly large-FoV-specific datasets for training large-FoV MVS models. Through these contributions, we aim to streamline and enhance the capture of shape and material, making it faster and more practical for a broad range of users, from casual hobbyists to industrial systems.
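The classical distant-light Lambertian PS baseline that these methods generalize recovers a per-pixel normal and albedo by least squares from intensities under known lights. A NumPy sketch on synthetic data, ignoring shadows and interreflections:

```python
import numpy as np

# Lambertian image formation under distant lights: I = albedo * (L @ n),
# where each row of L is a known unit light direction.
rng = np.random.default_rng(0)
L = rng.normal(size=(6, 3))                    # 6 known light directions
L /= np.linalg.norm(L, axis=1, keepdims=True)

n_true = np.array([0.2, 0.3, 0.93])
n_true /= np.linalg.norm(n_true)
albedo_true = 0.8
I = (L @ n_true) * albedo_true                 # synthetic intensities for one pixel

g, *_ = np.linalg.lstsq(L, I, rcond=None)      # g = albedo * normal
albedo = np.linalg.norm(g)
normal = g / albedo
print(normal, albedo)                          # recovers n_true and albedo_true
```

With three or more non-coplanar lights the system is determined; the thesis's contributions lie in relaxing exactly the assumptions this baseline makes (distant lighting, many images, known lights).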
Item AI Empowered Music Education (2024) Shrestha, Snehesh; Aloimonos, Yiannis; Fermüller, Cornelia; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Learning a musical instrument is a complex process involving years of practice and feedback. However, dropout rates in music programs, particularly among violin students, remain high due to socio-economic barriers and the challenge of mastering the instrument. This work explores the feasibility of accelerating learning and leveraging technology in music education, with a focus on bowed string instruments, specifically the violin. My research identifies workflow gaps and challenges for the stakeholders, aiming not only to improve learning outcomes but also to provide opportunities for socioeconomically challenged students. Three key areas are emphasized: designing user studies and creating a comprehensive violin dataset, developing tools and deep learning algorithms for accurate performance assessment, and crafting a practice platform for student feedback. Three fundamental perspectives were essential: a) understanding the stakeholders and their specific challenges, b) understanding how the instrument operates and what actions the player must master to control its functions, and c) addressing the technical challenges associated with constructing and implementing detection and feedback systems. Existing datasets were inadequate for analyzing violin playing, primarily due to their lack of diversity in body types and skill levels, as well as the absence of well-synchronized and calibrated video data with corresponding ground-truth 3D poses and musical events. Our experiment design ensured that the collected data would be suitable for downstream tasks, and these considerations played a significant role in determining the metrics used to evaluate the accuracy of the data and the success metrics for the subsequent tasks. At the foundation of movement analysis lies 3D human pose estimation. Unfortunately, current state-of-the-art algorithms face challenges in accurately estimating monocular 3D poses during instrument playing; these challenges arise from factors such as occlusions, partial views, human-object interactions, limited viewing angles, pixel density, and camera sampling rates. To address these issues, we developed a novel 3D pose estimation algorithm based on the insight that the music produced by the violin is a direct result of the corresponding motions. Our algorithm integrates visual observations with audio inputs to generate precise, high-resolution 3D pose estimates that are temporally consistent and conducive to downstream tasks. Providing effective feedback to learners is a nuanced process that requires balancing encouragement with challenge. Without a user-friendly interface and a motivational strategy, feedback runs the risk of being counterproductive. While current systems excel at detecting pitch and temporal misalignments and visually displaying them for analysis, they often overwhelm players. In this dissertation, we introduce two novel feedback systems. The first is a visual-haptic feedback system that overlays simple augmented cues on the user's body, gently guiding them back to the correct posture. The second is a haptic band synchronized with the music, enhancing students' perception of rhythmic timing and bowing intensities. Additionally, we developed an intuitive user interface for real-time feedback during practice sessions and performance reviews; this data can be shared with teachers for deeper insight into students' struggles and to track progress. This research aims to empower both students and teachers. By providing students with feedback during individual practice sessions and equipping teachers with tools to monitor and tailor AI interventions according to their preferences, this work serves as a valuable teaching assistant. By addressing tasks that teachers may not prefer or be able to perform, such as personalized feedback and progress tracking, this research endeavors to democratize access to high-quality music education and mitigate dropout rates in music programs.

Item Approximate Nearest Neighbor Search with Filters (2024) Landrum, Benjamin Thomas; Dhulipala, Laxman; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Approximate nearest neighbor search (ANNS) on high-dimensional vectors is a fundamental primitive used widely for search over neural embeddings of unstructured data. Prior work on ANNS has produced indices which provide fast and accurate search on datasets of up to billions of points, but which are not well suited to queries restricted to some subset of the original dataset. Filtered ANNS is a formulation of the problem which attaches metadata to points in the dataset and uses it to filter points at query time. This setting requires indexing a dataset in a metadata-aware way to support filtered queries. Filtered ANNS is important for applications such as product and image search, and is necessary to give recently popular "vector databases" functionality similar to that of more traditional tabular databases. This work concerns two versions of the filtered ANNS problem. The most popular formulation in prior work associates points with boolean metadata in the form of labels and filters queries using a boolean predicate on these labels. In this setting, we present a novel index with state-of-the-art performance for queries whose filters require either one label or both of a pair of labels; it won a track of a large benchmarking competition focused on the problem. We also introduce a novel formulation of filtered ANNS called "window-filtered" ANNS, in which points are associated with a continuous metadata value (in practical use, a timestamp, a measure of popularity, etc.), and queries are filtered to a range of metadata values. In addition to describing the problem, we present a practical and theoretically motivated index which handily outperforms baselines.
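As a reference point, the window-filtered problem itself is easy to state as an exact brute-force search, which a practical index must approximate and accelerate (a NumPy sketch):

```python
import numpy as np

def window_filtered_knn(base, meta, query, lo, hi, k=5):
    """Exact reference: keep points whose metadata lies in [lo, hi], then
    return the k nearest by Euclidean distance. Indexes accelerate this."""
    mask = (meta >= lo) & (meta <= hi)
    cand = np.flatnonzero(mask)
    d = np.linalg.norm(base[cand] - query, axis=1)
    return cand[np.argsort(d)[:k]]

rng = np.random.default_rng(0)
base = rng.normal(size=(10_000, 64))   # dataset vectors
meta = rng.uniform(size=10_000)        # e.g., timestamps in [0, 1]
query = rng.normal(size=64)
print(window_filtered_knn(base, meta, query, lo=0.2, hi=0.4))
```

The challenge the index addresses is doing this sublinearly for arbitrary query windows, without building a separate graph or partition for every possible range.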