Theses and Dissertations from UMD
Permanent URI for this community: http://hdl.handle.net/1903/2
New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be up to a four-month delay before a given thesis/dissertation appears in DRUM.
More information is available at Theses and Dissertations at University of Maryland Libraries.
Search Results
73 results
Item: Inertially Constrained Ruled Surfaces for Visual Odometry (2024)
Zhu, Chenqi; Aloimonos, Yiannis; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
In computer vision, camera egomotion is typically solved with visual odometry techniques that rely on feature extraction from a sequence of images and computation of the optical flow. This, however, often requires point-to-point correspondences between consecutive frames, which can be costly to compute and whose varying accuracy greatly affects the quality of the estimated motion. Attempts have been made to bypass the difficulties originating from the correspondence problem by adopting line features and fusing in other sensors (event cameras, IMUs), but many of these approaches still rely heavily on feature detectors. If the camera observes a straight line as it moves, the image of that line sweeps out a surface; this is a ruled surface, and analyzing its shape gives information about the egomotion. This research presents a novel algorithm to estimate 3D camera egomotion from scenes represented by ruled surfaces. By constraining the egomotion with inertial measurements from an onboard IMU sensor, the dimensionality of the solution space is greatly reduced.

Item: Interpreting Visual Representations and Mitigating their Failures (2024)
Kalibhat, Neha; Feizi, Soheil; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Deep learning has become the cornerstone of artificial intelligence (AI), particularly in the language and computer vision domains. The progression in this field is reflected in numerous applications accessible to the general public, such as information retrieval via virtual assistants, content generation, autonomous vehicles, drug discovery, and medical imaging. This unprecedented rate of AI adoption raises the critical need for research on the fundamental underpinnings of deep neural networks, to understand what leads to their decisions and why they fail. This thesis concentrates on self-supervised representation learning, a prevalent unsupervised method employed by foundation models to extract patterns from extensive visual data. Specifically, our focus lies in examining the low-dimensional representations generated by these models and dissecting their failure modes. In our initial investigation, we discover that self-supervised representations lack robustness to domain shifts, as they are not explicitly trained to distinguish image content from its domain. We remedy this issue by proposing a module that can be plugged into existing self-supervised baselines to disentangle their representation spaces and promote domain invariance and generalization. Our subsequent analysis delves into the patterns within representations that influence downstream classification. We scrutinize the discriminative capacity of individual features and their activations. We then propose an unsupervised quality metric that can preemptively determine, with high precision, whether a given representation will be correctly or incorrectly classified. In the next segment of this thesis, we leverage our findings to further demystify the representation space by uncovering interpretable subspaces which have unique concepts associated with them. We design a novel explainability framework that uses a vision-language model (such as CLIP) to provide natural language explanations for neural features (or groups) of a given pre-trained model.
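As an illustration of the kind of concept matching such a framework relies on, here is a minimal sketch, assuming the feature group's top-activating images and a pool of candidate concept phrases have already been embedded with CLIP's image and text encoders; the tensors and concept list are hypothetical placeholders, not the thesis's actual pipeline.

```python
import torch
import torch.nn.functional as F

def explain_feature_group(image_embeds: torch.Tensor,
                          text_embeds: torch.Tensor,
                          concepts: list,
                          top_k: int = 3) -> list:
    """Pick the candidate concepts whose CLIP text embeddings best match
    the CLIP image embeddings of a feature group's top-activating images.

    image_embeds: (N, D) embeddings of the images that most strongly
                  activate the feature group (assumed precomputed).
    text_embeds:  (C, D) embeddings of candidate concept phrases.
    concepts:     the C candidate phrases, aligned with text_embeds.
    """
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Average cosine similarity of each concept to the activating images.
    scores = image_embeds @ text_embeds.T        # (N, C)
    mean_scores = scores.mean(dim=0)             # (C,)
    best = mean_scores.topk(top_k).indices
    return [concepts[int(i)] for i in best]

# Hypothetical usage with random stand-ins for the CLIP embeddings.
concepts = ["striped texture", "animal face", "blue sky", "wheel"]
img = torch.randn(16, 512)   # stand-in for CLIP image embeddings
txt = torch.randn(4, 512)    # stand-in for CLIP text embeddings
print(explain_feature_group(img, txt, concepts))
```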
We next investigate the role of augmentations and format transformations in learning generalizable visual representations. Drawing inspiration from advancements in the audio and speech modalities, we examine how presenting visual data in multiple formats affects learning, separating this from the impact of augmentations. In the final segment, we reveal compositionality as a notable failure mode in current state-of-the-art representation methods. We critique the use of fixed-size patches in vision transformers and demonstrate the benefits of employing semantically meaningful patches based on visual priors. This design adjustment leads to significant improvements in image-text retrieval tasks and, more importantly, enhances performance on compositionality benchmarks.

Item: Object-Attribute Compositionality for Visual Understanding (2024)
Saini, Nirat; Shrivastava, Abhinav; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Object appearances evolve over time, which results in visually discernible changes in their colors, shapes, sizes, and materials. Humans are innately good at recognizing and understanding the evolution of object states, which is also crucial for visual understanding across images and videos. However, current vision models still struggle to capture and account for these subtle changes when recognizing objects and the underlying actions causing the changes. This thesis focuses on using compositional learning for the recognition and generation of attribute-object pairs. In the first part, we propose to disentangle visual features for objects and attributes, to generalize recognition to novel object-attribute pairs. Next, we extend this approach to learn entirely unseen attribute-object pairs by using semantic language priors, label smoothing, and propagation techniques. Further, we use object states for action recognition in videos, where subtle changes in object attributes and affordances help in identifying state-modifying and context-transforming actions. All of these methods for decomposing and composing objects and states generalize to unseen pairs and out-of-domain datasets for various compositional zero-shot learning and action recognition tasks. In the second part, we propose a new benchmark suite, Chop & Learn, for the novel task of Compositional Image Generation, and discuss the implications of these approaches for other compositional tasks in images, videos, and beyond. We further extend the insertion and editing of object attributes consistently across frames of videos using an off-the-shelf, training-free architecture, and discuss the future challenges and opportunities of compositionality for visual understanding.
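To make the compositional recognition setting concrete, below is a minimal sketch of a common compositional zero-shot learning formulation, in which attribute and object embeddings are composed and scored against an image feature; the embedding sizes, composition MLP, and pair indices are illustrative assumptions, not the specific models proposed in this thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairScorer(nn.Module):
    """Score an image feature against composed attribute-object embeddings."""
    def __init__(self, num_attrs: int, num_objs: int, dim: int = 512):
        super().__init__()
        self.attr_emb = nn.Embedding(num_attrs, dim)
        self.obj_emb = nn.Embedding(num_objs, dim)
        # Small MLP that composes an (attribute, object) pair into one vector.
        self.compose = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, img_feat: torch.Tensor, pairs: torch.Tensor) -> torch.Tensor:
        """img_feat: (B, dim) features from any visual backbone.
        pairs: (P, 2) attribute/object index pairs, which may include unseen pairs."""
        a = self.attr_emb(pairs[:, 0])                        # (P, dim)
        o = self.obj_emb(pairs[:, 1])                         # (P, dim)
        pair_vec = self.compose(torch.cat([a, o], dim=-1))    # (P, dim)
        img = F.normalize(img_feat, dim=-1)
        pair_vec = F.normalize(pair_vec, dim=-1)
        return img @ pair_vec.T                               # (B, P) compatibility scores

# Illustrative usage: 5 attributes, 4 objects, scoring 3 candidate pairs.
scorer = PairScorer(num_attrs=5, num_objs=4)
scores = scorer(torch.randn(2, 512), torch.tensor([[0, 1], [3, 2], [4, 0]]))
print(scores.shape)  # torch.Size([2, 3])
```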
Item: Feedback for Vision (2024)
Maynord, Michael; Aloimonos, Yiannis; Fermüller, Cornelia; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Feedback plays a prominent role in biological vision, where perception is modulated based on agents' evolving expectations and world models. This is the case both in visually understanding the static structure of the world and in modeling the dynamic structure of action. In this thesis we present, first, an approach to incorporating controlled feedback into image understanding; second, an adaptation of this approach to action understanding; and lastly, a notion of feedback in video monitoring.

First, we introduce a novel mechanism which modulates perception based on high-level categorical expectations: Mid-Vision Feedback (MVF). MVF associates high-level contexts with linear transformations. When a context is "expected", its associated linear transformation is applied over feature vectors in a mid level of a network. The result is that mid-level network representations are biased towards conformance with high-level expectations, improving overall accuracy and contextual consistency. Additionally, during training, mid-level feature vectors are biased through the introduction of a loss term which increases the distance between feature vectors associated with different contexts. MVF is agnostic as to the source of contextual expectations, and can serve as a mechanism for top-down integration of symbolic systems with deep vision architectures. We demonstrate the utility of MVF for object classification across three popular datasets and multiple architectures, including both Convolutional Neural Network architectures and a Transformer architecture. We adapt MVF for action understanding with Sub-Action Modulation (SAM) for Video Networks. When humans interpret action, they bring high-level expectations of the context in which those actions are being performed. Following this line of thinking, we develop an approach to incorporating context into action understanding. Video segments are classified uniquely into a small set of action primitives (called Therbligs), which are grouped hierarchically into "Meta-Therbligs" as a context representation. SAM is an approach to first modeling Meta-Therbligs, and then incorporating the expectation of Meta-Therbligs into mid-level processes through feedback. This allows the modulation of mid-level features in accordance with a temporally compositional representation of context. We show the superior performance of MVF over post-hoc filtering for the incorporation of contextual knowledge, and the superior performance of configurations using predicted context (when no context is known a priori) over configurations with no context awareness. We demonstrate the utility of SAM over four popular video understanding architectures: I3D, MoViNet, TimeSFormer, and ViViT. Experiments over EPIC Kitchens and 50 Salads on the tasks of action recognition and anticipation demonstrate that SAM produces superior accuracies across all models, tasks, and datasets with minimal architectural alterations.
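As a rough illustration of the feedback idea behind MVF, the following sketch applies a per-context learned linear transformation to mid-level features and adds a simple separation loss that pushes apart features from different contexts; the module shapes, margin, and training details are assumptions for illustration rather than the architecture used in the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MidVisionFeedback(nn.Module):
    """Context-conditioned linear transforms over mid-level feature vectors."""
    def __init__(self, num_contexts: int, feat_dim: int):
        super().__init__()
        # One learned linear map per high-level context.
        self.transforms = nn.ModuleList(
            [nn.Linear(feat_dim, feat_dim) for _ in range(num_contexts)]
        )

    def forward(self, feats: torch.Tensor, context_id: int) -> torch.Tensor:
        """feats: (B, D) mid-level features; context_id: the expected context."""
        return self.transforms[context_id](feats)

def context_separation_loss(feats: torch.Tensor, context_labels: torch.Tensor,
                            margin: float = 1.0) -> torch.Tensor:
    """Encourage features from different contexts to be far apart (hinge on
    pairwise distances); a simple stand-in for the training-time bias term."""
    dists = torch.cdist(feats, feats)                             # (B, B)
    diff_ctx = (context_labels[:, None] != context_labels[None, :]).float()
    # Penalize pairs from different contexts that are closer than the margin.
    return (diff_ctx * F.relu(margin - dists)).sum() / (diff_ctx.sum() + 1e-6)

# Illustrative usage on random features with two contexts.
mvf = MidVisionFeedback(num_contexts=2, feat_dim=256)
feats = torch.randn(8, 256)
biased = mvf(feats, context_id=1)                 # apply the "expected" context
loss = context_separation_loss(feats, torch.tensor([0, 0, 0, 0, 1, 1, 1, 1]))
print(biased.shape, loss.item())
```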
Lastly, we consider a notion of "feedback" where high-level expectations, or specifications, are provided by human operators, allowing the integration of humans into the perceptual loop. This is important for interfacing with humans, as perceptual tasks which are conventionally left entirely to human labor are increasingly (yet, thus far, imperfectly) automated. We consider the task of surveillance. Security watchstanders who monitor multiple videos over long periods of time can be susceptible to information overload and fatigue. To address this, we present a configurable perception pipeline architecture, called the Image Surveillance Assistant (ISA), for assisting watchstanders with video surveillance tasks. We also present ISA-1, an initial implementation that can be configured with a set of context specifications which watchstanders can select or provide to indicate what imagery should generate notifications. ISA-1's inputs include (1) an image and (2) context specifications, which contain English sentences and a decision boundary defined over object detection vectors. ISA-1 assesses the match of the image with the contexts by comparing (1) detected versus specified objects and (2) automatically generated versus specified captions. Finally, we present a study assessing the utility of using captions in ISA-1, and find that they substantially improve the performance of image context detection.

Finally, the notions of context, and the contrast used to separate context for better manipulation in the feedback work above, can benefit not only feedback architectures but feed-forward architectures as well. We apply this intuition to the task of action understanding in video, where the input is separated into motion and "context". Motivated by Goldman's Theory of Human Action - a framework in which action decomposes into 1) base physical movements, and 2) the context in which they occur - we propose a novel learning formulation for motion and context, where context is derived as the complement to motion. More specifically, we model physical movement through the adoption of Therbligs, a set of elemental physical motions centered around object manipulation. Context is modeled through the use of a contrastive mutual information loss that formulates context information as the action information not contained within movement information. We empirically demonstrate the utility brought by this separation of representation, showing sizable improvements in action recognition and action anticipation accuracies for a variety of models. We present results over two object manipulation datasets: EPIC Kitchens 100 and 50 Salads.

Item: Supervision and Data Dynamics in Vision Across Recognition and Generation Landscapes (2024)
Suri, Saksham; Shrivastava, Abhinav; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
This thesis looks at visual perception through the lens of supervision and data dynamics across the recognition and generation landscapes. Generative and discriminative modeling form important pillars of computer vision. Depending on the task, the techniques used to better learn from and utilize the data and labels can change. Through this work we investigate different tasks along this landscape, focusing on different supervision strategies, highlighting pitfalls in current approaches, and proposing modified architectures and losses to better utilize the data under different settings. On the recognition side we start by analyzing Vision Transformers (ViTs) through a comprehensive analysis under varied supervision paradigms. We look at a mix of explicit supervision, contrastive self-supervision, and reconstructive self-supervision by delving into attention mechanisms and learned representations. We then look at a more specific case of supervision geared towards object detection, called sparse supervision, where annotations are missing. We propose to utilize self- and semi-supervised techniques to solve this task. Finally, we also explore a discovery-style framework with applications to GAN-generated image detection. Unlike the sparse supervision discussed earlier, this scenario handles the case where, at test time, we have an unknown number of new classes. Ours was the first work to propose this problem, where instead of just identifying synthetic images, we also try to group them based on their generation source. The exploration of Generative Adversarial Networks (GANs) in an open-world scenario uncovers the intricacies of learning with limited supervision for discovery-style problems.
On the generation side, we delve into different supervision strategies involving decomposing and decoupling representations. In the first work we tackle the problem of paired Image-to-Image (I2I) translation by decomposing supervision into reconstruction and residuals, and highlight issues with traditional training approaches. We then look at generating talking-head videos through two different kinds of supervision: video and audio. For driving the generation using a video, we look at decoupling representations for the task of few-shot talking-head synthesis, where the supervision is provided using only a few samples (shots). For this task we factorize the representation into spatial and style components, which helps the learning. To additionally supervise the generation through audio, we look at multimodal supervision for lip-synchronized talking-head generation. For this we incorporate audio and video modalities to synthesize lifelike talking heads which work even in in-the-wild scenarios. In the last part we showcase two works which link our experiences from generation and recognition, where we explore generative modeling to improve recognition models. The first work utilizes advancements in diffusion-based image generation models to improve recognition models. Given the high fidelity and control of generation which diffusion models have brought, we utilize synthetic data from these models and create a suitable pipeline to use this data effectively to improve detection and segmentation performance. As a follow-up to our ViT analysis, we also propose a new technique to utilize off-the-shelf pretrained ViTs and generate high-resolution features using a learned lightweight feature transform. These high-resolution features are especially effective for dense tasks like correspondence, segmentation, detection, and object discovery.

Item: ADVANCED TECHNIQUES FOR RECONSTRUCTING OBJECTS AND SCENES FROM VARIATIONS IN LIGHTING AND VIEWPOINT (2024)
Lichy, Daniel Jesse; Jacobs, David W.; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Capturing the shape and material of objects and scenes is a cornerstone of computer vision research, with significant applications across augmented reality, e-commerce, healthcare, real estate, and robotics. This thesis explores two primary capture methods: Multiview Stereo (MVS), which leverages varying viewpoints, and Photometric Stereo (PS), which utilizes changes in lighting. To address some of the limitations inherent in these techniques, we introduce several novel methods. In the first part, we present a user-friendly PS setup requiring only a camera, a flashlight, and optionally a tripod—simple enough for home assembly. To support high-resolution captures from this setup, we introduce RecNet, a novel recursive architecture trained on low-resolution synthetic data yet capable of predicting high-resolution geometry and reflectance. RecNet demonstrably outperforms state-of-the-art PS systems, even with only a few input images. Traditionally, PS assumes that lighting is distant, which is impractical for large objects or those in confined spaces. Building on RecNet, we propose a novel method that integrates per-pixel lighting estimates and recursive depth estimation to address the challenges of near-field lighting, thus broadening PS's applicability.
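For context on the distant-lighting assumption mentioned above, here is a minimal sketch of the classical Lambertian photometric stereo baseline, which recovers a per-pixel normal and albedo by least squares from images taken under known distant lights; it is a textbook illustration under idealized assumptions, not the RecNet approach described in the thesis.

```python
import numpy as np

def lambertian_photometric_stereo(images: np.ndarray, lights: np.ndarray):
    """Classical distant-light photometric stereo.

    images: (K, H, W) grayscale intensities under K different lights.
    lights: (K, 3) unit light directions, assumed known and distant.
    Returns per-pixel unit normals (H, W, 3) and albedo (H, W).
    Assumes a Lambertian surface with no shadows or inter-reflections.
    """
    K, H, W = images.shape
    I = images.reshape(K, -1)                        # (K, H*W) stacked intensities
    # Solve L @ g = I for g = albedo * normal at every pixel (least squares).
    g, *_ = np.linalg.lstsq(lights, I, rcond=None)   # (3, H*W)
    albedo = np.linalg.norm(g, axis=0)               # (H*W,)
    normals = g / np.maximum(albedo, 1e-8)           # unit normals
    return normals.T.reshape(H, W, 3), albedo.reshape(H, W)

# Synthetic sanity check: render a known normal under three lights, then recover it.
n = np.array([0.0, 0.0, 1.0])
L = np.array([[0.0, 0.0, 1.0], [0.5, 0.0, 0.866], [0.0, 0.5, 0.866]])
imgs = (L @ n).reshape(3, 1, 1) * np.ones((3, 4, 4))   # albedo = 1 everywhere
N, rho = lambertian_photometric_stereo(imgs, L)
print(np.allclose(N[0, 0], n, atol=1e-6), rho[0, 0])
```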
While PS excels at capturing fine details, it often struggles with global geometry, introducing low-frequency distortions that complicate the stitching of multiple views into a complete object. Conversely, MVS captures global geometry effectively but tends to miss finer details. In the second part, we address the so-called Multiview Photometric Stereo (MVPS) problem, which leverages variations in both lighting and viewpoint. Our feedforward architecture, inspired by both MVS and PS techniques, enables geometry reconstruction that matches or exceeds the state of the art in quality, while being orders of magnitude faster. In scenarios where adjusting lighting conditions is impractical, such as in large or outdoor scenes, changing viewpoints often proves more feasible, especially when cameras are mounted on mobile platforms like drones or vehicles. Large field-of-view (FoV) cameras are preferable for these expansive scenes, as they enable faster and easier capture. However, adapting MVS models developed for small FoV to large FoV requires significant modifications and traditionally depends on scarce large-FoV training data. In the third part, we introduce novel architectures and data augmentation techniques to train networks on the abundant small-FoV data while allowing them to generalize to large-FoV scenarios. This approach demonstrates strong generalization capabilities across both indoor and outdoor datasets, effectively eliminating the need to acquire costly large-FoV-specific datasets for training large-FoV MVS models. Through these contributions, we aim to streamline and enhance the capture of shape and material, making it faster and more practical for a broad range of users—from casual hobbyists to industrial systems.

Item: SYNPLAY: IMPORTING REAL-WORLD DIVERSITY FOR A SYNTHETIC HUMAN DATASET (2024)
Yim, Jinsub; Bhattacharyya, Shuvra S.; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
In response to the growing demand for large-scale training data, synthetic datasets have emerged as practical solutions. However, existing synthetic datasets often fall short of replicating the richness and diversity of real-world data. Synthetic Playground (SynPlay) is introduced as a new synthetic human dataset that aims to bring out the diversity of human appearance in the real world. In this thesis, we focus on two factors to achieve a level of diversity that has not yet been seen in previous works: i) realistic human motions and poses, and ii) multiple camera viewpoints towards human instances. We first use a game engine and its library-provided elementary motions to create games where virtual players can take less-constrained and natural movements while following the game rules (i.e., rule-guided motion design as opposed to detail-guided design). We then augment the elementary motions with real human motions captured with a motion capture device. To render various human appearances in the games from multiple viewpoints, we use seven virtual cameras encompassing ground and aerial views, capturing abundant aerial-vs-ground and dynamic-vs-static attributes of the scene. Through extensive and carefully designed experiments, we show that using SynPlay in model training leads to enhanced accuracy over existing synthetic datasets for human detection and segmentation. Moreover, the benefit of SynPlay becomes even greater for tasks in the data-scarce regime, such as few-shot and cross-domain learning tasks.
These results clearly demonstrate that SynPlay can be used as an essential dataset with rich attributes of complex human appearances and poses suitable for model pretraining.

Item: Recognizing Object-Centric Attributes and Relations (2023)
Pham, Khoi; Shrivastava, Abhinav; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Recognizing an object's visual appearance through its attributes, such as color and shape, and its relations to other objects in an environment, is an innate human ability that allows us to effortlessly interact with the world. This ability remains effective even when humans encounter unfamiliar objects or objects whose appearances evolve over time, as humans can still identify them by discerning their attributes and relations. This dissertation aims to equip computer vision systems with this capability, empowering them to recognize objects' attributes and relations so that they become more robust in handling real-world scene complexities. The thesis is structured into two main parts. The first part focuses on recognizing attributes of objects, an area where existing research is limited to domain-specific attributes or constrained by small-scale and noisy data. We overcome these limitations by introducing a comprehensive dataset for attributes in the wild, marked by challenges with attribute diversity, label sparsity, and data imbalance. To navigate these challenges, we propose techniques that address class imbalance, employ attention mechanisms, and utilize contrastive learning for aligning objects with shared attributes. However, as such a dataset is expensive to collect, we also develop a framework that leverages large-scale, readily available image-text data for learning attribute prediction. The proposed framework can effectively scale up to predict a larger space of attribute concepts in real-world settings, including novel attributes represented by arbitrary text phrases that are not encountered during training. We showcase various applications of the proposed attribute prediction frameworks, including semantic image search and object image tagging with attributes. The second part delves into the understanding of visual relations between objects. First, we investigate how the interplay of attributes and relations can improve image-text matching. Moving beyond the computationally expensive cross-attention networks of previous studies, we introduce a dual-encoder framework using scene graphs that is more efficient yet equally powerful on current image-text retrieval benchmarks. Our approach can produce scene graph embeddings rich in attribute and relation semantics, which we show to be useful for image retrieval and image tagging. Lastly, we present our work on training large vision-language models on image-text data for recognizing visual relations. We formulate a new subject-centric approach that predicts multiple relations simultaneously, conditioned on a single subject. Our approach is among the first to learn from both weakly- and strongly-grounded image-text data to predict an extensive range of relationship classes.

Item: Leveraging Deep Generative Models for Estimation and Recognition (2023)
PNVR, Koutilya; Jacobs, David W.; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Generative models are a class of statistical models that estimate the joint probability distribution over a given observed variable and a target variable.
In computer vision, generative models are typically used to model the joint probability distribution of a set of real image samples assumed to lie on a complex high-dimensional image manifold. Recently proposed deep generative architectures such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and diffusion models (DMs) have been shown to generate photo-realistic images of human faces and other objects. These generative models have also become popular for other generative tasks such as image editing and text-to-image synthesis. As appealing as the perceptual quality of the generated images has become, the use of generative models for discriminative tasks such as visual recognition or geometry estimation has not been well studied. Moreover, with different kinds of powerful generative models becoming popular lately, it is important to study their significance in other areas of computer vision. In this dissertation, we demonstrate the advantages of using generative models for applications that go beyond just photo-realistic image generation: unsupervised domain adaptation (UDA) between synthetic and real datasets for geometry estimation, and text-based image segmentation for recognition.

In the first half of the dissertation, we propose a novel generative UDA method for combining synthetic and real images when training networks to determine geometric information from a single image. Specifically, we use a GAN model to map both the synthetic and real domains into a shared image space by translating just the domain-specific, task-related information from the respective domains. This is connected to a primary network for end-to-end training. Ideally, this results in images from the two domains presenting shared information to the primary network. Compared to previous approaches, we demonstrate improved domain-gap reduction and much better generalization between synthetic and real data for geometry estimation tasks such as monocular depth estimation and face normal estimation.

In the second half of the dissertation, we showcase the power of a recent class of generative models for improving an important recognition task: text-based image segmentation. Specifically, large-scale pre-training tasks like image classification, captioning, or self-supervised techniques do not incentivize learning the semantic boundaries of objects. However, recent generative foundation models built using text-based latent diffusion techniques may learn semantic boundaries, because they must synthesize intricate details about all objects in an image based on a text description. Therefore, we present a technique for segmenting real and AI-generated images using latent diffusion models (LDMs) trained on internet-scale datasets. First, we show that the latent space of LDMs (z-space) is a better input representation for text-based image segmentation than other feature representations such as RGB images or CLIP encodings. By training the segmentation models on the latent z-space, which creates a compressed representation across several domains like different forms of art, cartoons, illustrations, and photographs, we are also able to bridge the domain gap between real and AI-generated images. We show that the internal features of LDMs contain rich semantic information and present a technique, LD-ZNet, to further boost the performance of text-based segmentation. Overall, we show up to 6% improvement over standard baselines for text-based image segmentation on natural images. For AI-generated imagery, we show close to 20% improvement compared to state-of-the-art techniques.
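To make the z-space idea concrete, below is a minimal sketch of training a lightweight segmentation head on latents produced by a pretrained LDM autoencoder, with the text condition injected by simple per-channel modulation; the encoder interface, head design, and shapes are illustrative assumptions and not the LD-ZNet architecture itself.

```python
import torch
import torch.nn as nn

class ZSpaceSegHead(nn.Module):
    """Tiny segmentation head over LDM latents, conditioned on a text embedding."""
    def __init__(self, z_channels: int = 4, text_dim: int = 512, hidden: int = 64):
        super().__init__()
        self.conv_in = nn.Conv2d(z_channels, hidden, kernel_size=3, padding=1)
        # Text conditioning via a learned per-channel scale and shift (FiLM-style).
        self.to_scale_shift = nn.Linear(text_dim, 2 * hidden)
        self.conv_out = nn.Conv2d(hidden, 1, kernel_size=3, padding=1)
        # Latents are typically at 1/8 resolution, so upsample the mask back to pixels.
        self.up = nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False)

    def forward(self, z: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        """z: (B, z_channels, H/8, W/8) latents; text_emb: (B, text_dim)."""
        h = torch.relu(self.conv_in(z))
        scale, shift = self.to_scale_shift(text_emb).chunk(2, dim=-1)
        h = h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        return self.up(self.conv_out(h))          # (B, 1, H, W) mask logits

# Illustrative usage with random stand-ins for the VAE latents and text embedding
# (in practice these would come from a pretrained LDM encoder and text encoder).
head = ZSpaceSegHead()
z = torch.randn(2, 4, 32, 32)        # latents for 256x256 images at 1/8 scale
text = torch.randn(2, 512)           # pooled text embedding for the query phrase
masks = head(z, text)
print(masks.shape)                   # torch.Size([2, 1, 256, 256])
```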
Item: Activity Detection in Untrimmed Videos (2023)
Gleason, Joshua D.; Chellappa, Rama; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
In this dissertation, we present solutions to the problem of activity detection in untrimmed videos, where we are interested in identifying both when and where various activity instances occur within an unconstrained video. Advances in machine learning, particularly the widespread adoption of deep learning-based methods, have yielded robust solutions to a number of historically difficult computer vision application domains. For example, recent systems for object recognition and detection, facial identification, and a number of language processing applications have found widespread commercial success. In some cases, such systems have been able to outperform humans. The same cannot be said for the problem of activity detection in untrimmed videos. This dissertation describes our investigation of, and innovative solutions for, the challenging problem of real-time activity detection in untrimmed videos. The main contribution of our work is the introduction of multiple novel activity detection systems that make strides toward the goal of commercially viable activity detection. The first work introduces a proposal mechanism based on divisive hierarchical clustering of objects to produce cuboid activity proposals, followed by a classification and temporal refinement step. The second work proposes a chunk-based processing mechanism and explores the tradeoff between tube and cuboid proposals. The third work explores the topic of real-time activity detection and introduces strategies for achieving this performance. The final work provides a detailed look into multiple novel extensions that improve upon the state of the art in the field.
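As a simplified illustration of cuboid activity proposals built from clustered object detections, the sketch below groups detections by spatio-temporal proximity (using agglomerative clustering as a convenient stand-in for the divisive clustering described above) and wraps each group in a spatio-temporal bounding cuboid; the thresholds and detection format are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cuboid_proposals(detections, dist_thresh=50.0, time_scale=10.0):
    """Group per-frame object detections into spatio-temporal cuboid proposals.

    detections: (N, 5) array of [frame, x1, y1, x2, y2] in pixel coordinates.
    time_scale: weight that puts the frame index on a scale comparable to pixels.
    Returns a list of cuboids [t_start, t_end, x1, y1, x2, y2], one per cluster.
    """
    cx = (detections[:, 1] + detections[:, 3]) / 2
    cy = (detections[:, 2] + detections[:, 4]) / 2
    feats = np.stack([detections[:, 0] * time_scale, cx, cy], axis=1)
    labels = fcluster(linkage(feats, method="ward"), t=dist_thresh,
                      criterion="distance")
    proposals = []
    for lab in np.unique(labels):
        d = detections[labels == lab]
        proposals.append([d[:, 0].min(), d[:, 0].max(),     # temporal extent
                          d[:, 1].min(), d[:, 2].min(),     # spatial min corner
                          d[:, 3].max(), d[:, 4].max()])    # spatial max corner
    return proposals

# Toy example: two objects moving in different regions across two frames.
dets = np.array([[0, 40, 40, 60, 60], [1, 42, 41, 62, 61],
                 [0, 390, 400, 420, 430], [1, 392, 402, 422, 432]], dtype=float)
print(cuboid_proposals(dets))
```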