Computer Science Theses and Dissertations
Permanent URI for this collection: http://hdl.handle.net/1903/2756
Item Inertially Constrained Ruled Surfaces for Visual Odometry (2024). Zhu, Chenqi; Aloimonos, Yiannis; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

In computer vision, camera egomotion is typically solved with visual odometry techniques that rely on feature extraction from a sequence of images and computation of the optical flow. This, however, often requires a point-to-point correspondence between two consecutive frames, which can be costly to compute, and whose varying accuracy greatly affects the quality of the estimated motion. Attempts have been made to bypass the difficulties originating from the correspondence problem by adopting line features and fusing other sensors (event cameras, IMUs), but many of these approaches still rely heavily on feature detectors. If the camera observes a straight line as it moves, the image of that line sweeps out a surface; this is a ruled surface, and analyzing its shape gives information about the egomotion. This research presents a novel algorithm to estimate 3D camera egomotion from scenes represented by ruled surfaces. By constraining the egomotion with inertial measurements from an onboard IMU sensor, the dimensionality of the solution space is greatly reduced.

Item Interpreting Visual Representations and Mitigating their Failures (2024). Kalibhat, Neha; Feizi, Soheil; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Deep learning has become the cornerstone of artificial intelligence (AI), particularly in the language and computer vision domains. The progression in this field is reflected in numerous applications accessible to the general public, such as information retrieval via virtual assistants, content generation, autonomous vehicles, drug discovery, and medical imaging. This unprecedented rate of AI adoption raises the critical need for research on the fundamental underpinnings of deep neural networks, to understand what leads to their decisions and why they fail. This thesis concentrates on self-supervised representation learning, a prevalent unsupervised method employed by foundation models to extract patterns from extensive visual data. Specifically, our focus lies in examining the low-dimensional representations generated by these models and dissecting their failure modes. In our initial investigation, we discover that self-supervised representations lack robustness to domain shifts, as they are not explicitly trained to distinguish image content from its domain. We remedy this issue by proposing a module that can be plugged into existing self-supervised baselines to disentangle their representation spaces and promote domain invariance and generalization. Our subsequent analysis delves into the patterns within representations that influence downstream classification. We scrutinize the discriminative capacity of individual features and their activations. We then propose an unsupervised quality metric that can preemptively determine, with high precision, whether a given representation will be correctly or incorrectly classified. In the next segment of this thesis, we leverage our findings to further demystify the representation space by uncovering interpretable subspaces that have unique concepts associated with them. We design a novel explainability framework that uses a vision-language model (such as CLIP) to provide natural language explanations for neural features (or groups of features) of a given pre-trained model.
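As a rough illustration of how a vision-language model can attach a natural-language concept to a neural feature, the sketch below scores a feature's top-activating image crops against a set of candidate concept phrases with CLIP. The file names and concept list are assumptions made for the example; this is a minimal toy sketch of the general idea, not the thesis's framework.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical inputs: image crops that maximally activate one neural feature,
# and candidate concept phrases that might explain that feature.
top_activating_images = [Image.open(p) for p in ["crop_0.png", "crop_1.png", "crop_2.png"]]
candidate_concepts = ["striped texture", "dog face", "blue sky", "wheel", "green foliage"]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

with torch.no_grad():
    inputs = processor(text=candidate_concepts, images=top_activating_images,
                       return_tensors="pt", padding=True)
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Cosine similarity between each crop and each concept phrase.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = image_emb @ text_emb.T          # (num_images, num_concepts)

# The concept with the highest average similarity across the feature's
# top-activating crops serves as a crude natural-language explanation.
best = similarity.mean(dim=0).argmax().item()
print(f"Feature explanation (toy): {candidate_concepts[best]}")
```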
We next investigate the role of augmentations and format transformations in learning generalizable visual representations. Drawing inspiration from advancements in the audio and speech modalities, we examine how presenting visual data in multiple formats affects learning, separating this effect from the impact of augmentations. In the final segment, we reveal compositionality as a notable failure mode in current state-of-the-art representation methods. We critique the use of fixed-size patches in vision transformers and demonstrate the benefits of employing semantically meaningful patches based on visual priors. This design adjustment leads to significant improvements in image-text retrieval tasks and, more importantly, enhances performance on compositionality benchmarks.

Item Object-Attribute Compositionality for Visual Understanding (2024). Saini, Nirat; Shrivastava, Abhinav; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Object appearances evolve over time, which results in visually discernible changes in their colors, shapes, sizes, and materials. Humans are innately good at recognizing and understanding the evolution of object states, which is also crucial for visual understanding across images and videos. However, current vision models still struggle to capture and account for these subtle changes when recognizing objects and the underlying actions causing the changes. This thesis focuses on using compositional learning for the recognition and generation of attribute-object pairs. In the first part, we propose to disentangle visual features for objects and attributes, to generalize recognition to novel object-attribute pairs. Next, we extend this approach to learn entirely unseen attribute-object pairs by using semantic language priors, label smoothing, and propagation techniques. Further, we use object states for action recognition in videos, where subtle changes in object attributes and affordances help in identifying state-modifying and context-transforming actions. All of these methods for decomposing and composing objects and states generalize to unseen pairs and out-of-domain datasets for various compositional zero-shot learning and action recognition tasks. In the second part, we propose a new benchmark suite, Chop & Learn, for the novel task of compositional image generation, and discuss the implications of these approaches for other compositional tasks in images, videos, and beyond. We further extend the insertion and editing of object attributes consistently across video frames using an off-the-shelf, training-free architecture, and discuss future challenges and opportunities of compositionality for visual understanding.

Item Feedback for Vision (2024). Maynord, Michael; Aloimonos, Yiannis; Fermüller, Cornelia; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Feedback plays a prominent role in biological vision, where perception is modulated based on agents' evolving expectations and world model. This is the case both in visually understanding the static structure of the world and in modeling the dynamic structure of action. In this thesis we present, first, an approach to incorporating controlled feedback into image understanding; second, an adaptation of this approach to action understanding; and lastly, a notion of feedback in video monitoring.
First, we introduce a novel mechanism that modulates perception based on high-level categorical expectations: Mid-Vision Feedback (MVF). MVF associates high-level contexts with linear transformations. When a context is "expected", its associated linear transformation is applied over feature vectors in a mid level of a network. The result is that mid-level network representations are biased towards conformance with high-level expectations, improving overall accuracy and contextual consistency. Additionally, during training, mid-level feature vectors are biased through the introduction of a loss term that increases the distance between feature vectors associated with different contexts. MVF is agnostic as to the source of contextual expectations, and can serve as a mechanism for top-down integration of symbolic systems with deep vision architectures. We demonstrate the utility of MVF for object classification across three popular datasets and multiple architectures, including both Convolutional Neural Network architectures and a Transformer architecture. We then adapt MVF for action understanding with Sub-Action Modulation (SAM) for video networks. When humans interpret action, they bring high-level expectations of the context in which those actions are being performed. Following this line of thinking, we develop an approach to incorporating context into action understanding. Video segments are classified uniquely into a small set of action primitives (called Therbligs), which are grouped hierarchically into "Meta-Therbligs" as a context representation. SAM is an approach to first modeling Meta-Therbligs, and then incorporating the expectation of Meta-Therbligs into mid-level processes through feedback. This allows the modulation of mid-level features in accordance with a temporally compositional representation of context. We show the superior performance of MVF over post-hoc filtering for incorporating contextual knowledge, and show the superior performance of configurations using predicted context (when no context is known a priori) over configurations with no context awareness. We demonstrate the utility of SAM over four popular video understanding architectures: I3D, MoViNet, TimeSFormer, and ViViT. Experiments over EPIC Kitchens and 50 Salads on the tasks of action recognition and anticipation demonstrate that SAM produces superior accuracies across all models, tasks, and datasets with minimal architectural alterations. Lastly, we consider a notion of "feedback" where high-level expectations, or specifications, are provided by human operators, allowing the integration of humans into the perceptual loop. This is important for interfacing with humans, as perceptual tasks that are conventionally left entirely to human labor are increasingly (yet imperfectly) automated. We consider the task of surveillance. Security watchstanders who monitor multiple videos over long periods of time can be susceptible to information overload and fatigue. To address this, we present a configurable perception pipeline architecture, called the Image Surveillance Assistant (ISA), for assisting watchstanders with video surveillance tasks. We also present ISA-1, an initial implementation that can be configured with a set of context specifications which watchstanders can select or provide to indicate what imagery should generate notifications. ISA-1's inputs include (1) an image and (2) context specifications, which contain English sentences and a decision boundary defined over object detection vectors.
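A minimal sketch of how such a context specification might be matched against detector output is shown below. The specification format, thresholds, and helper names are hypothetical; the code only illustrates the idea of a decision rule over object-detection evidence plus caption similarity, not the ISA-1 implementation.

```python
from dataclasses import dataclass

# Hypothetical context specification: an English description, the object labels
# it implies, and decision thresholds over the available evidence.
@dataclass
class ContextSpec:
    description: str           # e.g., "a small boat approaching the pier"
    required_objects: set      # object labels the context implies
    min_object_overlap: float  # threshold over object-detection evidence
    min_caption_sim: float     # threshold over caption similarity


def matches_context(detected_labels, generated_caption, spec, caption_sim_fn):
    """Toy matching rule: combine object-detection overlap with similarity
    between the generated caption and the specification text. caption_sim_fn
    is any text-similarity function, e.g. cosine similarity of embeddings."""
    overlap = len(spec.required_objects & set(detected_labels)) / max(len(spec.required_objects), 1)
    caption_sim = caption_sim_fn(generated_caption, spec.description)
    return overlap >= spec.min_object_overlap and caption_sim >= spec.min_caption_sim


# Example usage with stand-in inputs and a dummy similarity function.
spec = ContextSpec("a small boat approaching the pier", {"boat", "person"}, 0.5, 0.6)
detected = ["boat", "buoy"]
caption = "a boat moving toward a dock"
notify = matches_context(detected, caption, spec, lambda a, b: 0.7)
print("notify watchstander:", notify)
```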
ISA-1 assesses the match of the image with the contexts by comparing (1) detected versus specified objects and (2) automatically generated versus specified captions. We also present a study assessing the utility of captions in ISA-1, and find that they substantially improve the performance of image context detection. Finally, the notions of context, and of the contrast used to separate context for better manipulation in the feedback work above, can benefit not only feedback architectures but feed-forward architectures as well. We apply this intuition to the task of action understanding in video, where the input is separated into motion and "context". Motivated by Goldman's Theory of Human Action - a framework in which action decomposes into 1) base physical movements, and 2) the context in which they occur - we propose a novel learning formulation for motion and context, where context is derived as the complement to motion. More specifically, we model physical movement through the adoption of Therbligs, a set of elemental physical motions centered around object manipulation. Context is modeled through a contrastive mutual information loss that formulates context information as the action information not contained within movement information. We empirically demonstrate the utility brought by this separation of representation, showing sizable improvements in action recognition and action anticipation accuracies for a variety of models. We present results over two object manipulation datasets: EPIC Kitchens 100 and 50 Salads.

Item Supervision and Data Dynamics in Vision Across Recognition and Generation Landscapes (2024). Suri, Saksham; Shrivastava, Abhinav; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

This thesis looks at visual perception through the lens of supervision and data dynamics across the recognition and generation landscapes. Generative and discriminative modeling form important pillars of computer vision. Depending on the task, the techniques used to better learn from and utilize the data and labels can change. Through this work we investigate different tasks along this landscape, focusing on different supervision strategies, highlighting pitfalls in current approaches, and proposing modified architectures and losses to utilize the data better under different settings. On the recognition side, we start by analyzing Vision Transformers (ViTs) through a comprehensive analysis under varied supervision paradigms. We look at a mix of explicit supervision, contrastive self-supervision, and reconstructive self-supervision by delving into attention mechanisms and learned representations. We then look at a more specific case of supervision geared towards object detection, called sparse supervision, where there are missing annotations. We propose to utilize self- and semi-supervised techniques to solve this task. Finally, we also explore a discovery-style framework with applications to GAN-generated image detection. Unlike the sparse supervision discussed earlier, this scenario handles the case where, at test time, we have an unknown number of new classes. Ours was the first work to propose this problem, where instead of just identifying synthetic images, we also try to group them based on their generation source. The exploration of Generative Adversarial Networks (GANs) in an open-world scenario uncovers the intricacies of learning with limited supervision for discovery-style problems.
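To make the discovery-style setting concrete, here is a toy sketch, under assumed inputs, of grouping images already flagged as synthetic by clustering their feature embeddings and choosing the number of clusters with a silhouette criterion. It only illustrates open-world source grouping in general; it is not the method developed in the thesis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical input: feature embeddings of images already flagged as synthetic
# (e.g., penultimate-layer features of a fake-image detector), one row per image.
rng = np.random.default_rng(0)
features = rng.normal(size=(300, 128))  # stand-in for real embeddings

# The number of generation sources is unknown at test time, so sweep candidate
# cluster counts and keep the one with the best silhouette score.
best_k, best_score, best_labels = None, -1.0, None
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    score = silhouette_score(features, labels)
    if score > best_score:
        best_k, best_score, best_labels = k, score, labels

print(f"estimated number of generation sources: {best_k}")
# best_labels now assigns each flagged image to a hypothesized source group.
```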
On the generation side, we delve into different supervision strategies involving decomposing and decoupling representations. In the first work, we tackle the problem of paired Image-to-Image (I2I) translation by decomposing supervision into reconstruction and residuals, and we highlight issues with traditional training approaches. We then look at generating talking-head videos through two different kinds of supervision: video and audio. For driving the generation using a video, we look at decoupling representations for the task of few-shot talking-head synthesis, where the supervision is provided using only a few samples (shots). For this task we factorize the representation into spatial and style components, which helps the learning. To additionally supervise the generation through audio, we look at multimodal supervision for lip-synchronized talking-head generation. For this we incorporate the audio and video modalities to synthesize lifelike talking heads that work even in in-the-wild scenarios. In the last part we showcase two works that link our experiences from generation and recognition, exploring generative modeling to improve recognition models. The first work utilizes advancements in diffusion-based image generation models: given the high fidelity and control of generation that diffusion models have brought, we utilize synthetic data from these models and create a suitable pipeline to use this data effectively to improve detection and segmentation performance. As a follow-up to our ViT analysis, we also propose a new technique that takes off-the-shelf pretrained ViTs and generates high-resolution features using a learned lightweight feature transform. These high-resolution features are especially effective for dense tasks like correspondence, segmentation, detection, and object discovery.

Item Advanced Techniques for Reconstructing Objects and Scenes from Variations in Lighting and Viewpoint (2024). Lichy, Daniel Jesse; Jacobs, David W; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Capturing the shape and material of objects and scenes is a cornerstone of computer vision research, with significant applications across augmented reality, e-commerce, healthcare, real estate, and robotics. This thesis explores two primary capture methods: Multiview Stereo (MVS), which leverages varying viewpoints, and Photometric Stereo (PS), which utilizes changes in lighting. To address some of the limitations inherent in these techniques, we introduce several novel methods. In the first part, we present a user-friendly PS setup requiring only a camera, a flashlight, and optionally a tripod, simple enough for home assembly. To support high-resolution captures from this setup, we introduce RecNet, a novel recursive architecture trained on low-resolution synthetic data yet capable of predicting high-resolution geometry and reflectance. RecNet demonstrably outperforms state-of-the-art PS systems, even with only a few input images. Traditionally, PS assumes that lighting is distant, which is impractical for large objects or those in confined spaces. Building on RecNet, we propose a novel method that integrates per-pixel lighting estimates and recursive depth estimation to address the challenges of near-field lighting, thus broadening PS's applicability.
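As background for the near-field setting mentioned above, the sketch below computes per-pixel lighting directions and inverse-square attenuation for a point light source, given a depth map and camera intrinsics. The shapes and variable names are illustrative assumptions, and this is the standard near-field point-light model rather than the thesis's network.

```python
import torch

def near_field_lighting(depth, K, light_pos):
    """Per-pixel light direction and attenuation for a near point light.

    depth:     (H, W) depth map in meters (assumed known or estimated)
    K:         (3, 3) camera intrinsics
    light_pos: (3,) light position in camera coordinates
    Returns unit light directions (H, W, 3) and 1/r^2 attenuation (H, W).
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=depth.dtype),
                          torch.arange(W, dtype=depth.dtype), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)        # (H, W, 3)

    # Back-project each pixel to a 3D point in camera coordinates: X = depth * K^-1 [u, v, 1]^T
    rays = pix @ torch.linalg.inv(K).T                           # (H, W, 3)
    points = rays * depth.unsqueeze(-1)                          # (H, W, 3)

    # Direction from each surface point to the light, and inverse-square falloff.
    to_light = light_pos.view(1, 1, 3) - points                  # (H, W, 3)
    dist = to_light.norm(dim=-1, keepdim=True).clamp(min=1e-6)
    return to_light / dist, 1.0 / dist.squeeze(-1) ** 2


# Example with stand-in values: a flat scene 0.5 m away and a flashlight near the camera.
depth = torch.full((240, 320), 0.5)
K = torch.tensor([[300.0, 0.0, 160.0], [0.0, 300.0, 120.0], [0.0, 0.0, 1.0]])
light_dirs, attenuation = near_field_lighting(depth, K, torch.tensor([0.05, 0.0, 0.0]))
```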
While PS excels at capturing fine details, it often struggles with global geometry, introducing low-frequency distortions that complicate the stitching of multiple views into a complete object. Conversely, MVS captures global geometry effectively but tends to miss finer details. In the second part, we address the so-called Multiview Photometric Stereo (MVPS) problem, which leverages variations in both lighting and viewpoint. Our feedforward architecture, inspired by both MVS and PS techniques, enables geometry reconstruction that matches or exceeds the state of the art in quality, while being orders of magnitude faster. In scenarios where adjusting lighting conditions is impractical, such as in large or outdoor scenes, changing viewpoints often proves more feasible, especially when cameras are mounted on mobile platforms like drones or vehicles. Large field-of-view (FoV) cameras are preferable for these expansive scenes, as they enable faster and easier capture. However, adapting MVS models developed for small FoVs to large FoVs requires significant modifications and traditionally depends on scarce large-FoV training data. In the third part, we introduce novel architectures and data augmentation techniques that train networks on the abundant small-FoV data yet allow them to generalize to large-FoV scenarios. This approach demonstrates strong generalization capabilities across both indoor and outdoor datasets, effectively eliminating the need to acquire costly large-FoV-specific datasets for training large-FoV MVS models. Through these contributions, we aim to streamline and enhance the capture of shape and material, making it faster and more practical for a broad range of users, from casual hobbyists to industrial systems.

Item Recognizing Object-Centric Attributes and Relations (2023). Pham, Khoi; Shrivastava, Abhinav; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Recognizing an object's visual appearance through its attributes, such as color and shape, and its relations to other objects in an environment is an innate human ability that allows us to effortlessly interact with the world. This ability remains effective even when humans encounter unfamiliar objects or objects whose appearances evolve over time, as humans can still identify them by discerning their attributes and relations. This dissertation aims to equip computer vision systems with this capability, empowering them to recognize objects' attributes and relations and thereby become more robust in handling real-world scene complexities. The thesis is structured into two main parts. The first part focuses on recognizing attributes of objects, an area where existing research is limited to domain-specific attributes or constrained by small-scale and noisy data. We overcome these limitations by introducing a comprehensive dataset for attributes in the wild, marked by challenges of attribute diversity, label sparsity, and data imbalance. To navigate these challenges, we propose techniques that address class imbalance, employ attention mechanisms, and utilize contrastive learning to align objects with shared attributes. However, as such a dataset is expensive to collect, we also develop a framework that leverages large-scale, readily available image-text data for learning attribute prediction.
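A rough illustration of how image-text pretraining can be repurposed for open-vocabulary attribute prediction: score an object crop against prompt templates built from arbitrary attribute phrases. The prompts, file name, and model choice below are assumptions for the sketch, not the dissertation's framework.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical attribute vocabulary; because attributes are free-form text,
# novel phrases can be added at inference time without retraining.
attribute_phrases = ["red", "striped", "made of metal", "partially peeled", "wet"]
prompts = [f"a photo of a {attr} object" for attr in attribute_phrases]

object_crop = Image.open("object_crop.png")  # an object region cropped from an image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

with torch.no_grad():
    inputs = processor(text=prompts, images=object_crop, return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image.squeeze(0)  # similarity of the crop to each prompt

# Rank attribute phrases by similarity; a threshold or top-k turns this into tags.
for score, attr in sorted(zip(logits.tolist(), attribute_phrases), reverse=True):
    print(f"{attr}: {score:.2f}")
```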
The framework proposed in this dissertation can effectively scale up to predict a larger space of attribute concepts in real-world settings, including novel attributes represented by arbitrary text phrases that are not encountered during training. We showcase various applications of the proposed attribute prediction frameworks, including semantic image search and object image tagging with attributes. The second part delves into the understanding of visual relations between objects. First, we investigate how the interplay of attributes and relations can improve image-text matching. Moving beyond the computationally expensive cross-attention networks of previous studies, we introduce a dual-encoder framework using scene graphs that is more efficient yet equally powerful on current image-text retrieval benchmarks. Our approach produces scene graph embeddings rich in attribute and relation semantics, which we show to be useful for image retrieval and image tagging. Lastly, we present our work on training large vision-language models on image-text data for recognizing visual relations. We formulate a new subject-centric approach that predicts multiple relations simultaneously, conditioned on a single subject. Our approach is among the first to learn from both weakly- and strongly-grounded image-text data to predict an extensive range of relationship classes.

Item Learning and Composing Primitives for the Visual World (2023). Gupta, Kamal; Shrivastava, Abhinav; Davis, Larry; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

Compositionality is at the core of how humans understand and create visual data. In order for computational approaches to assist humans in creative tasks, it is crucial for them to understand and perform composition. The recent advances in deep generative models have enabled us to convert noise into highly realistic scenes. However, in order to harness these models for building real-world applications, I argue that we need to be able to represent and control the generation process with the composition of interpretable primitives. In the first half of this thesis, I discuss how deep models can discover such primitives from visual data. By playing a cooperative referential game between two neural network agents, we can represent images with discrete, meaningful concepts without supervision. I further extend this work to applications in image and video editing by learning a dense correspondence of primitives across images. In the second half, I focus on learning how to compose primitives for both 2D and 3D visual data. By expressing scenes as an assembly of smaller parts, we can easily perform generation from scratch or from partial scenes given as input. I conclude with a discussion of possible future directions and applications of generative models, and how we can better enable users to guide the creative process.

Item Dense 3D Reconstructions from Sparse Visual Data (2022). Hu, Tao; Zwicker, Matthias; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

3D reconstruction, the problem of estimating the complete geometry or appearance of objects from partial observations (e.g., several RGB images, partial shapes, videos), serves as a building block in many vision, graphics, and robotics applications such as 3D scanning, autonomous driving, 3D modeling, augmented reality (AR), and virtual reality (VR).
However, it is very challenging for machines to recover 3D geometry from such sparse data due to occlusions and the irregularity and complexity of 3D objects. To address these challenges, in this dissertation we explore learning-based 3D reconstruction methods for different 3D object representations and tasks: 3D reconstruction of static objects and of the dynamic human body from limited data. For the 3D reconstruction of static objects, we propose a multi-view representation of 3D shapes, which utilizes a set of multi-view RGB images or depth maps to represent a 3D shape. We first explore the multi-view representation for shape completion tasks and develop deep learning methods to generate dense, high-resolution point clouds from partial observations. One problem with the multi-view representation, however, is the inconsistency among different views. To solve this problem, we propose a multi-view consistency optimization strategy that encourages consistency for shape completion at inference time. Third, we present an extension of the multi-view representation for dense 3D geometry and texture reconstruction from single RGB images. Capturing and rendering realistic human appearances under varying poses and viewpoints is an important goal in computer vision and graphics. In the second part, we introduce techniques to create 3D virtual human avatars from limited data (e.g., videos). We propose implicit representations of motion, texture, and geometry for human modeling, and utilize neural rendering techniques for free-view synthesis of the dynamic articulated human body. Our learned human avatars are photorealistic and fully controllable (pose, shape, viewpoint, etc.), and can be used in free-viewpoint video generation, animation, shape editing, telepresence, and AR/VR. Our proposed methods learn end-to-end 3D reconstruction from 2D image or video signals. We hope these learning-based methods will assist in perceiving and reconstructing the 3D world for future AI systems.

Item Towards Autonomous Driving in Dense, Heterogeneous, and Unstructured Traffic (2022). Chandra, Rohan; Manocha, Dinesh; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)

This dissertation addresses many key problems in autonomous driving towards handling dense, heterogeneous, and unstructured traffic environments. Autonomous vehicles (AVs) at present are restricted to operating on smooth and well-marked roads, in sparse traffic, and among well-behaved drivers. We developed new techniques to perceive, predict, and plan among human drivers in traffic that is significantly denser in terms of the number of traffic agents, more heterogeneous in terms of the sizes and dynamic constraints of those agents, and where many drivers do not follow the traffic rules. In this thesis, we present work along three themes: perception, driver behavior modeling, and planning. Our novel contributions include:

1. Improved tracking and trajectory prediction algorithms for dense and heterogeneous traffic using a combination of computer vision and deep learning techniques.

2. A novel behavior modeling approach using graph theory for characterizing human drivers as aggressive or conservative from their trajectories (a toy illustration of this graph-theoretic idea appears after this list).

3. Behavior-driven planning and navigation algorithms in mixed (human driver and AV) and unstructured traffic environments using game theory and risk-aware control.
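As the toy illustration referenced in contribution 2 above (and not the dissertation's actual metric), the sketch below builds a proximity graph over vehicles at each timestep from their trajectories and tracks how each vehicle's closeness centrality changes over time; a steep rise in centrality is used here as a crude proxy for aggressive, gap-seeking behavior. All names, radii, and thresholds are hypothetical.

```python
import networkx as nx
import numpy as np

def centrality_profile(trajectories, radius=10.0):
    """trajectories: dict vehicle_id -> (T, 2) array of x, y positions.
    Returns dict vehicle_id -> list of closeness-centrality values over time."""
    ids = list(trajectories)
    T = len(next(iter(trajectories.values())))
    profile = {vid: [] for vid in ids}
    for t in range(T):
        G = nx.Graph()
        G.add_nodes_from(ids)
        # Connect vehicles that are within `radius` meters of each other at time t.
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                if np.linalg.norm(trajectories[a][t] - trajectories[b][t]) < radius:
                    G.add_edge(a, b)
        cent = nx.closeness_centrality(G)
        for vid in ids:
            profile[vid].append(cent[vid])
    return profile

def is_aggressive(centrality_series, slope_threshold=0.02):
    """Crude proxy: a steep average increase in centrality suggests the driver
    keeps closing in on its neighbors (e.g., weaving or tailgating)."""
    slope = np.polyfit(np.arange(len(centrality_series)), centrality_series, 1)[0]
    return slope > slope_threshold

# Example with synthetic trajectories for three vehicles.
rng = np.random.default_rng(1)
trajs = {vid: np.cumsum(rng.normal(1.0, 0.3, size=(50, 2)), axis=0)
         for vid in ["car_0", "car_1", "car_2"]}
profiles = centrality_profile(trajs)
print({vid: is_aggressive(p) for vid, p in profiles.items()})
```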
Additionally, we have released a new traffic dataset, METEOR, which captures rare and interesting multi-agent driving behaviors in India. These behaviors are grouped into traffic violations, atypical interactions, and diverse scenarios. We evaluate our perception work on tracking and trajectory prediction using standard autonomous driving datasets such as Waymo Open Motion, Argoverse, and nuScenes, as well as public leaderboards, where our tracking approach achieved rank 1 among over 100 methods. We apply human driver behavior modeling to planning and navigation in unsignaled intersection and highway scenarios using state-of-the-art traffic simulators, and show that our approach yields fewer collisions and deadlocks than methods based on deep reinforcement learning. We conclude with a discussion of future work.
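As context for the trajectory prediction evaluation mentioned above, here is a minimal sketch of the standard average and final displacement error (ADE/FDE) metrics commonly reported on such benchmarks; it is illustrative background, not code from the dissertation.

```python
import numpy as np

def ade_fde(predicted, ground_truth):
    """Average and final displacement errors for one predicted trajectory.

    predicted, ground_truth: (T, 2) arrays of x, y positions over T future steps.
    ADE averages the per-step Euclidean error; FDE is the error at the last step.
    """
    errors = np.linalg.norm(predicted - ground_truth, axis=1)
    return errors.mean(), errors[-1]

# Example with made-up trajectories.
gt = np.stack([np.arange(10, dtype=float), np.zeros(10)], axis=1)
pred = gt + np.array([0.0, 0.5])   # constant 0.5 m lateral offset
ade, fde = ade_fde(pred, gt)
print(f"ADE = {ade:.2f} m, FDE = {fde:.2f} m")   # both 0.50 m here
```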