Computer Science Theses and Dissertations
Permanent URI for this collection: http://hdl.handle.net/1903/2756
Search Results
14 results
Item: Egocentric Vision in Assistive Technologies For and By the Blind (2022). Lee, Kyungjun; Kacorri, Hernisa; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Visual information in our surroundings, such as everyday objects and passersby, is often inaccessible to people who are blind. Cameras that leverage egocentric vision, in an attempt to approximate the visual field of the camera wearer, hold great promise for making the visual world more accessible to this population. Typically, such applications rely on pre-trained computer vision models and are thus limited. Moreover, as with any AI system that augments sensory abilities, conversations around ethical implications and privacy concerns lie at the core of their design and regulation. However, early efforts tend to decouple perspectives, considering only those of either the blind users or potential bystanders. In this dissertation, we revisit egocentric vision for the blind. Through a holistic approach, we examine the following dimensions: type of application (objects and passersby), camera form factor (handheld and wearable), user's role (a passive consumer and an active director of technology), and privacy concerns (from both end-users and bystanders). Specifically, we propose to design egocentric vision models that capture blind users' intent and are fine-tuned by the user in the context of object recognition. We seek to explore societal issues that AI-powered cameras may lead to, considering perspectives from both blind users and nearby people whose faces or objects might be captured by the cameras. Lastly, we investigate interactions and perceptions across different camera form factors to reveal design implications for future work.

Item: Object Detection and Instance Segmentation for Real-world Applications (2022). Lan, Shiyi; Davis, Larry; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Modern visual recognition systems have achieved great success over the past decade. Aided by this progress, instance localization and recognition have improved significantly, benefiting many applications such as face recognition, autonomous driving, and smart cities. Three key factors play important roles in this success: big computation, big data, and big models. Recent advances in hardware have increased available computation exponentially, making it feasible to train large, deep learning models on large-scale datasets. Large-scale visual datasets such as ImageNet (Deng et al., 2009), COCO (Lin et al., 2014), and YouTube-VIS (Yang et al., 2019) provide accurate and rich supervision for deep learning models. Moreover, advanced designs of deep neural networks (He et al., 2016; Xie et al., 2017; Liu et al., 2021; Liu et al., 2022) have greatly increased model capacity. At the same time, instance localization and recognition, as the core of modern visual systems, has many downstream applications, e.g., autonomous driving, augmented reality, virtual reality, and smart cities. Thanks to the advances of deep learning in the last decade, these applications have made great progress. In this thesis, we introduce a series of published works that improve the performance of instance localization and address issues in modeling instance localization and recognition with deep learning models. Moreover, we discuss future directions and some potential research projects.
Item: Reasoning about Geometric Object Interactions in 3D for Manipulation Action Understanding (2019). Zampogiannis, Konstantinos; Aloimonos, Yiannis; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
In order to interact efficiently with human users, intelligent agents and autonomous systems need the ability to interpret human actions. We focus our attention on manipulation actions, wherein an agent typically grasps an object and moves it, possibly altering its physical state. Agent-object and object-object interactions during a manipulation are a defining part of the performed action itself. In this thesis, we focus on extracting semantic cues, derived from geometric object interactions in 3D space during a manipulation, that are useful for action understanding at the cognitive level. First, we introduce a simple grounding model for the most common pairwise spatial relations between objects and investigate the descriptive power of their temporal evolution for action characterization. We propose a compact, abstract action descriptor that encodes the geometric object interactions during action execution, as captured by the spatial relation dynamics. Our experiments on a diverse dataset confirm both the validity and effectiveness of our spatial relation models and the discriminative power of our representation with respect to the underlying action semantics. Second, we model and detect lower-level interactions, namely object contacts and separations, viewing them as topological scene changes within a dense motion estimation setting. In addition to improving motion estimation accuracy in the challenging case of motion boundaries induced by these events, our approach shows promising performance in the explicit detection and classification of the latter. Building upon dense motion estimation and using detected contact events as an attention mechanism, we propose a bottom-up pipeline for the guided segmentation and rigid motion extraction of manipulated objects. Finally, in addition to our methodological contributions, we introduce a new open-source software library for point cloud data processing, developed for the needs of this thesis, which aims at providing an easy-to-use, flexible, and efficient framework for the rapid development of performant software for a range of 3D perception tasks.
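
To make the idea of grounding pairwise spatial relations concrete, here is a minimal sketch in the spirit of the descriptor above. The axis-aligned bounding-box representation, the particular relation names, and the tolerance value are illustrative assumptions, not the thesis's actual models:

```python
import numpy as np

def spatial_relations(box_a, box_b, eps=0.01):
    """Ground a few pairwise spatial relations between two objects.

    box_a, box_b: dicts with 'min' and 'max' 3D corners (numpy arrays)
    of axis-aligned bounding boxes in a common world frame (z is up).
    Returns the set of relation labels that hold for (a, b).
    """
    relations = set()
    a_min, a_max = box_a["min"], box_a["max"]
    b_min, b_max = box_b["min"], box_b["max"]

    # Vertical relations from the z extents.
    if a_min[2] >= b_max[2] - eps:
        relations.add("above")
    if a_max[2] <= b_min[2] + eps:
        relations.add("below")

    # Containment: a lies entirely inside b.
    if np.all(a_min >= b_min - eps) and np.all(a_max <= b_max + eps):
        relations.add("inside")

    # Contact: the boxes overlap (or nearly touch) along every axis.
    overlap = np.minimum(a_max, b_max) - np.maximum(a_min, b_min)
    if np.all(overlap >= -eps):
        relations.add("in_contact_with")

    return relations

# Toy example: a mug resting on a table.
mug = {"min": np.array([0.4, 0.4, 0.70]), "max": np.array([0.5, 0.5, 0.80])}
table = {"min": np.array([0.0, 0.0, 0.0]), "max": np.array([1.0, 1.0, 0.70])}
print(spatial_relations(mug, table))   # {'above', 'in_contact_with'}
```

Evaluating predicates like these at every frame of a manipulation and recording when each relation starts or stops holding yields the kind of relation dynamics that a compact action descriptor can encode.
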
Item: Modeling Deep Context in Spatial and Temporal Domain (2018). Dai, Xiyang; Davis, Larry S.; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Context has been one of the most important aspects of computer vision research because it provides useful guidance for solving a variety of tasks in both the spatial and temporal domains. With the recent rise of deep learning methods, deep networks have shown impressive performance on many computer vision tasks. Modeling deep context explicitly and implicitly in deep networks can further boost the effectiveness and efficiency of deep models. In the spatial domain, implicitly modeling context can be useful for learning discriminative texture representations; we present an effective deep fusion architecture that captures both first- and second-order statistics of texture features. Meanwhile, explicitly modeling context can also be important for challenging tasks such as fine-grained classification; we present a deep multi-task network that explicitly captures geometric constraints by simultaneously conducting fine-grained classification and key-point localization. In the temporal domain, explicitly modeling context can be crucial for activity recognition and localization. We present a temporal context network that explicitly captures the relative context around a proposal by sampling two temporal scales pairwise for precise temporal localization of human activities. Meanwhile, implicitly modeling context can lead to better network architectures for video applications; we present a temporal aggregation network that learns a deep hierarchical representation for capturing temporal consistency. Finally, we conduct research on jointly modeling context in both the spatial and temporal domains for human action understanding, which requires predicting where, when, and what a human action happens in a crowded scene. We present a decoupled framework with dedicated branches for spatial localization and temporal recognition. Context in the spatial and temporal branches is modeled explicitly and fused later to generate the final predictions.
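
As one way to picture the second-order texture statistics mentioned above, the following sketch pools a convolutional feature map into concatenated first- and second-order descriptors. The tensor shape, the signed-square-root normalization, and the plain concatenation are illustrative assumptions rather than the exact fusion architecture from the thesis:

```python
import numpy as np

def first_and_second_order_pool(feature_map):
    """Pool a CNN feature map into first- and second-order statistics.

    feature_map: array of shape (C, H, W) from some convolutional layer.
    Returns a 1-D descriptor concatenating the channel means (first order)
    with the flattened, normalized average outer product of the local
    feature vectors (second order).
    """
    C, H, W = feature_map.shape
    X = feature_map.reshape(C, H * W)            # C x N local descriptors

    first_order = X.mean(axis=1)                 # (C,)
    second_order = (X @ X.T) / (H * W)           # (C, C) averaged outer products

    # Signed square root + L2 normalization, a common stabilizer
    # for bilinear/second-order features.
    sec = np.sign(second_order) * np.sqrt(np.abs(second_order))
    sec = sec.flatten()
    sec /= (np.linalg.norm(sec) + 1e-12)

    return np.concatenate([first_order, sec])

# Example: a random 256-channel, 14x14 feature map.
desc = first_and_second_order_pool(np.random.randn(256, 14, 14))
print(desc.shape)   # (65792,) = 256 + 256*256
```

The second-order term is the average outer product of local feature vectors, which makes the descriptor sensitive to feature co-occurrences rather than only to mean activations.
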
Item: Finding Objects in Complex Scenes (2018). Sun, Jin; Jacobs, David; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Object detection is one of the fundamental problems in computer vision with great practical impact. Current object detectors work well under certain conditions; however, challenges arise when scenes become more complex. Scenes are often cluttered, and object detectors trained on Internet-collected data fail when there are large variations in objects' appearance. We believe the key to tackling those challenges is to understand the rich context of objects in scenes, which includes: the appearance variations of an object due to viewpoint and lighting changes; the relationships between objects and their typical environment; and the composition of multiple objects in the same scene. This dissertation studies the complexity of scenes from those aspects. To facilitate collecting training data with large variations, we design a novel user interface, ARLabeler, utilizing the power of Augmented Reality (AR) devices. Instead of labeling images from the Internet passively, we put an observer in the real world with full control over the scene's complexity. Users walk around freely and observe objects from multiple angles. Lighting can be adjusted. Objects can be added to and/or removed from the scene to create rich compositions. Our tool opens new possibilities for preparing data for complex scenes. We also study challenges in deploying object detectors in real-world scenes: detecting curb ramps in street view images. A system, Tohme, is proposed to combine detection results from automatic detectors with human crowdsourced verification. One core component is a meta-classifier that estimates the complexity of a scene and assigns it to a human (accurate but costly) or a computer (low cost but error-prone) accordingly. One of the insights from Tohme is that context is crucial for detecting objects. To understand the complex relationship between objects and their environment, we propose a standalone context model that predicts where an object can occur in an image. By combining this model with object detection, it can find regions where an object is missing. It can also be used to find out-of-context objects. To take a step beyond single-object-based detections, we explicitly model the geometric relationships between groups of objects and use the layout information to represent scenes as a whole. We show that such a strategy is useful in retrieving indoor furniture scenes with natural language inputs.

Item: Detecting Objects and Actions with Deep Learning (2018). Singh, Bharat; Davis, Larry S.; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Deep-learning-based visual recognition and localization is one of the pillars of computer vision and is the driving force behind applications like self-driving cars, visual search, video surveillance, and augmented reality, to name a few. This thesis identifies key bottlenecks in state-of-the-art visual recognition pipelines that use convolutional neural networks and proposes effective solutions to push their limits. A few shortcomings of convolutional neural networks are: a lack of scale invariance, which poses a challenge for tasks like object detection; a fixed network structure, which restricts their use when presented with new class labels; and difficulty in modeling long-range spatial/temporal dependencies. We provide evidence of these problems and then design effective solutions to overcome them. In the first part, an analysis of different techniques for recognizing and detecting objects under extreme scale variation is presented. Since small and large objects are difficult to recognize at smaller and larger scales of an image pyramid, respectively, we present a novel training scheme called Scale Normalization for Image Pyramids (SNIP), which selectively back-propagates the gradients of object instances of different sizes as a function of the image scale. As SNIP ignores gradients of objects at extreme resolutions, following up on this idea, we developed SNIPER (Scale Normalization for Image Pyramids with Efficient Re-sampling), an algorithm for performing efficient multi-scale training for instance-level visual recognition tasks. Instead of processing every pixel in an image pyramid, SNIPER processes context regions (512x512 pixels) around ground-truth instances at the appropriate scale. For background sampling, these context regions are generated using proposals extracted from a region proposal network trained with a short learning schedule. Hence, the number of chips generated per image during training adaptively changes based on the scene complexity. SNIPER brings training of instance-level recognition tasks like object detection closer to the protocol for image classification and suggests that the commonly accepted guideline that it is important to train on high-resolution images for instance-level visual recognition tasks might not be correct. Next, we present a real-time, large-scale object detector (R-FCN-3000) for detecting thousands of classes, in which objectness detection and classification are decoupled. To obtain the detection score for an RoI, we multiply the objectness score with the fine-grained classification score. We show that the objectness learned by R-FCN-3000 generalizes to novel classes and that performance increases with the number of training object classes, supporting the hypothesis that it is possible to learn a universal objectness detector. Because of this generalized objectness, we can train object detectors for new classes with classification data alone, without requiring bounding boxes.
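
The decoupled scoring just described (an RoI's detection score is the objectness score multiplied by the fine-grained classification score) can be sketched directly. Only the multiplication comes from the abstract; the sigmoid/softmax choices and the head shapes below are illustrative assumptions:

```python
import numpy as np

def decoupled_detection_scores(objectness_logit, class_logits):
    """Combine a class-agnostic objectness score with fine-grained
    classification scores for one RoI, in the spirit of the decoupled
    detector described above.

    objectness_logit: scalar logit for "is there an object here?"
    class_logits: (K,) logits over K fine-grained classes.
    Returns per-class detection scores of shape (K,).
    """
    objectness = 1.0 / (1.0 + np.exp(-objectness_logit))   # sigmoid
    class_probs = np.exp(class_logits - class_logits.max())
    class_probs /= class_probs.sum()                        # softmax
    return objectness * class_probs

# Example RoI: confident objectness, class 2 most likely.
scores = decoupled_detection_scores(2.0, np.array([0.1, 0.3, 2.5, -1.0]))
print(scores.argmax(), scores.max())
```

Because the objectness head is class-agnostic, adding a new class only requires new classification scores, which is what allows detectors for new classes to be trained from classification data alone.
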
Finally, we present a multi-stream bi-directional recurrent neural network for action detection. This was the first deep-learning-based system that could perform action localization in long videos, and it could do so with RGB data alone, without requiring skeletal models or intermediate tasks like pose estimation. Our system uses a tracking algorithm to locate a bounding box around the person, which provides a frame of reference for appearance and motion while suppressing background noise that is not within the bounding box. We train two additional streams on motion and appearance cropped to the tracked bounding box, along with full-frame streams. To model long-term temporal dynamics within and between actions, the multi-stream CNN is followed by a bi-directional Long Short-Term Memory (LSTM) layer. We show that our bi-directional LSTM network utilizes about 8 seconds of the video sequence to predict an action label and outperforms state-of-the-art methods on multiple benchmarks.

Item: Seeing Behind The Scene: Using Symmetry To Reason About Objects in Cluttered Environments (2017). Ecins, Aleksandrs; Aloimonos, Yiannis; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Rapid advances in robotic technology are bringing robots out of the controlled environments of assembly lines and factories into the unstructured and unpredictable real-life workspaces of human beings. One of the prerequisites for operating in such environments is the ability to grasp previously unobserved physical objects. To achieve this, individual objects have to be delineated from the rest of the environment and their shape properties estimated from incomplete observations of the scene. This remains a challenging task due to the lack of prior information about the shape and pose of the object, as well as occlusions in cluttered scenes. We attempt to solve this problem by utilizing the powerful concept of symmetry. Symmetry is ubiquitous in both natural and man-made environments. It reveals redundancies in the structure of the world around us and thus can be used in a variety of visual processing tasks. In this thesis we propose a complete pipeline for detecting symmetric objects and recovering their rotational and reflectional symmetries from 3D reconstructions of natural scenes. We begin by obtaining a multiple-view 3D point cloud of the scene using the Kinect Fusion algorithm. Additionally, a voxelized occupancy map of the scene is extracted in order to reason about occlusions. We propose two classes of algorithms for symmetry detection: curve-based and surface-based. The curve-based algorithm relies on extracting and matching surface normal edge curves in the point cloud. A more efficient surface-based algorithm works by fitting symmetry axes/planes to the geometry of the smooth surfaces of the scene. In order to segment the objects, we introduce a segmentation approach that uses symmetry as a global grouping principle: it extracts the points of the scene that are consistent with a given symmetry candidate. To evaluate the performance of our symmetry detection and segmentation algorithms, we construct a dataset of cluttered tabletop scenes with ground-truth object masks and corresponding symmetries. Finally, we demonstrate how our pipeline can be used by a mobile robot to detect and grasp objects in a household scenario.
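
A minimal sketch of using a reflectional symmetry candidate as a grouping cue, in the spirit of the segmentation step above. The brute-force nearest-neighbor check, the tolerance value, and the omission of occupancy-map occlusion reasoning are simplifications for illustration only:

```python
import numpy as np

def reflect_points(points, plane_point, plane_normal):
    """Reflect a point cloud across a plane given by a point and a normal."""
    n = plane_normal / np.linalg.norm(plane_normal)
    d = (points - plane_point) @ n          # signed distance to the plane
    return points - 2.0 * d[:, None] * n

def symmetry_support(points, plane_point, plane_normal, tol=0.01):
    """Score how well a candidate reflectional symmetry explains the cloud.

    A point supports the symmetry if its reflection lands within `tol`
    of some observed point; the supporting subset can then be grouped
    as belonging to the hypothesized symmetric object.
    """
    reflected = reflect_points(points, plane_point, plane_normal)
    # Brute-force nearest-neighbor distances (a k-d tree would be used in practice).
    dists = np.linalg.norm(reflected[:, None, :] - points[None, :, :], axis=2).min(axis=1)
    support = dists < tol
    return support.mean(), support

# Toy example: points symmetric about the x = 0 plane.
pts = np.array([[0.1, 0.0, 0.0], [-0.1, 0.0, 0.0], [0.2, 0.1, 0.3], [-0.2, 0.1, 0.3]])
score, mask = symmetry_support(pts, np.zeros(3), np.array([1.0, 0.0, 0.0]))
print(score)   # 1.0
```

In a real cluttered scene, points whose reflections fall into unobserved (occluded) space would not be penalized, which is where the voxelized occupancy map comes in.
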
Item: Learning Visual Patterns: Imposing Order on Objects, Trajectories and Networks (2011). Farrell, Ryan; Davis, Larry S.; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
This work considers the understanding of observed visual patterns in static images and dynamic scenes, a problem fundamental to many tasks in computer vision. Within this broad domain, we focus on three particular subtasks, contributing novel solutions to: (a) the subordinate categorization of objects (avian species specifically), (b) the analysis of multi-agent interactions using the agents' trajectories, and (c) the estimation of camera network topology. In contrast to object recognition, where the presence or absence of certain parts is generally indicative of the basic-level category, the problem of subordinate categorization rests on the ability to establish salient distinctions among the characteristics of those parts which comprise the basic-level category. Focusing on an avian domain due to the fine-grained structure of the category taxonomy, we explore a pose-normalized appearance model based on a volumetric poselet scheme. The variation in shape and appearance properties of these parts across a taxonomy provides the cues needed for subordinate categorization. Our model associates the underlying image pattern parameters used for detection with corresponding volumetric part location, scale, and orientation parameters. These parameters implicitly define a mapping from the image pixels into a pose-normalized appearance space, removing view and pose dependencies and facilitating fine-grained categorization with relatively few training examples. We next examine the problem of leveraging trajectories to understand interactions in dynamic multi-agent environments. We focus on perceptual tasks, those for which an agent's behavior is governed largely by the individuals and objects around them. We introduce kinetic accessibility, a model for evaluating the perceived, and thus anticipated, movements of other agents. This new model is then applied to the analysis of basketball footage. The kinetic accessibility measures are coupled with low-level visual cues and domain-specific knowledge for determining which player has possession of the ball and for recognizing events such as passes, shots, and turnovers. Finally, we present two differing approaches for estimating camera network topology. The first technique seeks to partition a set of observations made in the camera network into individual object trajectories. As exhaustive consideration of the partition space is intractable, partitions are considered incrementally, adding observations while pruning unlikely partitions. Partition likelihood is determined by the evaluation of a probabilistic graphical model, balancing the consistency of appearances across a hypothesized trajectory with the latest predictions of camera adjacency. A primary benefit of estimating object trajectories is that higher-order statistics, as opposed to just first-order adjacency, can be derived, yielding resilience to camera failure and the potential for improved tracking performance between cameras. Unlike the former centralized technique, the latter takes a decentralized approach, estimating the global network topology with local computations, using sequential Bayesian estimation on a modified multinomial distribution. Key to this method is an information-theoretic appearance model for observation weighting. The inherently distributed nature of the approach allows the simultaneous utilization of all sensors as processing agents in collectively recovering the network topology.
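
To convey the flavor of the decentralized estimate, here is a small sketch of a per-camera sequential Bayesian update over "where departing objects reappear". The Dirichlet pseudo-count form and the scalar appearance-confidence weight are illustrative assumptions standing in for the modified multinomial and the information-theoretic appearance model described above:

```python
import numpy as np

class AdjacencyEstimator:
    """Sequential Bayesian estimate of which cameras are adjacent to one camera.

    Maintains Dirichlet pseudo-counts over "the next camera where a departing
    object reappears". Each observed reappearance is weighted by an appearance
    match confidence in [0, 1], a stand-in for information-theoretic
    observation weighting.
    """

    def __init__(self, num_cameras, prior=1.0):
        self.counts = np.full(num_cameras, prior)   # Dirichlet prior

    def update(self, reappeared_at, weight):
        self.counts[reappeared_at] += weight        # weighted sufficient statistic

    def adjacency_probs(self):
        return self.counts / self.counts.sum()      # posterior mean of the multinomial

# Camera 0 in a 4-camera network: most departing tracks reappear at camera 2.
est = AdjacencyEstimator(num_cameras=4)
for cam, weight in [(2, 0.9), (2, 0.8), (1, 0.3), (2, 0.95), (3, 0.2)]:
    est.update(cam, weight)
print(est.adjacency_probs())
```

Because each camera only needs its own local observations to run this update, all sensors can act as processing agents simultaneously, which is the distributed property the abstract emphasizes.
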
Item: Analyzing Structured Scenarios by Tracking People and Their Limbs (2010). Morariu, Vlad Ion; Davis, Larry S.; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
The analysis of human activities is a fundamental problem in computer vision. Though complex, interactions between people and their environment often exhibit a spatio-temporal structure that can be exploited during analysis. This structure can be leveraged to mitigate the effects of missing or noisy visual observations caused, for example, by sensor noise, inaccurate models, or occlusion. Trajectories of people and their hands and feet, often sufficient for recognition of human activities, lead to a natural qualitative spatio-temporal description of these interactions. This work introduces the following contributions to the task of human activity understanding: 1) a framework that efficiently detects and tracks multiple interacting people and their limbs, 2) an event recognition approach that integrates both logical and probabilistic reasoning in analyzing the spatio-temporal structure of multi-agent scenarios, and 3) an effective computational model of the visibility constraints imposed on humans as they navigate through their environment. The tracking framework mixes probabilistic models with deterministic constraints and uses AND/OR search and lazy evaluation to efficiently obtain the globally optimal solution in each frame. Our high-level reasoning framework efficiently and robustly interprets noisy visual observations to deduce the events comprising structured scenarios. This is accomplished by combining First-Order Logic, Allen's Interval Logic, and Markov Logic Networks with an event hypothesis generation process that reduces the size of the ground Markov network. When applied to outdoor one-on-one basketball videos, our framework tracks the players and, guided by the game rules, analyzes their interactions with each other and the ball, annotating the videos with the relevant basketball events that occurred. Finally, motivated by studies of spatial behavior, we use a set of features from visibility analysis to represent spatial context in the interpretation of human spatial activities. We demonstrate the effectiveness of our representation on trajectories generated by humans in a virtual environment.
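
Since the reasoning framework above builds on Allen's Interval Logic, a small sketch of classifying the qualitative relation between two event intervals may help. The example event intervals and the recursion used for inverse relations are illustrative choices:

```python
def allen_relation(a, b):
    """Return the Allen interval relation that holds between intervals a and b.

    a, b: (start, end) tuples with start < end, e.g. frame ranges of two
    detected events. Only the seven "forward" relations are named here;
    the inverses are obtained by swapping the arguments.
    """
    (a_s, a_e), (b_s, b_e) = a, b
    if a_e < b_s:
        return "before"
    if a_e == b_s:
        return "meets"
    if a_s == b_s and a_e == b_e:
        return "equals"
    if a_s == b_s and a_e < b_e:
        return "starts"
    if a_e == b_e and a_s > b_s:
        return "finishes"
    if a_s > b_s and a_e < b_e:
        return "during"
    if a_s < b_s and b_s < a_e < b_e:
        return "overlaps"
    # Exactly one of the 13 relations always holds, so the swapped call matches.
    return "inverse-of-" + allen_relation(b, a)

print(allen_relation((10, 25), (20, 40)))   # overlaps
print(allen_relation((5, 10), (10, 30)))    # meets
```

Relations like these, asserted over detected event intervals, become the qualitative temporal predicates that the first-order rules and the Markov Logic Network reason over.
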
Item: Tractable Learning and Inference in High-Treewidth Graphical Models (2009). Domke, Justin; Aloimonos, Yiannis; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
Probabilistic graphical models, by making conditional independence assumptions, can represent complex joint distributions in a factorized form. However, in large problems graphical models often run into two issues. First, in non-treelike graphs, computational issues frustrate exact inference; there are several approximate inference algorithms that, while often working well, do not obey approximation bounds. Second, traditional learning methods are not robust with respect to model errors: if the conditional independence assumptions of the model are violated, poor predictions can result. This thesis proposes two new methods for learning the parameters of graphical models: implicit and procedural fitting. The goal of these methods is to improve the results of running a particular inference algorithm. Implicit fitting views inference as a large nonlinear energy function over predicted marginals; during learning, the parameters are adjusted to place the minima of this function close to the true marginals. Inspired by algorithms like loopy belief propagation, procedural fitting considers inference as a message-passing procedure, and parameters are adjusted during learning so that this message-passing process gives the best results. These methods are robust to both model errors and approximate inference because learning is done directly in terms of predictive accuracy.
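
To convey the core idea of fitting parameters to the behavior of a particular inference procedure, here is a toy sketch. It uses a fixed number of mean-field updates on a tiny binary chain and crude finite-difference gradients; both choices are stand-ins for illustration only, not the message-passing procedures or gradient computations developed in the thesis:

```python
import numpy as np

def mean_field_marginals(theta_unary, theta_pair, iters=20):
    """Run a fixed number of mean-field updates on a binary chain MRF.

    theta_unary: (N,) unary potentials; theta_pair: (N-1,) coupling strengths.
    Returns q[i], the approximate marginal P(x_i = 1). This truncated
    procedure plays the role of the inference algorithm being fitted to.
    """
    N = len(theta_unary)
    q = np.full(N, 0.5)
    for _ in range(iters):
        for i in range(N):
            field = theta_unary[i]
            if i > 0:
                field += theta_pair[i - 1] * (2 * q[i - 1] - 1)
            if i < N - 1:
                field += theta_pair[i] * (2 * q[i + 1] - 1)
            q[i] = 1.0 / (1.0 + np.exp(-field))
    return q

def fit_to_marginals(target, steps=300, lr=2.0, eps=1e-4):
    """Adjust parameters so the inference procedure's output matches target marginals."""
    n = len(target)
    params = np.zeros(2 * n - 1)                 # unaries, then pairwise couplings

    def loss(p):
        q = mean_field_marginals(p[:n], p[n:])
        return np.sum((q - target) ** 2)

    for _ in range(steps):
        # Forward finite-difference gradient (exact gradients would be used in practice).
        grad = np.array([(loss(params + eps * np.eye(len(params))[k]) - loss(params)) / eps
                         for k in range(len(params))])
        params -= lr * grad
    return params

target = np.array([0.9, 0.7, 0.2])               # "true" marginals to reproduce
params = fit_to_marginals(target)
print(np.round(mean_field_marginals(params[:3], params[3:]), 2))   # close to the target
```

The point is only that the loss is defined on the output of the truncated inference procedure itself, so the learned parameters compensate for whatever errors that procedure makes, which is what makes this style of learning robust to approximate inference.
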