UMD Theses and Dissertations

Permanent URI for this collection: http://hdl.handle.net/1903/3

New submissions to the thesis/dissertation collections are added automatically as they are received from the Graduate School. Currently, the Graduate School deposits all theses and dissertations from a given semester after the official graduation date. This means that there may be a delay of up to four months before a given thesis/dissertation appears in DRUM.

More information is available at Theses and Dissertations at University of Maryland Libraries.


Search Results

Now showing 1 - 10 of 21
  • Item
    Egocentric Vision in Assistive Technologies For and By the Blind
    (2022) Lee, Kyungjun; Kacorri, Hernisa; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Visual information in our surroundings, such as everyday objects and passersby, is often inaccessible to people who are blind. Cameras that leverage egocentric vision, in an attempt to approximate the visual field of the camera wearer, hold great promise for making the visual world more accessible for this population. Typically, such applications rely on pre-trained computer vision models and thus are limited. Moreover, as with any AI system that augments sensory abilities, conversations around ethical implications and privacy concerns lie at the core of their design and regulation. However, early efforts tend to decouple perspectives, considering only either those of the blind users or potential bystanders. In this dissertation, we revisit egocentric vision for the blind. Through a holistic approach, we examine the following dimensions: type of application (objects and passersby), camera form factor (handheld and wearable), user’s role (a passive consumer and an active director of technology), and privacy concerns (from both end-users and bystanders). Specifically, we propose to design egocentric vision models that capture blind users’ intent and are fine-tuned by the user in the context of object recognition. We seek to explore societal issues that AI-powered cameras may lead to, considering perspectives from both blind users and nearby people whose faces or objects might be captured by the cameras. Last, we investigate interactions and perceptions across different camera form factors to reveal design implications for future work.
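    To make the "fine-tuned by the user" idea concrete, here is a minimal sketch, not taken from the dissertation, of how a pre-trained recognizer could be personalized on a handful of user-captured object photos: freeze a generic backbone and retrain only a small classification head. The model choice (MobileNetV3 via torchvision) and the training loop are illustrative assumptions.

    ```python
    # Hypothetical sketch of few-shot personalization of an object recognizer.
    import torch
    import torch.nn as nn
    from torchvision import models

    def build_personalizable_recognizer(num_user_classes: int) -> nn.Module:
        """Pre-trained backbone with a trainable head for a user's own objects."""
        model = models.mobilenet_v3_small(weights="DEFAULT")
        for p in model.parameters():
            p.requires_grad = False                      # keep generic features fixed
        in_features = model.classifier[-1].in_features
        model.classifier[-1] = nn.Linear(in_features, num_user_classes)  # only this layer trains
        return model

    def fine_tune(model, loader, epochs=5, lr=1e-3):
        """Short fine-tuning loop on user-captured images (loader is assumed)."""
        opt = torch.optim.Adam(model.classifier[-1].parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        model.train()
        for _ in range(epochs):
            for images, labels in loader:
                opt.zero_grad()
                loss = loss_fn(model(images), labels)
                loss.backward()
                opt.step()
        return model
    ```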
  • Item
    Situated Analytics for Data Scientists
    (2022) Batch, Andrea; Elmqvist, Niklas E; Library & Information Services; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Much of Mark Weiser's vision of "ubiquitous computing" has come to fruition: We live in a world of interfaces that connect us with systems, devices, and people wherever we are. However, those of us in jobs that involve analyzing data and developing software find ourselves tied to environments that limit when and where we may conduct our work; it is ungainly and awkward to pull out a laptop during a stroll through a park, for example, but difficult to write a program on one's phone. In this dissertation, I discuss the current state of data visualization in data science and analysis workflows, the emerging domains of immersive and situated analytics, and how immersive and situated implementations and visualization techniques can be used to support data science. I then describe the results of several years of my own empirical work with data scientists and other analytical professionals, particularly (though not exclusively) those employed with the U.S. Department of Commerce. These results, as they relate to visualization and visual analytics design based on user task performance, observations by the researcher and participants, and evaluation of observational data collected during user sessions, represent the first thread of research I discuss in this dissertation. I demonstrate how they might act as the guiding basis for my implementation of immersive and situated analytics systems and techniques. As a data scientist and economist myself, I am naturally inclined to want to use high-frequency observational data to the end of realizing a research goal; indeed, a large part of my research contributions, and a second "thread" of research presented in this dissertation, has been around interpreting user behavior using real-time data collected during user sessions. I argue that the relationship between immersive analytics and data science can and should be reciprocal: while immersive implementations can support data science work, methods borrowed from data science are particularly well suited to supporting the evaluation of the embodied interactions common in immersive and situated environments. I make this argument based on both the ease and the importance of collecting spatial data from user sessions, via the sensors required for immersive systems to function, which I have experienced during the course of my own empirical work with data scientists. As part of this second thread of research, this dissertation introduces a framework for interpreting user session data that I evaluate with user experience researchers working in the tech industry. Finally, this dissertation presents a synthesis of these two threads of research: I combine the design guidelines derived from my empirical work with machine learning and signal processing techniques to interpret user behavior in real time in Wizualization, a mid-air gesture- and speech-based augmented reality visual analytics system.
  • Item
    Object Detection and Instance Segmentation for Real-world Applications
    (2022) Lan, Shiyi; Davis, Larry; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Modern visual recognition systems have achieved great success in the past decade. Aided by this progress, instance localization and recognition have been significantly improved, benefiting many applications such as face recognition, autonomous driving, and smart cities. Three key factors play important roles in the success of visual recognition: big computation, big data, and big models. Recent advances in hardware have increased available computation exponentially, which makes it feasible to train deep, large learning models on large-scale datasets. In addition, large-scale visual datasets such as ImageNet (Deng et al., 2009), COCO (Lin et al., 2014), and YouTube-VIS (Yang et al., 2019) provide accurate and rich information for deep learning models. Moreover, aided by advanced designs of deep neural networks (He et al., 2016; Xie et al., 2017; Liu et al., 2021; Liu et al., 2022), the capacity of deep models has greatly increased. At the same time, instance localization and recognition, as the core of the modern visual system, has many downstream applications, e.g., autonomous driving, augmented reality, virtual reality, and smart cities, and thanks to the advances of deep learning in the last decade, those applications have made great progress recently. In this thesis, we introduce a series of published works that improve the performance of instance localization and address issues in modeling instance localization and recognition using deep learning models. We also discuss future directions and some potential research projects.
  • Item
    Reasoning about Geometric Object Interactions in 3D for Manipulation Action Understanding
    (2019) Zampogiannis, Konstantinos; Aloimonos, Yiannis; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    In order to efficiently interact with human users, intelligent agents and autonomous systems need the ability of interpreting human actions. We focus our attention on manipulation actions, wherein an agent typically grasps an object and moves it, possibly altering its physical state. Agent-object and object-object interactions during a manipulation are a defining part of the performed action itself. In this thesis, we focus on extracting semantic cues, derived from geometric object interactions in 3D space during a manipulation, that are useful for action understanding at the cognitive level. First, we introduce a simple grounding model for the most common pairwise spatial relations between objects and investigate the descriptive power of their temporal evolution for action characterization. We propose a compact, abstract action descriptor that encodes the geometric object interactions during action execution, as captured by the spatial relation dynamics. Our experiments on a diverse dataset confirm both the validity and effectiveness of our spatial relation models and the discriminative power of our representation with respect to the underlying action semantics. Second, we model and detect lower level interactions, namely object contacts and separations, viewing them as topological scene changes within a dense motion estimation setting. In addition to improving motion estimation accuracy in the challenging case of motion boundaries induced by these events, our approach shows promising performance in the explicit detection and classification of the latter. Building upon dense motion estimation and using detected contact events as an attention mechanism, we propose a bottom-up pipeline for the guided segmentation and rigid motion extraction of manipulated objects. Finally, in addition to our methodological contributions, we introduce a new open-source software library for point cloud data processing, developed for the needs of this thesis, which aims at providing an easy to use, flexible, and efficient framework for the rapid development of performant software for a range of 3D perception tasks.
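    As an illustration of the general idea of encoding geometric object interactions as the temporal evolution of pairwise spatial relations, the following sketch uses assumptions of my own (axis-aligned bounding boxes, a contact distance threshold, and only three relations); it is not the grounding model proposed in the thesis.

    ```python
    # Hypothetical sketch: per-frame spatial relations between two tracked objects,
    # compressed into a descriptor that keeps only frames where the relations change.
    import numpy as np

    def pairwise_relations(center_a, center_b, extent_a, extent_b, contact_eps=0.01):
        """Return [above(a,b), below(a,b), contact(a,b)] for one frame.
        Inputs are 3D numpy arrays: box centers (z up) and axis-aligned half-extents."""
        above = center_a[2] - extent_a[2] >= center_b[2] + extent_b[2]
        below = center_b[2] - extent_b[2] >= center_a[2] + extent_a[2]
        gap = np.maximum(np.abs(center_a - center_b) - (extent_a + extent_b), 0.0)
        contact = np.linalg.norm(gap) <= contact_eps
        return np.array([above, below, contact], dtype=np.uint8)

    def action_descriptor(track_a, track_b, extent_a, extent_b):
        """Stack per-frame relation vectors, then keep only frames where a relation
        changes, yielding a compact, duration-invariant encoding of the interaction."""
        frames = np.stack([pairwise_relations(a, b, extent_a, extent_b)
                           for a, b in zip(track_a, track_b)])
        keep = np.ones(len(frames), dtype=bool)
        keep[1:] = np.any(frames[1:] != frames[:-1], axis=1)   # drop repeated states
        return frames[keep]
    ```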
  • Item
    Modeling Deep Context in Spatial and Temporal Domain
    (2018) Dai, Xiyang; Davis, Larry S.; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Context has been one of the most important aspects of computer vision research because it provides useful guidance for solving various tasks in both the spatial and temporal domains. With the recent rise of deep learning methods, deep networks have shown impressive performance on many computer vision tasks. Modeling deep context explicitly and implicitly in deep networks can further boost the effectiveness and efficiency of deep models. In the spatial domain, implicitly modeling context can be useful for learning discriminative texture representations. We present an effective deep fusion architecture that captures both the first- and second-order statistics of texture features. Meanwhile, explicitly modeling context can also be important for challenging tasks such as fine-grained classification. We then present a deep multi-task network that explicitly captures geometric constraints by simultaneously conducting fine-grained classification and key-point localization. In the temporal domain, explicitly modeling context can be crucial for activity recognition and localization. We present a temporal context network that explicitly captures the relative context around a proposal, sampling two temporal scales pair-wise for precise temporal localization of human activities. Meanwhile, implicitly modeling context can lead to better network architectures for video applications. We then present a temporal aggregation network that learns a deep hierarchical representation for capturing temporal consistency. Finally, we conduct research on jointly modeling context in both the spatial and temporal domains for human action understanding, which requires predicting where, when, and what a human action happens in a crowded scene. We present a decoupled framework with dedicated branches for spatial localization and temporal recognition; context in the spatial and temporal branches is modeled explicitly and fused later to generate final predictions.
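    The fusion of first- and second-order statistics of texture features can be illustrated with a short sketch; the pooling shown here (channel means plus a signed-square-root-normalized covariance of convolutional features) is a generic stand-in for the architecture described above, not the author's network.

    ```python
    # Hypothetical sketch of pooling first- and second-order statistics of feature maps.
    import torch

    def first_and_second_order_pool(feat: torch.Tensor) -> torch.Tensor:
        """feat: (B, C, H, W) feature maps. Returns a concatenated [mean, covariance] descriptor."""
        b, c, h, w = feat.shape
        x = feat.reshape(b, c, h * w)                        # each spatial location is a sample
        mean = x.mean(dim=2)                                 # first-order statistics: (B, C)
        centered = x - mean.unsqueeze(2)
        cov = torch.bmm(centered, centered.transpose(1, 2)) / (h * w - 1)   # second-order: (B, C, C)
        cov = torch.sign(cov) * torch.sqrt(cov.abs() + 1e-12)               # signed sqrt normalization
        return torch.cat([mean, cov.reshape(b, c * c)], dim=1)              # fused descriptor
    ```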
  • Item
    FINDING OBJECTS IN COMPLEX SCENES
    (2018) Sun, Jin; Jacobs, David; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Object detection is one of the fundamental problems in computer vision and has great practical impact. Current object detectors work well under certain conditions. However, challenges arise when scenes become more complex. Scenes are often cluttered, and object detectors trained on Internet-collected data fail when there are large variations in objects' appearance. We believe the key to tackling those challenges is to understand the rich context of objects in scenes, which includes: the appearance variations of an object due to viewpoint and lighting condition changes; the relationships between objects and their typical environment; and the composition of multiple objects in the same scene. This dissertation aims to study the complexity of scenes from those aspects. To facilitate collecting training data with large variations, we design a novel user interface, ARLabeler, utilizing the power of Augmented Reality (AR) devices. Instead of labeling images from the Internet passively, we put an observer in the real world with full control over the scene complexities. Users walk around freely and observe objects from multiple angles. Lighting can be adjusted. Objects can be added to and/or removed from the scene to create rich compositions. Our tool opens new possibilities for preparing data for complex scenes. We also study challenges in deploying object detectors in real-world scenes: detecting curb ramps in street view images. A system, Tohme, is proposed to combine detection results from detectors with human crowdsourced verification. One core component is a meta-classifier that estimates the complexity of a scene and assigns it to humans (accurate but costly) or computers (low cost but error-prone) accordingly. One of the insights from Tohme is that context is crucial in detecting objects. To understand the complex relationship between objects and their environment, we propose a standalone context model that predicts where an object can occur in an image. By combining this model with object detection, we can find regions where an object is missing; the model can also be used to find out-of-context objects. To take a step beyond single-object-based detection, we explicitly model the geometric relationships between groups of objects and use the layout information to represent scenes as a whole. We show that such a strategy is useful in retrieving indoor furniture scenes with natural language inputs.
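    As a rough illustration of how a standalone context model could be combined with a detector to flag missing or out-of-context objects, the following sketch assumes two hypothetical per-region scores and simple thresholds; it is not the system built in the dissertation.

    ```python
    # Hypothetical sketch: cross the detector's confidence with a context model's
    # "is this a plausible location for the object?" prediction.
    def analyze_region(det_score: float, context_score: float,
                       det_thresh: float = 0.5, ctx_thresh: float = 0.5) -> str:
        """det_score: detector confidence that the object is present in the region.
        context_score: context model's probability that the region is a plausible location."""
        if det_score >= det_thresh and context_score < ctx_thresh:
            return "out-of-context object"       # detected where it does not belong
        if det_score < det_thresh and context_score >= ctx_thresh:
            return "possible missing object"     # expected here but not detected
        if det_score >= det_thresh:
            return "object in expected context"
        return "background"
    ```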
  • Item
    Detecting Objects and Actions with Deep Learning
    (2018) Singh, Bharat; Davis, Larry S; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Deep-learning-based visual recognition and localization is one of the pillars of computer vision and is the driving force behind applications like self-driving cars, visual search, video surveillance, and augmented reality, to name a few. This thesis identifies key bottlenecks in state-of-the-art visual recognition pipelines that use convolutional neural networks and proposes effective solutions to push their limits. A few shortcomings of convolutional neural networks are a lack of scale invariance, which poses a challenge for tasks like object detection; the fixed structure of the network, which restricts its use when presented with new class labels; and difficulty in modeling long-range spatial/temporal dependencies. We provide evidence of these problems and then design effective solutions to overcome them. In the first part, an analysis of different techniques for recognizing and detecting objects under extreme scale variation is presented. Since small and large objects are difficult to recognize at smaller and larger scales of an image pyramid, respectively, we present a novel training scheme called Scale Normalization for Image Pyramids (SNIP) which selectively back-propagates the gradients of object instances of different sizes as a function of the image scale. As SNIP ignores gradients of objects at extreme resolutions, following up on this idea, we developed SNIPER (Scale Normalization for Image Pyramids with Efficient Re-sampling), an algorithm for performing efficient multi-scale training for instance-level visual recognition tasks. Instead of processing every pixel in an image pyramid, SNIPER processes context regions (512x512 pixels) around ground-truth instances at the appropriate scale. For background sampling, these context regions are generated using proposals extracted from a region proposal network trained with a short learning schedule. Hence, the number of chips generated per image during training adaptively changes based on the scene complexity. SNIPER brings training of instance-level recognition tasks like object detection closer to the protocol for image classification and suggests that the commonly accepted guideline that it is important to train on high-resolution images for instance-level visual recognition tasks might not be correct. Next, we present a real-time large-scale object detector (R-FCN-3000) for detecting thousands of classes where objectness detection and classification are decoupled. To obtain the detection score for an RoI, we multiply the objectness score with the fine-grained classification score. We show that the objectness learned by R-FCN-3000 generalizes to novel classes and that the performance increases with the number of training object classes, supporting the hypothesis that it is possible to learn a universal objectness detector. Because of generalized objectness, we can train object detectors for new classes with just classification data, without even requiring bounding boxes. Finally, we present a multi-stream bi-directional recurrent neural network for action detection. This was the first deep-learning-based system that could perform action localization in long videos, and it could do so with just RGB data, without requiring any skeletal models or performing intermediate tasks like pose estimation. Our system uses a tracking algorithm to locate a bounding box around the person, which provides a frame of reference for appearance and motion while suppressing background noise that is not within the bounding box. We train two additional streams on motion and appearance cropped to the tracked bounding box, along with full-frame streams. To model long-term temporal dynamics within and between actions, the multi-stream CNN is followed by a bi-directional Long Short-Term Memory (LSTM) layer. We show that our bi-directional LSTM network utilizes about 8 seconds of the video sequence to predict an action label and outperforms state-of-the-art methods on multiple benchmarks.
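    The selective back-propagation rule at the heart of SNIP-style multi-scale training can be sketched as follows; the valid size ranges and the sqrt-area size measure used here are illustrative assumptions rather than the exact values from the thesis.

    ```python
    # Hypothetical sketch: at each image-pyramid scale, only ground-truth instances whose
    # rescaled size falls inside a valid range contribute gradients during training.
    import math

    VALID_RANGES = {          # (min_px, max_px) of box sqrt-area trained at each scale (assumed)
        0.5: (120, math.inf), # coarse scale: only large objects
        1.0: (40, 180),
        2.0: (0, 80),         # fine scale: only small objects
    }

    def select_instances(gt_boxes, scale):
        """gt_boxes: list of (x1, y1, x2, y2) in original-image pixels.
        Returns indices of boxes that should back-propagate gradients at this pyramid scale."""
        lo, hi = VALID_RANGES[scale]
        keep = []
        for i, (x1, y1, x2, y2) in enumerate(gt_boxes):
            size = math.sqrt((x2 - x1) * (y2 - y1)) * scale   # object size after rescaling
            if lo <= size <= hi:
                keep.append(i)
        return keep
    ```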
  • Item
    Seeing Behind The Scene: Using Symmetry To Reason About Objects in Cluttered Environments
    (2017) Ecins, Aleksandrs; Aloimonos, Yiannis; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Rapid advances in robotic technology are bringing robots out of the controlled environments of assembly lines and factories into the unstructured and unpredictable real-life workspaces of human beings. One of the prerequisites for operating in such environments is the ability to grasp previously unobserved physical objects. To achieve this, individual objects have to be delineated from the rest of the environment and their shape properties estimated from incomplete observations of the scene. This remains a challenging task due to the lack of prior information about the shape and pose of the object as well as occlusions in cluttered scenes. We attempt to solve this problem by utilizing the powerful concept of symmetry. Symmetry is ubiquitous in both natural and man-made environments. It reveals redundancies in the structure of the world around us and thus can be used in a variety of visual processing tasks. In this thesis we propose a complete pipeline for detecting symmetric objects and recovering their rotational and reflectional symmetries from 3D reconstructions of natural scenes. We begin by obtaining a multiple-view 3D pointcloud of the scene using the Kinect Fusion algorithm. Additionally, a voxelized occupancy map of the scene is extracted in order to reason about occlusions. We propose two classes of algorithms for symmetry detection: curve-based and surface-based. The curve-based algorithm relies on extracting and matching surface normal edge curves in the pointcloud, while a more efficient surface-based algorithm works by fitting symmetry axes/planes to the geometry of the smooth surfaces of the scene. In order to segment the objects, we introduce a segmentation approach that uses symmetry as a global grouping principle: it extracts points of the scene that are consistent with a given symmetry candidate. To evaluate the performance of our symmetry detection and segmentation algorithms, we construct a dataset of cluttered tabletop scenes with ground truth object masks and corresponding symmetries. Finally, we demonstrate how our pipeline can be used by a mobile robot to detect and grasp objects in a house scenario.
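    One way to make the surface-based idea concrete is to score a candidate reflectional symmetry plane by mirroring the point cloud across it and checking how much of the mirrored cloud lands near the original. The scoring rule and threshold below are assumptions for illustration, not the thesis pipeline.

    ```python
    # Hypothetical sketch: evaluate a candidate reflectional symmetry plane for a point cloud.
    import numpy as np
    from scipy.spatial import cKDTree

    def reflect(points, plane_point, plane_normal):
        """Reflect Nx3 points across the plane through plane_point with normal plane_normal."""
        n = plane_normal / np.linalg.norm(plane_normal)
        d = (points - plane_point) @ n                  # signed distance of each point to the plane
        return points - 2.0 * d[:, None] * n

    def symmetry_score(points, plane_point, plane_normal, inlier_dist=0.005):
        """Fraction of points whose mirror image has a nearby original point (higher is better)."""
        mirrored = reflect(points, plane_point, plane_normal)
        dists, _ = cKDTree(points).query(mirrored, k=1)
        return float(np.mean(dists < inlier_dist))
    ```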
  • Item
    Recognizing Visual Categories by Commonality and Diversity
    (2015) Choi, Jonghyun; Davis, Larry Steven; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Visual categories refer to categories of objects or scenes in the computer vision literature. Building a well-performing classifier for visual categories is challenging because it requires a high level of generalization, as the categories have large within-class variability. We present several methods to build generalizable classifiers for visual categories by exploiting the commonality and diversity of labeled samples and the category definitions to improve category classification accuracy. First, we describe a method to discover and add unlabeled samples from auxiliary sources to categories of interest for building better classifiers. In the literature, given a pool of unlabeled samples, the samples to be added are usually discovered based on low-level visual signatures such as edge statistics, shape, or color by an unsupervised or semi-supervised learning framework. This method is inexpensive as it does not require human intervention, but generally does not provide useful information for accuracy improvement as the selected samples are visually similar to the existing set of samples. The samples added by active learning, on the other hand, provide different visual aspects to categories and contribute to learning a better classifier, but are expensive as they need human labeling. To obtain high-quality samples with less annotation cost, we present a method to discover and add samples from unlabeled image pools that are visually diverse but coherent with the category definition by using higher-level visual aspects, captured by a set of learned attributes. The method significantly improves the classification accuracy over the baselines without human intervention. Second, we describe how to learn an ensemble of classifiers that captures both commonly shared information and diversity among the training samples. To learn such ensemble classifiers, we first discover discriminative sub-categories of the labeled samples for diversity. We then learn an ensemble of discriminative classifiers with a constraint that minimizes the rank of the stacked matrix of classifiers. The resulting set of classifiers both shares the category-wide commonality and preserves the diversity of subcategories. The proposed ensemble classifier improves recognition accuracy significantly over the baselines and state-of-the-art subcategory-based ensemble classifiers, especially for challenging categories. Third, we explore the commonality and diversity of semantic relationships of category definitions to improve classification accuracy in an efficient manner. Specifically, our classification model identifies the most helpful relational semantic queries to discriminatively refine the model by a small amount of semantic feedback in interactive iterations. We improve the classification accuracy on challenging categories that have very small numbers of training samples via knowledge transferred from other related categories that have a larger number of training samples, by solving a semantically constrained transfer learning optimization problem. Finally, we summarize the ideas presented and discuss possible future work.
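    The rank-minimizing constraint on the stacked matrix of sub-category classifiers can be illustrated with the standard proximal step for a nuclear-norm penalty (singular-value soft-thresholding); this generic sketch is not the optimization procedure used in the dissertation.

    ```python
    # Hypothetical sketch: pull sub-category classifiers toward a shared low-rank subspace.
    import numpy as np

    def nuclear_norm_prox(W: np.ndarray, tau: float) -> np.ndarray:
        """W: (num_subcategory_classifiers, feature_dim) stacked classifier weights.
        Shrinks singular values by tau, encouraging category-wide shared structure."""
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        s_shrunk = np.maximum(s - tau, 0.0)
        return (U * s_shrunk) @ Vt

    # In a full training loop this step would alternate with a gradient step on each
    # sub-category classifier's own discriminative (e.g. hinge) loss.
    ```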
  • Item
    FEATURE LEARNING AND ACTIVE LEARNING FOR IMAGE QUALITY ASSESSMENT
    (2014) Ye, Peng; Chellappa, Rama; Doermann, David; Electrical Engineering; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    With the increasing popularity of mobile imaging devices, digital images have become an important vehicle for representing and communicating information. Unfortunately, digital images may be degraded at various stages of their life cycle. These degradations may lead to the loss of visual information, resulting in an unsatisfactory experience for human viewers and difficulties for image processing and analysis at subsequent stages. The problem of visual information quality assessment plays an important role in numerous image/video processing and computer vision applications, including image compression, image transmission, and image retrieval. There are two divisions of Image Quality Assessment (IQA) research: Objective IQA and Subjective IQA. For objective IQA, the goal is to develop a computational model that can predict the quality of a distorted image with respect to human perception or other measures of interest accurately and automatically. For subjective IQA, the goal is to design experiments for acquiring human subjects' opinions on image quality. It is often used to construct image quality datasets and provide the ground truth for building and evaluating objective quality measures. In this thesis, we address these two aspects of the IQA problem. For objective IQA, our work focuses on the most challenging category of objective IQA tasks - general-purpose No-Reference IQA (NR-IQA), where the goal is to evaluate the quality of digital images without access to reference images and without prior knowledge of the types of distortions. First, we introduce a feature learning framework for NR-IQA. Our method learns discriminative visual features in the spatial domain instead of using hand-crafted features. It can therefore significantly reduce the feature computation time compared to previous state-of-the-art approaches while achieving state-of-the-art performance in prediction accuracy. Second, we present an effective method for extending existing NR-IQA models to "Opinion-Free" (OF) models which do not require human opinion scores for training. In particular, we accomplish this by using Full-Reference (FR) IQA measures to train NR-IQA models. Unsupervised rank aggregation is applied to combine different FR measures to generate a synthetic score, which serves as a better "gold standard". Our method significantly outperforms previous OF NR-IQA methods and is comparable to state-of-the-art NR-IQA methods trained on human opinion scores. Unlike objective IQA, subjective IQA tests ask humans to evaluate image quality and are generally considered the most reliable way to evaluate the visual quality of digital images as perceived by the end user. We present a hybrid subjective test which combines Absolute Categorical Rating (ACR) tests and Paired Comparison (PC) tests via a unified probabilistic model and an active sampling method. Our method actively constructs a set of queries consisting of ACR and PC tests based on the expected information gain provided by each test and can effectively reduce the number of tests required to achieve a target accuracy. Our method can be used in conventional laboratory studies as well as crowdsourcing experiments. Experimental results show our method outperforms state-of-the-art subjective IQA tests in a crowdsourced setting.
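    The "opinion-free" training idea can be sketched as follows: aggregate several full-reference measures into a synthetic score by unsupervised rank aggregation and use it as the training target for an NR-IQA model. The simple Borda-style average of ranks below is an illustrative stand-in, not necessarily the aggregation method used in the thesis.

    ```python
    # Hypothetical sketch: build a synthetic "gold standard" from several FR-IQA measures.
    import numpy as np

    def aggregate_fr_scores(fr_scores: np.ndarray, higher_is_better: np.ndarray) -> np.ndarray:
        """fr_scores: (num_images, num_fr_measures) matrix of FR-IQA outputs (e.g. PSNR, SSIM).
        higher_is_better: boolean flag per measure. Returns one synthetic score per image."""
        n_images, n_measures = fr_scores.shape
        ranks = np.zeros_like(fr_scores, dtype=float)
        for j in range(n_measures):
            col = fr_scores[:, j] if higher_is_better[j] else -fr_scores[:, j]
            ranks[:, j] = col.argsort().argsort()        # 0 = worst image, n_images - 1 = best
        return ranks.mean(axis=1) / (n_images - 1)       # normalized synthetic quality score
    ```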