Computer Science Theses and Dissertations

Permanent URI for this collection: http://hdl.handle.net/1903/2756

Search Results

Now showing 1 - 6 of 6
  • Efficient Detection of Objects and Faces with Deep Learning
    (2020) Najibi, Mahyar; Davis, Larry S.; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Object detection is a fundamental problem in computer vision and an essential building block for many applications such as autonomous driving, visual search, and object tracking. Given these large-scale and real-time applications, scalable training and fast inference are critical. Deep neural networks, although powerful in visual recognition, can be computationally expensive. In addition, they suffer from shortcomings, such as a lack of scale invariance and inaccurate predictions in crowded scenes, that can affect detection. This dissertation studies the intrinsic problems that emerge when deep convolutional neural networks are used for object and face detection. We introduce methods that overcome these issues and are not only accurate but also efficient. First, we focus on the lack of scale invariance. Performing inference on a multi-scale image pyramid, although effective, increases computation noticeably. Moreover, multi-scale inference pays off mainly when the model is also trained using expensive multi-scale approaches. As a result, we start by introducing an efficient multi-scale training algorithm called "SNIPER" (Scale Normalization for Image Pyramids with Efficient Re-sampling). Based on the ground-truth annotations, SNIPER sparsely samples high-resolution image regions wherever needed. In contrast to training, at inference there is no ground-truth information to guide region sampling. Thus, we propose "AutoFocus", which predicts regions to be zoomed in on from low resolutions at inference time, making it possible to skip a large portion of the input pyramid. While being as efficient as single-scale detectors, these methods boost performance noticeably. Second, we study the problem of efficient face detection. Compared to generic objects, faces are rigid, and crowded scenes containing hundreds of faces at extreme scales are more common. In this dissertation, we present "SSH" (Single Stage Headless Face Detector), a method that, unlike two-stage localization/classification detectors, performs both tasks in a single stage, efficiently models scale variation by design, and removes most of the parameters from its underlying network, yet still achieves state-of-the-art results on challenging benchmarks. Furthermore, for the two-stage detection paradigm, we introduce "FA-RPN" (Floating Anchor Region Proposal Network). FA-RPN takes the spatial structure of faces into account and allows modification of the prediction density during inference to efficiently deal with crowded scenes. Finally, we turn our attention to the first step in two-stage localization/classification detectors. While neural networks were deployed for classification, localization was previously solved using classic algorithms, which became the bottleneck. To remedy this, we propose "G-CNN", which models localization as a search in the space of all possible bounding boxes and deploys the same neural network used for classification. Furthermore, for tasks such as saliency detection, where the number of predictions is typically small, we develop an alternative approach that runs at speeds close to 120 frames per second.
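To make the multi-scale training idea above concrete, the sketch below shows one way SNIPER-style positive region ("chip") sampling could look: at a given pyramid resolution, only ground-truth boxes whose scale falls inside a valid range are covered by fixed-size chips, and the rest are left to other resolutions. This is a minimal illustration, not the dissertation's implementation; the function names, chip size, and scale range are assumptions chosen for readability.

```python
# Minimal sketch of scale-aware positive chip sampling (illustrative only;
# chip size and valid scale range are assumed values, not SNIPER's settings).
import numpy as np

def box_scale(box):
    """Geometric-mean side length of an (x1, y1, x2, y2) box."""
    w, h = box[2] - box[0], box[3] - box[1]
    return np.sqrt(max(w, 1.0) * max(h, 1.0))

def sample_positive_chips(gt_boxes, image_size, chip_size=512, valid_range=(64, 256)):
    """Return fixed-size chips covering ground-truth boxes whose scale lies in
    `valid_range` at the current resolution; out-of-range boxes are skipped."""
    H, W = image_size
    lo, hi = valid_range
    chips = []
    for box in gt_boxes:
        if not (lo <= box_scale(box) <= hi):
            continue  # handled at another pyramid level
        cx, cy = 0.5 * (box[0] + box[2]), 0.5 * (box[1] + box[3])
        x1 = int(np.clip(cx - chip_size / 2, 0, max(W - chip_size, 0)))
        y1 = int(np.clip(cy - chip_size / 2, 0, max(H - chip_size, 0)))
        chips.append((x1, y1, x1 + chip_size, y1 + chip_size))
    return chips

# Only the first box is in range at this resolution, so one chip is produced.
print(sample_positive_chips([(30, 40, 150, 200), (5, 5, 20, 20)], image_size=(800, 1200)))
```

In actual training, each sampled chip would be cropped and fed to the detector as an independent training sample, which is what keeps the cost low compared to processing the full image pyramid.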
  • Improving Efficiency for Object Detection and Temporal Modeling for Action Localization
    (2019) Gao, Mingfei; Davis, Larry S; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Despite their great predictive capability, Convolutional Neural Networks (CNNs) are computationally expensive to deploy and usually require a tremendous amount of annotated data at training time. When analyzing videos, modeling temporal dynamics is both important and challenging due to large appearance variation and complex semantics. We propose methods to improve the efficiency of model deployment for object detection in images and to capture temporal dependencies for online action detection in videos. To reduce the demand for human annotation labor, we introduce approaches that conduct object detection and natural language localization using weak supervision. First, we introduce a generic framework that reduces the computational cost of object detection while retaining accuracy for scenarios where objects with varied sizes appear in high-resolution images. Detection progresses in a coarse-to-fine manner, first on a down-sampled version of the image and then on a sequence of higher-resolution regions identified as likely to improve the detection accuracy. Built upon reinforcement learning, our approach consists of a model (R-net) that uses coarse detection results to predict the potential accuracy gain for analyzing a region at a higher resolution, and another model (Q-net) that sequentially selects regions to zoom in on. Second, we propose a novel framework, the Temporal Recurrent Network (TRN), to model greater temporal context of a video frame by simultaneously performing online action detection and anticipation of the immediate future. At each moment in time, our approach makes use of both accumulated historical evidence and predicted future information to better recognize the action that is currently occurring, and integrates both into a unified end-to-end architecture. We evaluate our approach on two popular online action detection datasets, HDD and TVSeries, as well as another widely used dataset, THUMOS’14. Third, we propose StartNet to address Online Detection of Action Start (ODAS), where action starts and their associated categories are detected in untrimmed, streaming videos. Our method decomposes ODAS into two stages: action classification (using ClsNet) and start point localization (using LocNet). ClsNet focuses on per-frame labeling and predicts action score distributions online. Based on the predicted action scores of the past and current frames, LocNet conducts class-agnostic start detection by optimizing long-term localization rewards using policy gradient methods. The proposed framework is validated on two large-scale datasets, THUMOS’14 and ActivityNet. Fourth, we introduce Count-guided Weakly Supervised Localization (C-WSL), an approach that uses per-class object counts as a new form of supervision to improve Weakly Supervised Localization (WSL). C-WSL uses a simple count-based region selection algorithm to select high-quality regions, each of which covers a single object instance during training, and improves existing WSL methods by training with the selected regions. To demonstrate the effectiveness of C-WSL, we integrate it into two WSL architectures and conduct extensive experiments on VOC2007 and VOC2012. Finally, we propose Weakly Supervised Language Localization Networks (WSLLN) to detect events in long, untrimmed videos given language queries. WSLLN relieves the annotation burden by training with only video-sentence pairs, without access to the temporal locations of events. With a simple end-to-end structure, WSLLN measures segment-text consistency and conducts segment selection (conditioned on the text) simultaneously. Results from both are merged and optimized as a video-sentence matching problem. Experiments are conducted on ActivityNet Captions and DiDeMo.
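The R-net/Q-net pipeline described above amounts to spending a limited zoom-in budget on the regions with the largest predicted accuracy gain. The toy sketch below illustrates only that selection step: given a grid of (hypothetical) predicted gains, it greedily picks a few non-overlapping windows to re-process at high resolution. The gain values, window size, and budget are made-up placeholders, not outputs of the actual R-net or Q-net.

```python
# Toy sketch of budget-limited zoom-in region selection (not the dissertation's
# reinforcement-learning formulation; the gain map below is invented).
import numpy as np

def select_zoom_regions(gain_map, region_size=2, budget=2):
    """Greedily pick non-overlapping region_size x region_size windows with the
    highest total predicted accuracy gain, up to `budget` windows."""
    gain = gain_map.astype(float).copy()
    picks = []
    for _ in range(budget):
        best, best_score = None, -np.inf
        for y in range(gain.shape[0] - region_size + 1):
            for x in range(gain.shape[1] - region_size + 1):
                s = gain[y:y + region_size, x:x + region_size].sum()
                if s > best_score:
                    best, best_score = (y, x), s
        if best is None or best_score <= 0:
            break
        y, x = best
        picks.append((y, x, y + region_size, x + region_size))
        gain[y:y + region_size, x:x + region_size] = 0  # suppress the chosen window
    return picks

# Example: a 4x4 grid of predicted gains; the two highest-gain windows are returned.
gains = np.array([[0.1, 0.0, 0.2, 0.9],
                  [0.0, 0.0, 0.8, 0.7],
                  [0.5, 0.4, 0.0, 0.0],
                  [0.6, 0.3, 0.0, 0.1]])
print(select_zoom_regions(gains))
```

In the full framework, the selected windows would be cropped from the original high-resolution image and passed through the detector again, and the selection itself would be learned sequentially with reinforcement learning rather than performed greedily.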
  • Context Driven Scene Understanding
    (2015) Chen, Xi; Davis, Larry S; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Understanding objects in complex scenes is a fundamental and challenging problem in computer vision. Given an image, we would like to answer the questions of whether there is an object of a particular category in the image, where it is, and, if possible, locate it with a bounding box or pixel-wise labels. In this dissertation, we present context-driven approaches that leverage relationships between objects in the scene to improve both the accuracy and efficiency of scene understanding. In the first part, we describe an approach to jointly solve the segmentation and recognition problem using a multiple segmentation framework with context. Our approach formulates a cost function based on contextual information in conjunction with appearance matching. This relaxed cost function formulation is minimized using an efficient quadratic programming solver, and an approximate solution is obtained by discretizing the relaxed solution. Our approach improves labeling performance compared to other segmentation-based recognition approaches. Secondly, we introduce a new problem called object co-labeling, where the goal is to jointly annotate multiple images of the same scene which do not have temporal consistency. We present an adaptive framework for joint segmentation and recognition to solve this problem. We propose an objective function that considers not only appearance but also appearance and context consistency across images of the scene. A relaxed form of the cost function is minimized using an efficient quadratic programming solver. Our approach improves labeling performance compared to labeling each image individually. We also show the application of our co-labeling framework to other recognition problems such as label propagation in videos and object recognition in similar scenes. In the third part, we propose a novel general strategy for simultaneous object detection and segmentation. Instead of passively evaluating all object detectors at all possible locations in an image, we develop a divide-and-conquer approach by actively and sequentially evaluating contextual cues related to the query based on the scene and previous evaluations---like playing a ``20 Questions'' game---to decide where to search for the object. Such questions are dynamically selected based on the query, the scene, and the currently observed responses given by object detectors and classifiers. We first present an efficient object search policy based on the information gain of asking a question. We formulate the policy in a probabilistic framework that integrates current information and observations to update the model and determine the most informative action to take next. We further enrich the power and generalization capacity of the Twenty Questions strategy by learning the Twenty Questions policy from data. We formulate the problem as a Markov Decision Process and learn a search policy by imitation learning.
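The ``20 Questions'' search strategy above chooses which detector or contextual cue to evaluate next by how much it is expected to reduce uncertainty. The snippet below is a small, self-contained illustration of that criterion for a binary-answer question over a handful of location hypotheses; the priors and answer likelihoods are invented for the example and do not come from the dissertation.

```python
# Expected information gain of asking a yes/no question about an object's
# location, over a toy set of hypotheses (values below are made up).
import math

def entropy(p):
    return -sum(x * math.log(x, 2) for x in p if x > 0)

def expected_information_gain(prior, likelihood_yes):
    """prior[h]: P(hypothesis h); likelihood_yes[h]: P(answer = yes | h)."""
    n = len(prior)
    p_yes = sum(prior[h] * likelihood_yes[h] for h in range(n))
    p_no = 1.0 - p_yes
    post_yes = [prior[h] * likelihood_yes[h] / p_yes for h in range(n)] if p_yes > 0 else prior
    post_no = [prior[h] * (1 - likelihood_yes[h]) / p_no for h in range(n)] if p_no > 0 else prior
    return entropy(prior) - (p_yes * entropy(post_yes) + p_no * entropy(post_no))

# Three location hypotheses and two candidate questions; the question whose
# answer is most informative about the hypotheses is asked first.
prior = [0.5, 0.3, 0.2]
questions = {"table_nearby": [0.9, 0.2, 0.1], "weak_cue": [0.5, 0.5, 0.4]}
best = max(questions, key=lambda q: expected_information_gain(prior, questions[q]))
print(best)
```

In the probabilistic framework described in the abstract, the posterior after each answer would be fed back in and the process repeated, and the later imitation-learning variant replaces this hand-crafted criterion with a policy learned from data.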
  • Understanding Objects in the Visual World
    (2015) Ahmed, Ejaz; Davis, Larry S; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    One way to understand the visual world is by reasoning about the objects present in it: their type, their location, their similarities, their layout, etc. Despite several successes, detailed recognition remains a challenging task for current computer vision systems. This dissertation focuses on building systems that improve on the state of the art on several fronts. On one hand, we propose better representations of visual categories that enable more accurate reasoning about their properties. To learn such representations, we employ machine learning methods that leverage the power of big data. On the other hand, we present solutions that make current frameworks more efficient without losing performance. The first part of the dissertation focuses on improvements in efficiency. We first introduce a fast automated mechanism for selecting a diverse set of discriminative filters and show that one can efficiently learn a universal model of filter "goodness" based on properties of the filter itself. As an alternative to the expensive evaluation of filters, which is often the bottleneck in many techniques, our method has the potential to dramatically alter the trade-off between the accuracy of a filter-based method and the cost of training. Second, we present a method for linear dimensionality reduction which we call composite discriminant factor analysis (CDF). CDF searches for a discriminative but compact feature subspace in which classifiers can be trained, leading to an order-of-magnitude saving in detection time. In the second part, we focus on the problem of person re-identification, an important component of surveillance systems. We present a deep learning architecture that simultaneously learns features and computes their corresponding similarity metric. Given a pair of images as input, our network outputs a similarity value indicating whether the two input images depict the same person. We propose new layers that capture local relationships among mid-level features, produce a high-level summary of these relationships, and spatially integrate them to give a holistic representation. In the final part, we present a semantic object selection framework that uses natural language input to perform image editing. In the general context of interactive object segmentation, many of the methods that utilize user input (such as mouse clicks and mouse strokes) require significant user intervention. In this work, we present a system with a far simpler input method: the user only needs to give the name of the desired object. For this problem we present a solution that borrows ideas from image retrieval, segmentation propagation, object localization, and convolutional neural networks.
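As a rough illustration of the efficiency argument behind a compact discriminative subspace, the snippet below projects 50-dimensional features onto a single discriminative direction before scoring. It uses ordinary Fisher-style linear discriminant analysis as a generic stand-in, not the dissertation's composite discriminant factor analysis (CDF), and the data and dimensions are synthetic.

```python
# Generic stand-in for learning a compact discriminative projection offline,
# so that detection-time scoring becomes a cheap dot product (not CDF itself).
import numpy as np

def fisher_direction(X_pos, X_neg, reg=1e-3):
    """1-D discriminative projection w maximizing class separation."""
    mu_p, mu_n = X_pos.mean(axis=0), X_neg.mean(axis=0)
    Sw = np.cov(X_pos, rowvar=False) + np.cov(X_neg, rowvar=False)
    w = np.linalg.solve(Sw + reg * np.eye(Sw.shape[0]), mu_p - mu_n)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=1.0, size=(100, 50))   # 50-D features, positive class
X_neg = rng.normal(loc=0.0, size=(100, 50))   # negative class
w = fisher_direction(X_pos, X_neg)
# Scoring a detection window is now a single dot product instead of a
# full 50-D classifier evaluation.
print((X_pos @ w).mean() > (X_neg @ w).mean())
```

The point is simply that once such a projection is learned offline, per-window scoring at detection time reduces to a handful of dot products, which is where the order-of-magnitude saving comes from.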
  • DOMAIN ADAPTIVE OBJECT RECOGNITION AND DETECTION
    (2013) Mirrashed, Fatemeh; Davis, Larry S; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    Discriminative learning algorithms rely on the assumption that training and test data are drawn from the same marginal probability distribution. In real-world applications, however, this assumption is often violated, resulting in a significant performance drop. We often have sufficient labeled training data from single or multiple "source" domains but wish to learn a classifier that performs well on a "target" domain with a different distribution and no labeled training data. In visual object detection, for example, where the goal is to locate the objects of interest in a given image, it may be infeasible to collect training data that models the enormous variety of possible combinations of pose, background, resolution, and lighting conditions affecting object appearance. Thus, we generally expect to encounter instances or domains at test time for which we have seen little or no training data. To this end, we first propose a framework for domain adaptive object recognition and detection using Transfer Component Analysis, an unsupervised domain adaptation and dimensionality reduction technique. The idea is to obtain a transformation in feature space to a latent subspace that reduces the distance between the source and target data distributions. We evaluate the effectiveness of this approach for vehicle detection using video frames from 50 different surveillance cameras. Next, we explore the problem of extreme class imbalance that arises when performing fully unsupervised domain adaptation for object detection. The main challenge arises from the fact that images in unconstrained settings are mostly occupied by the background (negative class). Therefore, random sampling will not be effective in obtaining a sufficient number of positive samples from the target domain, which is required by any adaptation method. We propose a variation of the co-learning technique that automatically constructs a more balanced set of samples from the target domain. We compare the performance of our technique with other approaches such as unbiased learning from multiple datasets and self-learning. Finally, we propose a novel approach for unsupervised domain adaptation. Our method learns a set of binary attributes for classification that captures the structural information of the data distribution in the target domain itself. The key insight is to find attributes that are discriminative across categories and predictable across domains. We formulate our optimization problem to learn these attributes and the classifier jointly. We evaluate the performance of our method on a wide range of tasks, including cross-domain object recognition and sentiment analysis on textual data, in both inductive and transductive settings. We achieve performance that significantly exceeds the state-of-the-art results on standard benchmarks. In many cases we reach the same-domain performance, the upper bound, in unsupervised domain adaptation scenarios.
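To give a feel for what "reducing the distance between source and target distributions in feature space" means in practice, the sketch below applies a simple linear correlation-alignment transform to synthetic features. This is a generic stand-in for illustration only; it is not Transfer Component Analysis and does not reproduce the dissertation's method or results.

```python
# Illustrative unsupervised feature-space alignment: whiten source features,
# re-color them with the target covariance, and shift to the target mean.
# (Generic correlation alignment, used here only to convey the idea.)
import numpy as np

def align_source_to_target(Xs, Xt, reg=1e-3):
    Cs = np.cov(Xs, rowvar=False) + reg * np.eye(Xs.shape[1])
    Ct = np.cov(Xt, rowvar=False) + reg * np.eye(Xt.shape[1])

    def sqrt_m(C, inv=False):
        # matrix square root via eigendecomposition (C is symmetric PSD)
        vals, vecs = np.linalg.eigh(C)
        vals = np.clip(vals, 1e-12, None)
        d = vals ** (-0.5 if inv else 0.5)
        return (vecs * d) @ vecs.T

    return (Xs - Xs.mean(0)) @ sqrt_m(Cs, inv=True) @ sqrt_m(Ct) + Xt.mean(0)

rng = np.random.default_rng(1)
Xs = rng.normal(0.0, 1.0, size=(200, 20))   # labeled source features
Xt = rng.normal(2.0, 3.0, size=(200, 20))   # unlabeled target features
Xs_aligned = align_source_to_target(Xs, Xt)
print(np.round(Xs_aligned.std(), 2), np.round(Xt.std(), 2))  # scales now comparable
```

A classifier trained on the aligned source features would then be applied directly to the target domain, which is the general pattern that the unsupervised adaptation methods in the abstract follow.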
  • Resource Allocation in Computer Vision
    (2013) Chen, Daozheng; Jacobs, David W; Computer Science; Digital Repository at the University of Maryland; University of Maryland (College Park, Md.)
    We broadly examine resource allocation in several computer vision problems. We consider human-resource and computational-resource constraints. Human resources, such as human operators monitoring a camera network, provide reliable information but are typically limited by the huge amount of data to be processed. Computational resources refer to the resources used by machines, such as running time, to execute programs. It is important to develop algorithms that make effective use of these resources in computer vision applications. We approach human resource constraints through a frame retrieval problem in a camera network. This work addresses the problem of using active inference to direct human attention in searching a camera network for people that match a query image. We find that by representing the camera network with a graphical model, we can more accurately determine which video frames match the query and improve our ability to direct human attention. We experiment with different methods for determining from which frames to solicit expert human input, and discover that a method that learns to predict which frame is misclassified gives the best performance. We then approach the problem of allocating computational resources in a video processing task. We consider a video processing application in which we combine the outputs of two algorithms so that the budget-limited, computationally more expensive algorithm is run on the most useful video frames to maximize processing performance. We model the video frames as a chain graphical model and extend a dynamic programming algorithm to determine on which frames to run the more expensive algorithm. We perform experiments on moving object detection and face detection to demonstrate the effectiveness of our approaches. Finally, we consider an idea for saving computational resources while maintaining program performance. We work on the problem of learning model complexity in latent variable models. Specifically, we learn the latent variable state space complexity in latent support vector machines using group norm regularization. We apply our method to handwritten digit recognition and object detection with deformable part models. Our approach reduces the latent variable state size and performs faster inference with similar or better performance.
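The budgeted frame-selection problem described above can be pictured with a tiny dynamic program: decide on which frames of a chain to run the expensive algorithm so that total expected gain is maximized, with a discount when consecutive frames are both selected. The per-frame gains, the redundancy discount, and the memoized formulation below are toy assumptions for illustration, not the chain-graphical-model algorithm developed in the dissertation.

```python
# Toy dynamic program: pick frames on which to run an expensive algorithm
# under a budget, with diminishing benefit on consecutive frames.
from functools import lru_cache

def best_frame_selection(gains, budget, redundancy=0.5):
    """Return (best_total_gain, indices_of_frames_to_run)."""
    n = len(gains)

    @lru_cache(maxsize=None)
    def dp(i, b, prev_ran):
        if i == n:
            return 0.0, ()
        # Option 1: skip frame i.
        best_val, best_sel = dp(i + 1, b, False)
        # Option 2: run the expensive algorithm on frame i (if budget remains).
        if b > 0:
            gain = gains[i] * (redundancy if prev_ran else 1.0)
            run_val, run_sel = dp(i + 1, b - 1, True)
            if gain + run_val > best_val:
                best_val, best_sel = gain + run_val, (i,) + run_sel
        return best_val, best_sel

    return dp(0, budget, False)

# Two runs are allowed; frames 0 and 3 are chosen because spending both runs
# on consecutive frames would waste budget on redundant information.
print(best_frame_selection([0.9, 0.8, 0.1, 0.7, 0.6], budget=2))
```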