Improving Efficiency and Generalization of Visual Recognition
dc.contributor.advisor | Davis, Larry S | en_US |
dc.contributor.author | Yu, Ruichi | en_US |
dc.contributor.department | Computer Science | en_US |
dc.contributor.publisher | Digital Repository at the University of Maryland | en_US |
dc.contributor.publisher | University of Maryland (College Park, Md.) | en_US |
dc.date.accessioned | 2019-06-22T05:31:43Z | |
dc.date.available | 2019-06-22T05:31:43Z | |
dc.date.issued | 2018 | en_US |
dc.description.abstract | Deep Neural Networks (DNNs) have large numbers of parameters and high computational cost. This leads to two major challenges: first, training and deploying deep networks is expensive; second, without massive amounts of annotated training data, which are very costly to obtain, DNNs easily over-fit and generalize poorly. We propose approaches to both challenges in the context of specific computer vision problems to improve their efficiency and generalization. First, we study network pruning using neuron importance score propagation. To reduce the significant redundancy in DNNs, we formulate network pruning as a binary integer optimization problem that minimizes the reconstruction error of the final responses produced by the network, and derive a closed-form solution for pruning neurons in earlier layers. Based on this theoretical analysis, we propose the Neuron Importance Score Propagation (NISP) algorithm, which propagates the importance scores of the final responses to every neuron in the network and then prunes neurons across the entire network jointly (see the propagation sketch immediately below). Second, we study visual relationship detection (VRD) with linguistic knowledge distillation. Because the semantic space of visual relationships is huge and training data are limited, especially for long-tail relationships with few instances, detecting visual relationships in images is a challenging problem. To improve predictive capability, especially generalization to unseen relationships, we use linguistic statistics obtained from both training annotations (internal knowledge) and publicly available text, e.g., Wikipedia (external knowledge), to regularize visual model learning. Third, we study the role of context selection in object detection. We investigate why context has had limited utility in object detection by isolating and evaluating the predictive power of different context cues under ideal conditions in which context is provided by an oracle. Based on this study, we propose a region-based context re-scoring method with dynamic context selection that removes noise and emphasizes informative context. Fourth, we study efficient relevant motion event detection in large-scale home surveillance videos. Traditional methods for detecting motion events of objects-of-interest, which rely on object detection and tracking, are extremely slow and require expensive GPU devices. To dramatically speed up relevant motion event detection and improve its performance, we propose ReMotENet, a unified, end-to-end, data-driven network that uses spatial-temporal attention-based 3D ConvNets to jointly model the appearance and motion of objects-of-interest in a video (see the sketch at the end of this record). In the last part, we address the recognition of agent-in-place actions, i.e., actions associated with the agents who perform them and the places where they occur, in the context of outdoor home surveillance. We introduce a representation of the geometry and topology of scene layouts so that a network can generalize from layouts observed in the training set to unseen layouts in the test set. This Layout-Induced Video Representation (LIVR) abstracts away low-level appearance variation and encodes the geometric and topological relationships of places in a specific scene layout. LIVR partitions the semantic features of a video clip by place to force the network to learn place-based feature descriptions; to predict the confidence of each action, it aggregates features from the place associated with the action and its adjacent places in the scene layout. We introduce the Agent-in-Place Action dataset and show that our method allows neural network models to generalize significantly better to unseen scenes. | en_US |
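A note on the NISP step described in the abstract: computing importance scores on the final responses and propagating them backward to every neuron suggests a simple recursive rule. The NumPy sketch below illustrates one plausible reading for fully connected layers only; the function names, the backward rule s_{k-1} = |W_k|^T s_k, and the top-k pruning heuristic are illustrative assumptions, not the dissertation's exact formulation.

```python
import numpy as np

def propagate_importance(weights, final_scores):
    """Backward-propagate neuron importance scores, NISP-style.

    weights: list [W_1, ..., W_L], where layer k maps x_{k-1} to
             x_k via W_k (biases and nonlinearities omitted here).
    final_scores: importance of the final responses, shape (d_L,).
    Returns one score vector per activation, index 0 = network input.
    """
    scores = [None] * (len(weights) + 1)
    scores[-1] = np.asarray(final_scores, dtype=float)
    for k in range(len(weights), 0, -1):
        # A neuron is important if it feeds, with large absolute
        # weight, into neurons that are themselves important:
        # s_{k-1} = |W_k|^T s_k
        scores[k - 1] = np.abs(weights[k - 1]).T @ scores[k]
    return scores

def prune_mask(layer_scores, keep_ratio=0.5):
    """Binary keep-mask for one layer: retain the top-scoring fraction."""
    k = max(1, int(round(len(layer_scores) * keep_ratio)))
    threshold = np.sort(layer_scores)[-k]
    return layer_scores >= threshold
```

Because a single backward pass yields a keep-mask for every layer at once, all layers can be pruned jointly, consistent with the abstract's claim of pruning the entire network jointly rather than greedily layer by layer.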
dc.identifier | https://doi.org/10.13016/gntr-qoyn | |
dc.identifier.uri | http://hdl.handle.net/1903/22155 | |
dc.language.iso | en | en_US |
dc.subject.pqcontrolled | Computer science | en_US |
dc.subject.pquncontrolled | Action Recognition | en_US |
dc.subject.pquncontrolled | Computer Vision | en_US |
dc.subject.pquncontrolled | Deep Learning | en_US |
dc.subject.pquncontrolled | Generalization | en_US |
dc.subject.pquncontrolled | Network Pruning | en_US |
dc.subject.pquncontrolled | Visual Recognition | en_US |
dc.title | Improving Efficiency and Generalization of Visual Recognition | en_US |
dc.type | Dissertation | en_US |
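The spatial-temporal attention-based 3D ConvNet that the abstract attributes to ReMotENet can likewise be sketched: 3D convolutions extract appearance and motion features, and a learned attention map over time and space gates them before pooling. The PyTorch block below is a hypothetical, heavily simplified rendering; the layer widths, the single-channel sigmoid attention map, and the two-class head are illustrative assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn

class AttentionBlock3D(nn.Module):
    """3D-conv block gated by a learned spatial-temporal attention mask."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        # 1x1x1 conv: one attention value per (t, h, w) location
        self.att = nn.Conv3d(out_ch, 1, kernel_size=1)
        self.pool = nn.MaxPool3d(kernel_size=2)

    def forward(self, x):
        h = torch.relu(self.conv(x))
        mask = torch.sigmoid(self.att(h))  # (N, 1, T, H, W), values in [0, 1]
        return self.pool(h * mask)         # suppress background, keep moving objects

class TinyReMotENet(nn.Module):
    """End-to-end clip classifier: does this clip contain relevant motion?"""
    def __init__(self, num_classes=2, width=16):
        super().__init__()
        self.blocks = nn.Sequential(
            AttentionBlock3D(3, width),
            AttentionBlock3D(width, 2 * width),
        )
        self.head = nn.Linear(2 * width, num_classes)

    def forward(self, clip):               # clip: (N, 3, T, H, W)
        h = self.blocks(clip)
        h = h.mean(dim=(2, 3, 4))          # global average pool over T, H, W
        return self.head(h)

# Example: score two 16-frame, 64x64 clips in a single forward pass.
model = TinyReMotENet()
logits = model(torch.randn(2, 3, 16, 64, 64))  # shape (2, 2)
```

A single forward pass over a coarsely sampled clip replaces per-frame object detection and tracking, which is the design choice behind the dramatic speedup the abstract describes.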