Improving Efficiency and Generalization of Visual Recognition

dc.contributor.advisor: Davis, Larry S.
dc.contributor.author: Yu, Ruichi
dc.contributor.department: Computer Science
dc.contributor.publisher: Digital Repository at the University of Maryland
dc.contributor.publisher: University of Maryland (College Park, Md.)
dc.date.accessioned: 2019-06-22T05:31:43Z
dc.date.available: 2019-06-22T05:31:43Z
dc.date.issued: 2018
dc.description.abstract:

Deep Neural Networks (DNNs) are heavy in terms of their number of parameters and computational cost. This leads to two major challenges: first, training and deploying deep networks is expensive; second, without abundant annotated training data, which is very costly to obtain, DNNs easily suffer from over-fitting and generalize poorly. We propose approaches to these two challenges in the context of specific computer vision problems, improving their efficiency and generalization.

First, we study network pruning using neuron importance score propagation. To reduce the significant redundancy in DNNs, we formulate network pruning as a binary integer optimization problem that minimizes the reconstruction error on the final responses produced by the network, and derive a closed-form solution to it for pruning neurons in earlier layers. Based on this theoretical analysis, we propose the Neuron Importance Score Propagation (NISP) algorithm, which propagates the importance scores of the final responses to every neuron in the network and then prunes neurons across the entire network jointly.

Second, we study visual relationship detection (VRD) with linguistic knowledge distillation. Because the semantic space of visual relationships is huge and training data is limited, especially for long-tail relationships with few instances, detecting visual relationships in images is a challenging problem. To improve predictive capability, especially generalization to unseen relationships, we use linguistic statistics obtained from both training annotations (internal knowledge) and publicly available text, e.g., Wikipedia (external knowledge), to regularize visual model learning.

Third, we study the role of context selection in object detection. We investigate why context has limited utility in object detection by isolating and evaluating the predictive power of different context cues under ideal conditions in which context is provided by an oracle. Based on this study, we propose a region-based context re-scoring method with dynamic context selection to remove noise and emphasize informative context.

Fourth, we study efficient relevant motion event detection for large-scale home surveillance videos. Traditional methods based on object detection and tracking are extremely slow and require expensive GPU devices. To dramatically speed up relevant motion event detection and improve its performance, we propose ReMotENet, a unified, end-to-end, data-driven network that uses spatial-temporal attention-based 3D ConvNets to jointly model the appearance and motion of objects-of-interest in a video.

Finally, we address the recognition of agent-in-place actions, which are associated with the agents who perform them and the places where they occur, in the context of outdoor home surveillance. We introduce a representation of the geometry and topology of scene layouts so that a network can generalize from the layouts observed in the training set to unseen layouts in the test set. This Layout-Induced Video Representation (LIVR) abstracts away low-level appearance variance and encodes the geometric and topological relationships among places in a specific scene layout. LIVR partitions the semantic features of a video clip into different places to force the network to learn place-based feature descriptions; to predict the confidence of each action, it aggregates features from the place associated with the action and its adjacent places in the scene layout. We introduce the Agent-in-Place Action dataset and show that our method allows neural network models to generalize significantly better to unseen scenes.
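To make the NISP step above concrete, the following is a minimal sketch in Python of backward importance propagation, assuming plain fully connected layers. The helper names propagate_importance and prune_mask, the uniform final scores, and the toy layer sizes are illustrative assumptions, not the dissertation's implementation.

import numpy as np

def propagate_importance(weights, final_scores):
    """Propagate importance scores from the final responses back to every
    earlier layer: a neuron is important if it connects strongly to
    important neurons in the next layer. (Illustrative sketch.)

    weights: list of weight matrices; weights[l] has shape (n_out, n_in)
             and maps layer l's inputs to its outputs.
    final_scores: importance of the final-response neurons.
    Returns per-layer scores, ordered from the input layer to the output.
    """
    scores = final_scores
    per_layer = [scores]
    for W in reversed(weights):
        # Importance flows backward through connection magnitudes.
        scores = np.abs(W).T @ scores
        per_layer.append(scores)
    return per_layer[::-1]

def prune_mask(scores, keep_ratio):
    """Keep the top fraction of neurons ranked by importance score."""
    k = max(1, int(round(len(scores) * keep_ratio)))
    thresh = np.sort(scores)[-k]
    return scores >= thresh

# Toy usage: a 128-64-10 network with uniform importance on final responses.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((64, 128)), rng.standard_normal((10, 64))]
scores = propagate_importance(weights, np.ones(10))
hidden_mask = prune_mask(scores[1], keep_ratio=0.25)  # prune the 64-unit layer

Neurons with a False mask entry would then be removed jointly across layers, rather than greedily layer by layer.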
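The linguistic knowledge distillation in the second part can likewise be sketched as a regularized loss: supervised cross-entropy plus a KL term pulling the visual model's predicted predicate distribution toward a linguistic prior. This is a minimal sketch under that assumption; distilled_loss, the single-softmax formulation, and the lam weight are illustrative, not the exact objective used in the dissertation.

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distilled_loss(logits, label, teacher_probs, lam=0.5):
    """Cross-entropy on the ground-truth predicate plus a KL term that
    pulls the visual model's prediction toward a linguistic prior.

    logits: (num_predicates,) raw scores from the visual model.
    label: index of the ground-truth predicate.
    teacher_probs: (num_predicates,) linguistic prior, e.g. predicate
                   frequencies estimated from annotations or Wikipedia text.
    lam: weight balancing supervision against the linguistic prior.
    """
    p = softmax(logits)
    ce = -np.log(p[label] + 1e-12)
    kl = np.sum(teacher_probs * np.log((teacher_probs + 1e-12) / (p + 1e-12)))
    return ce + lam * kl

In practice the prior would be conditioned on the subject-object pair, so teacher_probs would vary per training example; that conditioning is what lets the prior inform rarely seen or unseen relationships.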
dc.identifier: https://doi.org/10.13016/gntr-qoyn
dc.identifier.uri: http://hdl.handle.net/1903/22155
dc.language.iso: en
dc.subject.pqcontrolled: Computer science
dc.subject.pquncontrolled: Action Recognition
dc.subject.pquncontrolled: Computer Vision
dc.subject.pquncontrolled: Deep Learning
dc.subject.pquncontrolled: Generalization
dc.subject.pquncontrolled: Network Pruning
dc.subject.pquncontrolled: Visual Recognition
dc.title: Improving Efficiency and Generalization of Visual Recognition
dc.type: Dissertation

Files

Original bundle

Name: Yu_umd_0117E_19531.pdf
Size: 22.48 MB
Format: Adobe Portable Document Format