Show simple item record

dc.contributor.advisorDavis, Larry Sen_US
dc.contributor.authorNg, Yue Heien_US
dc.date.accessioned2018-07-17T06:06:40Z
dc.date.available2018-07-17T06:06:40Z
dc.date.issued2018en_US
dc.identifierhttps://doi.org/10.13016/M24B2X79R
dc.identifier.urihttp://hdl.handle.net/1903/20934
dc.description.abstractVideo understanding is one of the fundamental problems in computer vision. Videos provide more information to the image recognition task by adding a temporal component through which motion and other information can be additionally used. Encouraged by the success of deep convolutional neural networks (CNNs) on image classification, we extend the deep convolutional networks to video understanding by modeling both spatial and temporal information. To effectively utilize deep networks, we need a comprehensive understanding of convolutional neural networks. We first study the network on the domain of image retrieval. We show that for instance-level image retrieval, lower layers often perform better than the last layers in convolutional neural networks. We present an approach for extracting convolutional features from different layers of the networks and adopt VLAD encoding to encode features into a single vector for each image. Our work provides guidance for transferring deep convolutional networks to other tasks. We then propose and evaluate several deep neural network architectures to combine image information across a video over longer time periods than previously attempted. We propose two methods capable of handling full length videos. The first method explores various convolutional temporal feature pooling architectures, examining the various design choices which need to be made when adapting a CNN for this task. The second proposed method explicitly models the video as an ordered sequence of frames. For this purpose, we employ a recurrent neural network that uses Long Short-Term Memory (LSTM) cells which are connected to the output of the underlying CNN. Next, we propose a multitask learning model ActionFlowNet to train a single stream network directly from raw pixels to jointly estimate optical flow while recognizing actions with convolutional neural networks, capturing both appearance and motion in a single model. Experiments show that our model effectively learns video representation from motion information on unlabeled videos. While recent deep models for videos show improvement by incorporating optical flow or aggregating high-level appearance across frames, they focus on modeling either the long-term temporal relations or short-term motion. We propose Temporal Difference Networks (TDN) that model both long-term relations and short-term motion from videos. We leverage a simple but effective motion representation: difference of CNN features in our network and jointly modeling the motion at multiple scales in a single CNN.en_US
dc.language.isoenen_US
dc.titleVideo Understanding with Deep Networksen_US
dc.typeDissertationen_US
dc.contributor.publisherDigital Repository at the University of Marylanden_US
dc.contributor.publisherUniversity of Maryland (College Park, Md.)en_US
dc.contributor.departmentComputer Scienceen_US
dc.subject.pqcontrolledComputer scienceen_US
dc.subject.pquncontrolledAction Recognitionen_US
dc.subject.pquncontrolledDeep Learningen_US
dc.subject.pquncontrolledVideo Classificationen_US
dc.subject.pquncontrolledVideo Understandingen_US


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record