Detecting Objects and Actions with Deep Learning

Singh, Bharat

Detecting Objects and Actions with Deep Learning

Files

Singh_umd_0117E_19349.pdf (6.07 MB)

No. of downloads: 889

Date

2018

Authors

Singh, Bharat

Advisor

Davis, Larry S

DRUM DOI

https://doi.org/10.13016/M2FX7423N

Abstract

Deep learning based visual recognition and localization is one of the pillars of computer vision and is the driving force behind applications like self-driving cars, visual search, video surveillance, augmented reality, to name a few. This thesis identifies key bottlenecks in state-of-the-art visual recognition pipelines which use convolutional neural networks and proposes effective solutions to push their limits. A few shortcomings of convolutional neural networks are, lack of scale invariance which poses a challenge for tasks like object detection, fixed structure of the network which restricts their usage when presented with new class labels, and difficulty in modeling long range spatial/temporal dependencies. We provide evidence of these problems and then design effective solutions to overcome them.

In the first part, an analysis of different techniques for recognizing and detecting objects under extreme scale variation is presented. Since small and large objects are difficult to recognize at smaller and larger scales of an image pyramid respectively, we present a novel training scheme called Scale Normalization for Image Pyramids (SNIP) which selectively back-propagates the gradients of object instances of different sizes as a function of the image scale. As SNIP ignores gradients of objects at extreme resolutions, following up on this idea, we developed SNIPER (Scale Normalization for Image Pyramids with Efficient Re-sampling), an algorithm for performing efficient multi-scale training for instance level visual recognition tasks. Instead of processing every pixel in an image pyramid, SNIPER processes context regions (512x512 pixels) around ground-truth instances at the appropriate scale. For background sampling, these context-regions are generated using proposals extracted from a region proposal network trained with a short learning schedule. Hence, the number of chips generated per image during training adaptively changes based on the scene complexity. SNIPER brings training of instance level recognition tasks like object detection closer to the protocol for image classification and suggests that the commonly accepted guideline that it is important to train on high resolution images for instance level visual recognition tasks might not be correct.





Next, we present a real-time large-scale object detector (R-FCN-3000) for detecting thousands of classes where objectness detection and classification are decoupled. To obtain the detection score for an RoI, we multiply the objectness score with the fine-grained classification score. We show that the objectness learned by R-FCN-3000 generalizes to novel classes and the performance increases with the number of training object classes - supporting the hypothesis that it is possible to learn a universal objectness detector. Because of generalized objectness, we can train object detectors for new classes, just with classification data, without even requiring bounding boxes.



Finally, we present a multi-stream bi-directional recurrent neural network for action detection. This was the first deep learning based system which could perform action localization in long videos and it could do it just with RGB data, without requiring any skeletal models or performing intermediate tasks like pose-estimation. Our system uses a tracking algorithm to locate a bounding box around the person, which provides a frame of reference for appearance and motion while suppressing background noise that is not within the bounding box. We train two additional streams on motion and appearance cropped to the tracked bounding box, along with full-frame streams. To model long-term temporal dynamics within and between actions, the multi-stream CNN is followed by a bi-directional Long Short-Term Memory (LSTM) layer. We show that our bi-directional LSTM network utilizes about 8 seconds of the video sequence to predict an action label and outperforms state-of-the-art methods on multiple benchmarks.

URI (handle)

http://hdl.handle.net/1903/21149

Collections

UMD Theses and Dissertations
Computer Science Theses and Dissertations

Full item page