Improving Efficiency for Object Detection and Temporal Modeling for Action Localization

Despite their great predictive capability, Convolutional Neural Networks (CNNs) are computationally expensive to deploy and usually require a tremendous amount of annotated data at training time. When analyzing videos, modeling temporal dynamics is both important and challenging because of large appearance variation and complex semantics. We propose methods to improve the efficiency of model deployment for object detection in images and to capture temporal dependencies for online action detection in videos. To reduce the human labor required for data annotation, we introduce approaches to object detection and natural language localization that rely on weak supervision.

First, we introduce a generic framework that reduces the computational cost of object detection while retaining accuracy in scenarios where objects of varied sizes appear in high-resolution images. Detection progresses in a coarse-to-fine manner, first on a down-sampled version of the image and then on a sequence of higher-resolution regions identified as likely to improve the detection accuracy. Built upon reinforcement learning, our approach consists of a model (R-net) that uses coarse detection results to predict the potential accuracy gain from analyzing a region at a higher resolution and another model (Q-net) that sequentially selects regions to zoom in on.
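
The sequential zoom-in step can be illustrated with a small sketch. It is not the learned Q-net: it assumes a hypothetical 2D grid of predicted accuracy gains (standing in for the R-net output) and swaps the reinforcement-learned policy for a plain greedy rule that repeatedly picks the fixed-size window with the largest summed gain. The function name, window size, and stopping threshold are all illustrative assumptions.

```python
def select_zoom_regions(gain_map, win_h, win_w, max_regions, min_gain=0.0):
    """Greedy stand-in for sequential region selection (NOT the learned Q-net).

    gain_map: 2D list of predicted accuracy gains (hypothetical R-net output).
    Returns a list of (row, col, height, width) windows to analyze at high resolution.
    """
    gains = [row[:] for row in gain_map]  # copy so the caller's map is untouched
    rows, cols = len(gains), len(gains[0])
    regions = []
    for _ in range(max_regions):
        best_total, best_pos = None, None
        # Exhaustively score every window position on the coarse grid.
        for r in range(rows - win_h + 1):
            for c in range(cols - win_w + 1):
                total = sum(
                    gains[rr][cc]
                    for rr in range(r, r + win_h)
                    for cc in range(c, c + win_w)
                )
                if best_total is None or total > best_total:
                    best_total, best_pos = total, (r, c)
        # Stop when no window promises more than the minimum gain.
        if best_total is None or best_total <= min_gain:
            break
        r, c = best_pos
        regions.append((r, c, win_h, win_w))
        # Zero out the chosen window so later picks favor unexplored areas.
        for rr in range(r, r + win_h):
            for cc in range(c, c + win_w):
                gains[rr][cc] = 0.0
    return regions
```

In this sketch, zeroing out a selected window plays the role of updating the state after each zoom, so the next selection depends on what has already been analyzed.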

Second, we propose a novel framework, Temporal Recurrent Network (TRN), to model greater temporal context of a video frame by simultaneously performing online action detection and anticipation of the immediate future. At each moment in time, our approach makes use of both accumulated historical evidence and predicted future information to better recognize the action that is currently occurring, and integrates both of these into a unified end-to-end architecture. We evaluate our approach on two popular online action detection datasets, HDD and TVSeries, as well as another widely used dataset, THUMOS’14.

Third, we propose StartNet to address Online Detection of Action Start (ODAS) where action starts and their associated categories are detected in untrimmed, streaming videos. Our method decomposes ODAS into two stages: action classification (using ClsNet) and start point localization (using LocNet). ClsNet focuses on per-frame labeling and predicts action score distributions online. Based on the predicted action scores of the past and current frames, LocNet conducts class-agnostic start detection by optimizing long-term localization rewards using policy gradient methods. The proposed framework is validated on two large-scale datasets, THUMOS’14 and ActivityNet.
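
The two-stage decomposition can be sketched at inference time as follows. This is a minimal illustration under stated assumptions, not the StartNet implementation: the per-frame class scores stand in for ClsNet output, the class-agnostic start probabilities stand in for LocNet output, and the decision rule (threshold, background index, and the requirement that the label change from the previous frame) is assumed for illustration. Policy-gradient training of LocNet is not shown.

```python
def detect_action_starts(class_scores, start_probs, background=0, threshold=0.5):
    """Combine per-frame class scores with class-agnostic start probabilities.

    class_scores: list of per-frame score lists (hypothetical ClsNet output).
    start_probs:  list of per-frame start probabilities (hypothetical LocNet output).
    Returns a list of (frame_index, class_index) action starts.
    """
    starts = []
    prev_label = background
    # Frames are visited in order, as they would arrive in a stream.
    for t, (scores, p_start) in enumerate(zip(class_scores, start_probs)):
        label = max(range(len(scores)), key=lambda k: scores[k])
        # Assumed rule: an action class, a confident start, and a label change.
        if label != background and p_start >= threshold and label != prev_label:
            starts.append((t, label))
        prev_label = label
    return starts
```

Because the start decision is class-agnostic, the category attached to each detected start comes entirely from the classification stage.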

Fourth, we introduce Count-guided Weakly Supervised Localization (C-WSL), an approach that uses per-class object count as a new form of supervision to improve Weakly Supervised Localization (WSL). C-WSL uses a simple count-based region selection algorithm to select high-quality regions, each of which covers a single object instance during training, and improves existing WSL methods by training with the selected regions. To demonstrate the effectiveness of C-WSL, we integrate it into two WSL architectures and conduct extensive experiments on VOC2007 and VOC2012.
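
A count-based region selection step might look like the sketch below. It assumes a simple greedy rule (highest-scoring candidates first, rejecting any box that overlaps an already selected one too much) and a hypothetical overlap threshold; the selection algorithm in the dissertation may differ in its details.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_regions_by_count(boxes, scores, count, max_iou=0.1):
    """Pick at most `count` high-scoring, weakly overlapping boxes for one class.

    count:   per-class object count used as supervision.
    max_iou: assumed overlap threshold (illustrative value).
    Returns the indices of the selected boxes, best score first.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    selected = []
    for i in order:
        if len(selected) == count:
            break
        # Keep a box only if it does not overlap an already selected one.
        if all(iou(boxes[i], boxes[j]) <= max_iou for j in selected):
            selected.append(i)
    return selected
```

The count caps how many regions are kept, and the overlap test encourages each kept region to cover a different object instance.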

Finally, we propose Weakly Supervised Language Localization Networks (WSLLN) to detect events in long, untrimmed videos given language queries. WSLLN relieves the annotation burden by training with only video-sentence pairs, without access to the temporal locations of events. With a simple end-to-end structure, WSLLN measures segment-text consistency and conducts segment selection (conditioned on the text) simultaneously. Results from both are merged and optimized as a video-sentence matching problem. Experiments are conducted on ActivityNet Captions and DiDeMo.
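
The merging of the two branches can be sketched as below. The abstract does not specify the merge, so this assumes one plausible form: per-segment consistency scores squashed with a sigmoid, selection scores normalized with a softmax over segments, an element-wise product as the merge, and the sum of the merged scores as the video-sentence matching score. All function names and inputs are hypothetical.

```python
import math


def softmax(xs):
    """Softmax over a list of logits, shifted by the max for numerical stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Logistic function; intended for moderate inputs in this sketch."""
    return 1.0 / (1.0 + math.exp(-x))


def video_sentence_score(consistency_logits, selection_logits):
    """Merge per-segment branch outputs into one video-level matching score.

    Returns (merged per-segment scores, video-sentence score in [0, 1]).
    """
    consistency = [sigmoid(x) for x in consistency_logits]  # segment-text consistency
    selection = softmax(selection_logits)                   # competition across segments
    merged = [c * s for c, s in zip(consistency, selection)]
    return merged, sum(merged)
```

Under these assumptions, only the video-level score needs a label during training (whether the sentence matches the video), while at test time the segment with the highest merged score can be reported as the localized event.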