IMAGE AND VIDEO UNDERSTANDING WITH CONSTRAINED RESOURCES
Publication or External Link
Recent advances in computer vision tasks have been driven by high-capacity deep neural networks, particularly Convolutional Neural Networks (CNNs) with hundreds of layers trained in a supervised manner. However, this poses two significant challenges: (1) the increased depth in CNNs that leads to significant improvements over competitive benchmarks at the same time, limits their deployment in real-world scenarios due to high computational cost, (2) the need to collect millions of human labeled samples for training prevents such approaches to scale, especially for fine-grained image understanding like semantic segmentation, where dense annotations are extremely expensive to obtain. To mitigate these issues, we focus on image and video understanding with constrained resources, in the forms of computational resources and annotation resources. In particular, we present approaches that (1) investigate dynamic computation frameworks which adaptively allocate computing resources on-the-fly given a novel image/video to manage the trade-off between accuracy and computational complexity; (2) derive robust representations with minimal human supervision through exploring context relationships or using shared information across domains.
With this in mind, we first introduce BlockDrop, a conditional computation approach that learns to dynamically choose which layers of a deep network to execute during inference so as to best reduce total computation without degrading prediction accuracy. Next, we generalize the idea of conditional computation of images to videos by presenting AdaFrame, a framework that adaptively selects relevant frames on a per-input basis for fast video recognition. AdaFrame assumes access to all frames in videos, and hence can be only used in offline settings. To mitigate this issue, we introduce LiteEval, a simple yet effective coarse-to-fine framework for resource efficient video recognition, suitable for both online and offline scenarios.
To derive robust feature representations with limited annotation resources, we first explore the power of spatial context as a supervisory signal for learning visual representations. In addition, we also propose to learn from synthetic data rendered by modern computer graphics tools, where ground-truth labels are readily available. We propose Dual Channel-wise Alignment Networks (DCAN), a simple yet effective approach to reduce domain shift at both pixel-level and feature-level, for unsupervised scene adaptation.