Effective Training and Efficient Inference of Deep Neural Networks for Visual Understanding


Date

2022

Abstract

Since the phenomenal success of deep neural networks (DNNs) on image classification, the research community has been developing wider and deeper networks with complex components for a variety of visual understanding tasks. While such “heavy” models achieve excellent performance, they pose two main challenges: (1) training requires significant computational resources as well as large-scale labeled datasets acquired through time-consuming and labor-intensive human annotation; and (2) inference can be slow even on expensive graphics cards due to the high model complexity. To address these challenges, we explore improving the effectiveness of training DNNs so that better performance is achieved under the same computation and/or annotation cost, and improving the efficiency of inference so that the computational cost of DNNs is reduced while high accuracy is maintained.

In this dissertation, we first propose several approaches for training object recognition and detection models more effectively, including devising noise-aware supervisory signals, developing better semi-supervised learning methods, and analyzing different pre-training techniques. In the second part, we present two adaptive computation frameworks that improve the inference efficiency of 3D convolutional networks and attention-based vision Transformers for the tasks of image and video classification.

Specifically, we first introduce NoisyAnchor, in which we identify the intrinsic label noise generated by the harsh, binary IoU-based (Intersection-over-Union) foreground/background split of training samples in object detection. We mitigate such noise by deriving a cleanliness score from the detector's output and down-weighting noisy training samples with further derived soft category labels and loss re-weighting coefficients. We then seek to boost object detection performance with readily available unannotated images, and propose improved semi-supervised learning (SSL) techniques that address two unique challenges of semi-supervised object detection: the lack of localization quality estimation and the amplified class imbalance when generating pseudo labels. In the third work, we empirically analyze how pre-training on image classification versus object detection affects downstream tasks, providing intuitions and actionable practices for effective task-specific pre-training.
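The cleanliness-based re-weighting idea can be illustrated with a minimal sketch. This is not the dissertation's actual formulation: the function names, the convex-combination form with mixing weight `alpha`, and the exponent `gamma` are assumptions made here for illustration; the real NoisyAnchor scoring and weighting scheme may differ.

```python
import numpy as np

def cleanliness_scores(cls_conf, loc_iou, alpha=0.5):
    """Hypothetical cleanliness score per positive anchor: a convex
    combination of the detector's classification confidence and the
    localization IoU (alpha is an assumed mixing weight)."""
    return alpha * cls_conf + (1.0 - alpha) * loc_iou

def reweighted_loss(per_anchor_loss, cleanliness, gamma=1.0):
    """Down-weight noisy anchors: each loss term is scaled by its
    cleanliness score raised to an assumed exponent gamma, then the
    weighted losses are normalized by the total weight."""
    weights = cleanliness ** gamma
    return float(np.sum(weights * per_anchor_loss) / np.sum(weights))
```

A clean anchor (high confidence, high IoU) thus contributes more to the loss than an ambiguous one, which is the intended down-weighting of noisy samples.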

To improve inference efficiency, we explore adaptive computation methods that produce input-specific inference policies for an overall reduced computational cost, and present Ada3D and AdaViT. In particular, Ada3D learns to adaptively allocate computational resources by selectively keeping informative input frames and activating 3D convolutional layers on a per-input basis for video classification; AdaViT exploits the redundancy of the self-attention mechanism in Vision Transformers for image classification and improves their efficiency by deriving input-specific usage policies on which patches, self-attention heads, and transformer blocks to use throughout the backbone. Such adaptive computation methods tend to allocate less computation to “easy” images and “static” videos, resulting in a reduced overall computational cost.
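The shared mechanism behind these frameworks, gating computation on a per-input basis, can be sketched as follows. This is a toy illustration, not the actual Ada3D or AdaViT architecture: the `gates`, the hard `threshold`, and the block interface are assumptions made here (the real methods learn policy networks trained end-to-end, typically with differentiable relaxations of the discrete skip decisions).

```python
def adaptive_forward(x, blocks, gates, threshold=0.5):
    """Input-dependent block skipping in the spirit of Ada3D/AdaViT:
    each gate maps the current input to a keep score, and blocks whose
    score falls below an assumed threshold are skipped entirely,
    saving their computation for 'easy' inputs."""
    used = []
    for block, gate in zip(blocks, gates):
        if gate(x) >= threshold:
            x = block(x)      # execute this block
            used.append(True)
        else:
            used.append(False)  # skip: no compute spent on this block
    return x, used
```

At inference time the skipped blocks are simply never executed, so the saved computation is realized directly rather than only in expectation.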
