Towards Fast and Efficient Representation Learning
The success of deep learning and convolutional neural networks in many fields is accompanied by a significant increase in computation cost. With the increasing model complexity and pervasive usage of deep neural networks, there is a surge of interest in fast and efficient model training and inference on both cloud and embedded devices. Meanwhile, understanding the reasons for trainability and generalization is fundamental for their further development. This dissertation explores approaches for fast and efficient representation learning with a better understanding of trainability and generalization. In particular, we ask the following questions and provide our solutions: 1) How to reduce the computation cost for fast inference? 2) How to train low-precision models on resource-constrained devices? 3) What does the loss surface look like for neural nets, and how does it affect generalization?
To reduce the computation cost for fast inference, we propose to prune filters from CNNs that are identified as having a small effect on the prediction accuracy. By removing filters with small norms together with their connected feature maps, the computation cost can be reduced accordingly without using special software or hardware. We show that a simple filter pruning approach can reduce the inference cost while regaining close to the original accuracy by retraining the networks.
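The pruning step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the dissertation's implementation: the function name `prune_filters`, the choice of the L1 norm, and the weight layout `(out_channels, in_channels, kh, kw)` are all assumptions made for the example.

```python
import numpy as np

def prune_filters(w, w_next, n_prune):
    """Drop the n_prune filters of `w` with the smallest norm (hypothetical helper).

    w:      (out_ch, in_ch, kh, kw) weights of the layer being pruned
    w_next: (next_out, out_ch, kh, kw) weights of the following layer
    """
    # Assumption: rank filters by L1 norm (sum of absolute weights per filter).
    norms = np.abs(w).sum(axis=(1, 2, 3))
    # Keep everything except the n_prune smallest, preserving the original order.
    keep = np.sort(np.argsort(norms)[n_prune:])
    # Removing a filter removes its output feature map, so the next layer
    # loses the matching input channel as well.
    return w[keep], w_next[:, keep], keep

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 3, 3, 3))
w[[2, 5]] *= 1e-3                      # make two filters clearly unimportant
w_next = rng.normal(size=(16, 8, 3, 3))

w_pruned, w_next_pruned, kept = prune_filters(w, w_next, n_prune=2)
print(w_pruned.shape, w_next_pruned.shape, kept.tolist())
```

Because whole filters are removed, the result is a smaller dense layer rather than a sparse one, which is why no special software or hardware is needed. In practice the pruned network would then be retrained to recover accuracy.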
Quantizing model parameters with low-precision representations reduces the inference cost further and has shown significant speedups, especially on edge devices with limited computing resources, memory capacity, and power. To enable on-device learning on low-power systems, the key challenge is removing the dependency on a full-precision model during training. We study various quantized training methods with the goal of understanding the differences in their behavior and the reasons for their success or failure.
We address the issue of why algorithms that maintain floating-point representations work so well, while fully quantized training methods stall before training is complete. We show that training algorithms that exploit high-precision representations have an important greedy search phase that purely quantized training methods lack, which explains the difficulty of training using low-precision arithmetic.
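A toy example can make the contrast concrete. The sketch below is not the analysis in the dissertation; it only shows one simple way small updates can be lost when weights live purely on a quantized grid. The grid step `DELTA`, the 1-D quadratic loss, the use of deterministic rounding, and the function names are all assumptions chosen for illustration.

```python
import numpy as np

DELTA = 0.1  # assumed quantization step

def quantize(w):
    # Deterministic rounding to the nearest grid point (an assumption;
    # real quantized training methods often use other rounding schemes).
    return DELTA * np.round(w / DELTA)

def train(keep_float, steps=200, lr=0.01, target=0.73):
    """Minimize 0.5 * (w - target)^2 with gradients taken at the quantized weight."""
    w_float = 0.0   # high-precision accumulator (used only if keep_float)
    w_q = 0.0       # quantized weight used in the forward pass
    for _ in range(steps):
        grad = w_q - target
        if keep_float:
            # Small updates accumulate in full precision, then get quantized.
            w_float -= lr * grad
            w_q = quantize(w_float)
        else:
            # Purely quantized: each update is rounded away immediately.
            w_q = quantize(w_q - lr * grad)
    return float(w_q)

print("float accumulator:", train(keep_float=True))
print("purely quantized: ", train(keep_float=False))
```

In this toy setting each update is smaller than half the grid step, so the purely quantized variant never leaves its starting point, while the variant with a floating-point accumulator moves toward the target.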
Finally, we explore the structure of neural loss functions, and the effect of loss landscapes on generalization, using a range of visualization methods. We introduce a simple filter normalization method that helps us visualize loss function curvature, and make meaningful side-by-side comparisons between loss functions. The sharpness of minimizers correlates well with generalization error when this visualization is used. Then, using a variety of visualizations, we explore how training hyper-parameters affect the shape of minimizers, and how network architecture affects the loss landscape.
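As a rough illustration of the visualization idea, the sketch below draws a random direction in weight space, rescales each filter of that direction to match the norm of the corresponding filter in the weights, and evaluates the loss along the resulting line. This reading of filter normalization is an assumption, since the abstract does not spell out the procedure; the function names and the toy quadratic loss are hypothetical.

```python
import numpy as np

def filter_normalized_direction(weights, rng):
    """Random direction whose filters have the same norms as those in `weights`.

    weights: (out_ch, in_ch, kh, kw); each slice along axis 0 is one filter.
    """
    d = rng.normal(size=weights.shape)
    axes = tuple(range(1, weights.ndim))
    d_norm = np.sqrt((d ** 2).sum(axis=axes, keepdims=True))
    w_norm = np.sqrt((weights ** 2).sum(axis=axes, keepdims=True))
    # Rescale every filter of d to the norm of the matching filter in weights.
    return d * w_norm / (d_norm + 1e-10)

def loss_slice(loss_fn, weights, direction, alphas):
    """Evaluate the loss at weights + alpha * direction for each alpha."""
    return np.array([loss_fn(weights + a * direction) for a in alphas])

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 3, 3, 3))          # stand-in for trained weights
direction = filter_normalized_direction(theta, rng)

toy_loss = lambda w: float((w ** 2).sum())     # placeholder for a real network loss
alphas = np.linspace(-1.0, 1.0, 11)
curve = loss_slice(toy_loss, theta, direction, alphas)
```

Matching the direction's scale to the weights filter by filter is what makes plots from different networks or training settings comparable side by side. With a real network, `toy_loss` would be replaced by the training or test loss evaluated at the perturbed weights.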