USING CNNS TO UNDERSTAND LIGHTING WITHOUT REAL LABELED TRAINING DATA

Date

2019

Abstract

The task of computer vision is to make computers understand the physical world through images. Lighting is the medium through which we capture images of the physical world. Without lighting, there is no image, and different lighting leads to different images of the same physical world. In this dissertation, we study how to understand lighting from images. With the emergence of large datasets and deep learning in recent years, learning-based methods play an increasingly important role in computer vision, and deep Convolutional Neural Networks (CNNs) now dominate most problems in the field. Despite their success, deep CNNs are notorious for being data hungry compared with traditional learning-based methods. While collecting images from the internet is easy and fast, labeling those images is time consuming and expensive, and sometimes even impossible. In this work, we focus on understanding lighting from faces and natural scenes, for which ground truth lighting labels are impossible to obtain.

As a preliminary topic, we first study the capacity of deep CNNs. Designing deep CNNs with less capacity and good generalization is one way to reduce the amount of labeled data needed to train them, and understanding the capacity of deep CNNs is the first step towards that goal. In this work, we empirically study the capacity of deep CNNs by studying the redundancy of their parameters. More specifically, we aim to optimize the number of neurons in a network, and thus the number of parameters. To achieve that goal, we incorporate sparse constraints into the objective function and apply a forward-backward splitting method to solve this sparsity-constrained optimization problem efficiently. The proposed method can significantly reduce the number of parameters, showing that networks with small capacity can work well.
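As an illustration of the forward-backward splitting idea, the following is a minimal sketch in plain Python, not the dissertation's implementation. It assumes a group-sparsity penalty in which each row of a weight matrix holds the weights of one neuron; the abstract does not specify the exact form of the sparse constraint, and the function names `group_soft_threshold`, `forward_backward_step`, and `count_active_neurons` are hypothetical. The forward step is an ordinary gradient step on the data loss, and the backward step applies the proximal operator of the penalty, which sets whole rows to zero and thereby removes neurons.

```python
import math


def group_soft_threshold(weights, threshold):
    """Proximal operator of threshold * (sum of row L2 norms).

    Each row is treated as the weights of one neuron (an assumption).
    Rows whose norm is at most `threshold` are zeroed; others are shrunk.
    """
    result = []
    for row in weights:
        norm = math.sqrt(sum(w * w for w in row))
        if norm <= threshold:
            result.append([0.0] * len(row))
        else:
            scale = 1.0 - threshold / norm
            result.append([scale * w for w in row])
    return result


def forward_backward_step(weights, grads, lr, lam):
    """One forward-backward splitting iteration.

    Forward: gradient step on the smooth loss using `grads`.
    Backward: proximal step on the group-sparsity penalty with weight `lam`.
    """
    stepped = [
        [w - lr * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, grads)
    ]
    return group_soft_threshold(stepped, lr * lam)


def count_active_neurons(weights):
    """Number of rows (neurons) that still have a nonzero weight."""
    return sum(1 for row in weights if any(w != 0.0 for w in row))
```

In this sketch, a neuron with a small weight norm is removed entirely in the backward step, while a neuron with a large norm is only shrunk, which is how the number of neurons, and thus the number of parameters, can decrease during training.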

We then study an important problem in computer vision: inverse lighting from a single face image. Lacking massive ground truth lighting labels, we generate a large amount of synthetic data with ground truth lighting to train a deep network. However, because of the large domain gap between real and synthetic data, a network trained only on synthetic data does not generalize well to real data. We therefore propose to train the deep CNN on real data together with synthetic data. We apply an existing method to estimate the lighting conditions of real face images, but the resulting lighting labels are noisy. We then propose a Label Denoising Adversarial Network (LDAN) that uses the synthetic data to help train a deep CNN to regress lighting from real face images while denoising the labels of the real images. We show that the proposed method produces more consistent lighting estimates for faces taken under the same lighting condition.

Third, we study how to relight a face image using deep CNNs. We formulate this problem as a supervised image-to-image translation problem. Because no "in the wild" face dataset is suitable for this task, we apply a physically based face relighting method to generate a large-scale, high-resolution, "in the wild" portrait relighting dataset (DPR). A deep CNN is then trained on this dataset to generate a relit portrait image from a source image and a target lighting. We show that our training procedure regularizes the generated results, removing the artifacts caused by physically based relighting methods.

Fourth, we study how to understand lighting from a natural scene based on an RGB image. We propose a Global-Local Spherical Harmonics (GLoSH) lighting model to improve the lighting representation, and jointly predict reflectance and surface normals. The global SH models the holistic lighting, while the local SHs account for the spatial variation of lighting. A novel non-negative lighting constraint is proposed to encourage the estimated SHs to be physically meaningful. To make seamless use of the GLoSH model, we design a coarse-to-fine network structure. Lacking labels for reflectance and lighting, we use synthetic data for model pre-training and fine-tune the model on real data in a self-supervised way. We show that the proposed method outperforms state-of-the-art methods in understanding the lighting, reflectance, and shading of a natural scene.
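To make the idea of a non-negative lighting constraint concrete, here is a minimal sketch in plain Python under stated assumptions. It evaluates lighting represented by nine second-order real spherical harmonics coefficients and penalizes directions where the reconstructed lighting is negative. The abstract does not give the form of the constraint, so the squared-hinge penalty, the sampling grid, and the names `sh_basis`, `evaluate_sh`, `sample_sphere`, and `negativity_penalty` are all hypothetical and should not be read as the dissertation's formulation.

```python
import math


def sh_basis(direction):
    """Real spherical harmonics basis up to order 2 for a unit direction."""
    x, y, z = direction
    return [
        0.282095,
        0.488603 * y,
        0.488603 * z,
        0.488603 * x,
        1.092548 * x * y,
        1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z,
        0.546274 * (x * x - y * y),
    ]


def evaluate_sh(coeffs, direction):
    """Lighting intensity arriving from `direction` given 9 SH coefficients."""
    return sum(c * b for c, b in zip(coeffs, sh_basis(direction)))


def sample_sphere(n_theta=8, n_phi=16):
    """Unit directions on a regular latitude-longitude grid (an assumption)."""
    directions = []
    for i in range(n_theta):
        theta = math.pi * (i + 0.5) / n_theta
        for j in range(n_phi):
            phi = 2.0 * math.pi * j / n_phi
            directions.append((
                math.sin(theta) * math.cos(phi),
                math.sin(theta) * math.sin(phi),
                math.cos(theta),
            ))
    return directions


def negativity_penalty(coeffs, directions):
    """Mean squared negative part of the lighting over sampled directions."""
    total = 0.0
    for d in directions:
        value = evaluate_sh(coeffs, d)
        if value < 0.0:
            total += value * value
    return total / len(directions)
```

A purely ambient light (only the first coefficient nonzero and positive) incurs no penalty, whereas coefficients that imply negative light from some directions are penalized, which is one way such a term could push estimates towards physically meaningful lighting.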
