Towards robust and domain invariant feature representations in Deep Learning

A fundamental problem in perception-based systems is to define and learn representations of the scene that are robust and adaptive to several nuisance factors. In recent years, for a variety of tasks involving images, learned representations have been empirically shown to outperform handcrafted ones. However, their inability to generalize across varying data distributions poses the following question: do representations learned using deep networks merely fit a given data distribution, or do they sufficiently model the underlying structure of the problem? This question can be understood through a simple example: if a learning algorithm is shown a number of images of a simple handwritten digit, the representation it learns should be generic enough to identify the same digit in a different form. Although representations learned by deep networks have been shown to be robust to various forms of synthetic distortion such as random noise, they fail in the presence of more implicit, naturally occurring distortions. In this dissertation, we propose approaches to mitigate the effect of such distortions and, in the process, study some vulnerabilities of deep networks to small imperceptible changes in the input.

The research problems that comprise this dissertation lie at the intersection of two open topics: (1) studying and developing methods that enable neural networks to learn robust representations, and (2) improving the generalization of neural networks across domains. The first part of the dissertation approaches the problem of robustness from two broad viewpoints: robustness to external nuisance factors present in the data, and robustness (or the lack thereof) to perturbations of the learned feature space. In the second part, we focus on learning representations that are invariant to external covariate shift, more commonly termed domain shift.

Towards learning representations robust to external nuisance factors, we propose an approach that couples a deep convolutional neural network with a low-dimensional discriminative embedding learned using triplet probability constraints to solve the unconstrained face analysis problem. While previous approaches in this area have proposed scalable yet ad hoc solutions, ours is a principled, parameter-free formulation based on maximum likelihood estimation. In addition, we employ transfer learning to realize a deep network architecture that trains faster and on less data, yet significantly outperforms existing approaches on the unconstrained face verification task. We demonstrate the robustness of the approach to challenges including age, pose, blur and clutter by performing clustering experiments on challenging benchmarks.
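To make the triplet idea concrete, the following is a minimal sketch of the conventional triplet margin loss on toy embeddings. This illustrates only the standard formulation with a fixed margin hyperparameter; the dissertation's contribution replaces such hand-tuned margins with a parameter-free triplet *probability* formulation, which is not reproduced here.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss: pull the anchor toward the positive
    (same identity) and push it away from the negative (different identity)
    by at least `margin`. Illustrative only; the dissertation uses triplet
    probability constraints instead of a fixed margin."""
    d_pos = np.sum((anchor - positive) ** 2)   # squared distance to same identity
    d_neg = np.sum((anchor - negative) ** 2)   # squared distance to different identity
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D embeddings: the anchor is close to the positive, far from the negative.
a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])
n = np.array([-1.0, 0.0])
print(triplet_loss(a, p, n))  # 0.0 -- the constraint is already satisfied
```

Minimizing this loss over many such triplets is what shapes the low-dimensional discriminative embedding.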

Recent seminal works have shown that deep neural networks are susceptible to visually imperceptible perturbations of the input. In this dissertation, we build on their ideas in two ways: (a) we show that neural networks performing pixel-wise semantic segmentation also suffer from this vulnerability, despite being trained with more supervision than simple classification tasks; in addition, we present a novel self-correcting mechanism in segmentation networks and provide an efficient way to generate such perturbations; (b) we present a novel approach to regularize deep neural networks by perturbing intermediate-layer activations in an efficient manner, thereby exploring the trade-off between conventional regularization and adversarial robustness in very deep networks. Both of these works provide interesting directions towards understanding the security of deep learning algorithms.
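The flavor of such imperceptible perturbations can be sketched with the fast gradient sign method on a toy logistic model, where the gradient of the cross-entropy loss with respect to the input is available in closed form as (p - y)w. This is a generic illustration of the attack family, not the specific perturbation-generation scheme for segmentation networks developed in the dissertation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, y, w, b, eps=0.6):
    """Fast-gradient-sign perturbation of input x for a logistic model
    p(y=1|x) = sigmoid(w.x + b). The cross-entropy gradient w.r.t. x
    is (p - y) * w; we take a small step in its sign direction."""
    p = sigmoid(np.dot(w, x) + b)
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)   # step that increases the loss

# Toy model and a confidently, correctly classified point of class 1.
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([1.0, 0.5])               # w.x + b = 1.5, so p ~ 0.82
x_adv = fgsm_perturb(x, y=1, w=w, b=b)
p_adv = sigmoid(np.dot(w, x_adv) + b)
print(p_adv)                           # confidence on the true class falls below 0.5
```

Even this bounded, coordinate-wise perturbation flips the model's decision, which is the phenomenon the dissertation studies at the scale of pixel-wise segmentation.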

While humans find it extremely simple to generalize their knowledge across domains, machine learning algorithms, including deep neural networks, suffer from domain shift between what are commonly termed the 'source' (S) and 'target' (T) distributions. Let the data a learning algorithm is trained on be sampled from S. If the data used to evaluate the model is then sampled from T, the learned model under-performs on the target data. This inability to generalize is characterized as domain shift.
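In standard risk notation (our labels, not taken from the abstract), this mismatch can be written as

```latex
R_S(h) = \mathbb{E}_{(x,y)\sim S}\big[\ell(h(x), y)\big],
\qquad
R_T(h) = \mathbb{E}_{(x,y)\sim T}\big[\ell(h(x), y)\big],
```

where training minimizes the source risk $R_S(h)$; under domain shift, $S \neq T$, so a small $R_S(h)$ need not imply a small target risk $R_T(h)$.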

Our attempt to address this problem involves learning a common feature subspace in which the distance between the source and target distributions is minimized. Estimating the distance between different domains is highly non-trivial and is an open research problem in itself. In our approach, we parameterize the distance measure using a Generative Adversarial Network (GAN). A GAN involves a two-player game between two mappings, commonly termed the generator and the discriminator. These mappings are learned simultaneously through an adversarial game: the generator tries to fool the discriminator, while the discriminator tries to outperform the generator. This adversarial game can be formulated as a minimax problem. In our approach, we learn three mappings simultaneously: the generator, the discriminator, and a feature mapping that captures both the content and the domain of the input. We deploy a two-level minimax game: the first level is a competition between the generator and a discriminator, as in a standard GAN; in the second level, the feature mapping attempts to fool the discriminator, thereby introducing domain invariance in the learned feature representation. We have extensively evaluated this approach on tasks such as object classification and semantic segmentation, achieving state-of-the-art results across several real datasets. In addition to its conceptual novelty, our approach presents a more efficient and scalable solution than other approaches to the same problem.
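The first-level game referred to above is the standard GAN minimax objective of Goodfellow et al.:

```latex
\min_G \max_D \; V(D, G)
= \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big]
+ \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big].
```

In the two-level game described above, a feature mapping (call it $F$; the symbol is ours) additionally plays against the same discriminator $D$, so that features extracted from source and target inputs become indistinguishable to $D$ and hence domain-invariant.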

In the final part of this dissertation, we describe some ongoing efforts and future directions of research. Inspired by the study of perturbations described above, we propose a novel metric for effectively choosing which pixels to label in an image for a pixel-wise segmentation task. This has the potential to significantly reduce labeling effort, and our preliminary results for semantic segmentation are encouraging. While the domain adaptation approach proposed above considered static images, we propose an extension to video data aided by recurrent neural networks. Using full temporal information, when available, gives the perceptual system additional context to disambiguate among the smaller object classes that commonly occur in real scenes.