Localization of regions of interest in images and videos is a well-studied problem in the computer vision community. Localization tasks usually involve locating objects in a given image, as in object detection and segmentation. However, a region of interest can be as small as a single pixel, as in facial landmark localization or human pose estimation. This dissertation studies robust facial landmark detection algorithms for faces in the wild using learning methods based on Convolutional Neural Networks.

Detection of specific keypoints on face images is an integral pre-processing step in facial biometrics and numerous other applications, including face verification and identification. Detecting keypoints makes it possible to align face images to a canonical coordinate system using geometric transforms such as similarity or affine transformations, mitigating the adverse effects of rotation and scaling. This challenging problem has attracted growing attention in recent years as a result of advances in deep learning and the release of more unconstrained datasets. The research community is pushing the boundaries to achieve better performance on unconstrained images, which vary widely in pose, expression, and lighting conditions.
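
As an illustration of the alignment step described above, the following is a minimal sketch of estimating a similarity transform (scale, rotation, translation) from detected landmarks to a canonical template by least squares. It is not the dissertation's implementation; the function names `estimate_similarity` and `apply_transform` are hypothetical, and the closed-form SVD solution used here is a standard Procrustes-style fit chosen for the example.

```python
import numpy as np

def estimate_similarity(src, dst):
    """Least-squares similarity transform mapping src landmarks onto dst.

    src, dst: (N, 2) arrays of (x, y) points in corresponding order.
    Returns a 2x3 matrix M with dst ~= src @ M[:, :2].T + M[:, 2].
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - src_mean, dst - dst_mean

    # Cross-covariance between the centred point sets.
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = np.ones(2)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        d[-1] = -1.0

    R = U @ np.diag(d) @ Vt
    src_var = (src_c ** 2).sum() / len(src)
    scale = (S * d).sum() / src_var
    t = dst_mean - scale * (R @ src_mean)
    return np.hstack([scale * R, t[:, None]])

def apply_transform(M, pts):
    """Apply a 2x3 transform M to an (N, 2) array of points."""
    pts = np.asarray(pts, dtype=float)
    return pts @ M[:, :2].T + M[:, 2]
```

In practice, `src` would hold the detected landmarks of a face and `dst` the landmark positions of a fixed template; the resulting matrix can then be passed to an image-warping routine to produce the aligned face crop.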

Over the years, researchers have developed various hand-crafted techniques to extract meaningful features from faces, most of them appearance- and geometry-based. However, these features do not perform well on data collected in unconstrained settings because of large variations in appearance and other nuisance factors. Convolutional Neural Networks (CNNs) have become prominent because of their ability to extract discriminative features. Unlike hand-crafted features, deep CNNs (DCNNs) learn feature extraction and classification from the data itself in an end-to-end fashion. This enables DCNNs to be robust to variations present in the data while improving their discriminative ability.

In this dissertation, we discuss three methods for facial keypoint detection based on Convolutional Neural Networks. The methods are generic and can be extended to the related problem of keypoint detection for human pose estimation. The first method, called Cascaded Local Deep Descriptor Regression, uses deep features extracted around local points to learn linear regressors that incrementally correct the initial estimate of the keypoints. In the second method, called KEPLER, we develop efficient Heatmap CNNs to directly learn the non-linear mapping between the input and target spaces. We also apply different regularization techniques to tackle the effects of imbalanced data and vanishing gradients. In the third method, we model the spatial correlation between different keypoints using Pose Conditioned Convolution Deconvolution Networks (PCD-CNN), while at the same time making the model pose agnostic by disentangling pose from the face image. Next, we show an application of facial landmark localization, in which it is used to align face images for the task of apparent age estimation of humans from unconstrained images.
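
To make the heatmap representation used by heatmap-based keypoint detectors concrete, the sketch below shows one common way to encode keypoints as per-keypoint Gaussian heatmaps and to decode predicted heatmaps back to coordinates by taking the peak. This is an assumption-laden illustration, not the encoding used in KEPLER or PCD-CNN: the function names, the Gaussian target, and the default `sigma` are all hypothetical choices made for the example.

```python
import numpy as np

def keypoints_to_heatmaps(keypoints, height, width, sigma=2.0):
    """Render one Gaussian heatmap per (x, y) keypoint.

    keypoints: (K, 2) array of pixel coordinates.
    Returns an array of shape (K, height, width) with peak value 1.0.
    """
    keypoints = np.asarray(keypoints, dtype=float)
    ys, xs = np.mgrid[0:height, 0:width]
    # Broadcast the pixel grid against every keypoint centre.
    dx = xs[None] - keypoints[:, 0, None, None]
    dy = ys[None] - keypoints[:, 1, None, None]
    return np.exp(-(dx ** 2 + dy ** 2) / (2.0 * sigma ** 2))

def heatmaps_to_keypoints(heatmaps):
    """Decode (K, H, W) heatmaps to (K, 2) integer (x, y) peak locations."""
    k, h, w = heatmaps.shape
    flat_idx = heatmaps.reshape(k, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, (h, w))
    return np.stack([xs, ys], axis=1)
```

A network trained against such targets outputs one heatmap per keypoint, and the decoding step recovers pixel coordinates from the location of each map's maximum.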

In the fourth part of this dissertation, we discuss the impact of good-quality landmarks on the task of face verification. Previously proposed methods perform with reasonable accuracy on high-resolution, good-quality images, but fail when the input image suffers from degradation. To this end, we propose a semi-supervised method that aims at predicting landmarks in low-quality images. This method learns to predict landmarks in low-resolution images by learning to model the learning process of high-resolution images. In this algorithm, we use Generative Adversarial Networks, which first learn to model the distribution of real low-resolution images, after which another CNN learns to model the distribution of heatmaps on the images. Additionally, we propose another high-quality facial landmark detection method, which is currently state of the art.

Finally, we discuss the extension of the ideas developed for facial keypoint localization to the task of human pose estimation, which is one of the important cues for Human Activity Recognition. As in PCD-CNN, the parts of the human body can also be modelled in a tree structure, where the relationships between these parts are learnt through convolutions while being conditioned on the 3D pose and orientation. Another interesting avenue for research is extending facial landmark localization to naturally degraded images.