Kumar, Amit
Chellappa, Rama
Localization of regions of interest in images and videos is a well-studied problem in the computer vision community. Localization tasks usually involve locating objects in a given image, as in object detection and segmentation. However, a region of interest can be as small as a single pixel, as in facial landmark localization or human pose estimation. This dissertation studies robust facial landmark detection algorithms for faces in the wild using learning methods based on Convolutional Neural Networks. Detection of specific keypoints on face images is an integral pre-processing step in facial biometrics and numerous other applications, including face verification and identification. Detecting keypoints makes it possible to align face images to a canonical coordinate system using geometric transforms such as similarity or affine transformations, mitigating the adverse effects of rotation and scaling. This challenging problem has become more attractive in recent years as a result of advances in deep learning and the release of more unconstrained datasets. The research community is pushing boundaries to achieve ever better performance on unconstrained images, which are diverse in pose, expression and lighting conditions. Over the years, researchers have developed various hand-crafted techniques to extract meaningful features from face images, most of them appearance- and geometry-based. However, these features do not perform well on data collected in unconstrained settings because of large variations in appearance and other nuisance factors. Convolutional Neural Networks (CNNs) have become prominent because of their ability to extract discriminative features. Unlike hand-crafted features, deep CNNs (DCNNs) learn feature extraction and classification from the data itself in an end-to-end fashion.
This enables DCNNs to be robust to variations present in the data while improving their discriminative ability. In this dissertation, we discuss three different methods for facial keypoint detection based on Convolutional Neural Networks. The methods are generic and can be extended to the related problem of keypoint detection for human pose estimation. The first method, called Cascaded Local Deep Descriptor Regression, uses deep features extracted around local points to learn linear regressors that incrementally correct the initial estimate of the keypoints. In the second method, called KEPLER, we develop efficient Heatmap CNNs to directly learn the non-linear mapping between the input and target spaces. We also apply different regularization techniques to tackle the effects of imbalanced data and vanishing gradients. In the third method, we model the spatial correlation between different keypoints using Pose Conditioned Convolution Deconvolution Networks (PCD-CNN), while making the model pose agnostic by disentangling pose from the face image. Next, we show an application of facial landmark localization, in which it is used to align face images for apparent age estimation from unconstrained images. In the fourth part of this dissertation, we discuss the impact of good-quality landmarks on the task of face verification. Previously proposed methods perform with reasonable accuracy on high-resolution, good-quality images but fail when the input image suffers from degradation. To this end, we propose a semi-supervised method that aims to predict landmarks in low-quality images. This method learns to predict landmarks in low-resolution images by learning to model the learning process of high-resolution images.
In this algorithm, we use Generative Adversarial Networks, which first learn to model the distribution of real low-resolution images, after which another CNN learns to model the distribution of heatmaps on the images. We also propose another high-quality facial landmark detection method, which is currently state of the art. Finally, we discuss extending the ideas developed for facial keypoint localization to human pose estimation, which is one of the important cues for Human Activity Recognition. As in PCD-CNN, the parts of the human body can be modelled in a tree structure, where the relationships between these parts are learnt through convolutions while being conditioned on the 3D pose and orientation. Another interesting avenue for research is extending facial landmark localization to naturally degraded images.
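The alignment step mentioned in the abstract, mapping detected keypoints to a canonical coordinate system with a similarity transform, can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names `fit_similarity` and `apply_similarity`, the three-point template, and the detected coordinates are all hypothetical, chosen only to show how rotation, uniform scale and translation can be estimated from keypoint pairs by least squares.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def fit_similarity(src: List[Point], dst: List[Point]) -> Tuple[complex, complex]:
    """Least-squares similarity transform (rotation, uniform scale, translation).

    Points are treated as complex numbers z = x + iy, so the transform is
    w = a * z + b, where abs(a) is the scale and the phase of a is the rotation.
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two matching point pairs")
    z = [complex(x, y) for x, y in src]
    w = [complex(x, y) for x, y in dst]
    z_mean = sum(z) / len(z)
    w_mean = sum(w) / len(w)
    # Centre both point sets, then solve for a in closed form.
    num = sum((wi - w_mean) * (zi - z_mean).conjugate() for zi, wi in zip(z, w))
    den = sum(abs(zi - z_mean) ** 2 for zi in z)
    if den == 0:
        raise ValueError("source points must not all coincide")
    a = num / den
    b = w_mean - a * z_mean
    return a, b


def apply_similarity(a: complex, b: complex, pts: List[Point]) -> List[Point]:
    """Apply w = a * z + b to every point."""
    out = []
    for x, y in pts:
        w = a * complex(x, y) + b
        out.append((w.real, w.imag))
    return out


# Hypothetical canonical template: left eye, right eye, nose tip.
template = [(30.0, 40.0), (70.0, 40.0), (50.0, 60.0)]
# Hypothetical detected keypoints on a rotated, scaled face.
detected = [(112.0, 95.0), (185.0, 120.0), (140.0, 150.0)]

a, b = fit_similarity(detected, template)
aligned = apply_similarity(a, b, detected)
```

The same `a` and `b` would then be used to warp the image itself, so that every face reaches the downstream verification or age-estimation model in roughly the same position, scale and orientation. An affine transform would need a separate fit with more free parameters.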