MULTI-DOMAIN BIOMETRIC RECOGNITION USING FACE AND BODY EMBEDDINGS
Abstract
Although image- or video-based biometric recognition boasts excellent performance in the visible spectrum, even under unconstrained conditions with variations in pose, illumination, and resolution, biometric recognition in more challenging domains, such as infrared, surveillance, or long-range imagery, remains difficult due to domain shifts and limited labeled data. In this dissertation, we study the problem of multi-domain biometric recognition using face and body embeddings on the IARPA JANUS Benchmark Multi-domain Face (IJB-MDF) dataset.
While systems based on deep neural networks have produced remarkable performance on many tasks, such as face/object detection and recognition, they also require large amounts of labeled training data. However, in many applications, collecting a relatively large labeled training dataset may not be feasible due to time or financial constraints. Training deep networks on such small datasets in the standard manner usually leads to severe over-fitting and poor generalization. We explore how a state-of-the-art deep learning pipeline for unconstrained visual face identification and verification can be adapted to domains with scarce data and label availability using a semi-supervised learning approach. The rationale for system adaptation and the experiments are set in the following context: given a network pretrained on a large training dataset in the source domain, adapt it to generalize to a target domain using a relatively small labeled training set (typically a hundred to ten thousand times smaller) together with an unlabeled one. We present algorithms and the results of extensive experiments with varying training dataset sizes, compositions, and model architectures, using the IJB-MDF dataset for training and evaluation, with the visible and short-wave infrared (SWIR) domains as the source and target domains, respectively.
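One common way to use the unlabeled target-domain data in such a semi-supervised setting is pseudo-labeling: the source-pretrained model labels the unlabeled target images it is confident about, and those samples are folded into the small labeled set for fine-tuning. The sketch below is illustrative only, assuming a nearest-class-mean classifier over embeddings and a hypothetical confidence threshold; it is not the dissertation's exact adaptation algorithm.

```python
import numpy as np

def pseudo_label(embeddings, class_means, threshold=0.7):
    """Assign pseudo-labels via cosine similarity to class means.

    Returns the indices of confidently labeled samples and their
    labels. (Illustrative sketch; the threshold value is an
    assumption, and the full pipeline is more involved.)
    """
    # L2-normalize embeddings and class means for cosine similarity
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    m = class_means / np.linalg.norm(class_means, axis=1, keepdims=True)
    sims = e @ m.T                    # (N, C) cosine similarities
    labels = sims.argmax(axis=1)      # most similar class per sample
    conf = sims.max(axis=1)
    keep = conf >= threshold          # keep only confident samples
    return np.where(keep)[0], labels[keep]
```

Samples whose best similarity falls below the threshold are simply left unlabeled rather than risk polluting the fine-tuning set with wrong labels.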
Next, we tackle more challenging domains, including visible surveillance, body-worn imagery, remote videos (captured at 300m, 400m, and 500m), and short-wave infrared videos (captured at 15m and 30m). While significant research has been done in the fields of domain adaptation and domain generalization, in this dissertation we tackle scenarios in which these methods have limited applicability owing to the lack of training data from the target domains. We focus on the single-source (visible), multi-target face recognition task. We demonstrate that the template generation algorithm plays a crucial role, especially as the complexity of the target domain increases. We propose a template generation algorithm called Norm Pooling (and a variant known as Sparse Pooling) and show that it outperforms traditional average pooling across different domains and network architectures on the IJB-MDF dataset.
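To make the contrast with average pooling concrete, the sketch below shows the baseline template (an unweighted mean of per-image embeddings) next to a norm-weighted variant. This is a plausible reading of Norm Pooling under the common assumption that embedding norm acts as a quality proxy; the dissertation's exact formulation may differ.

```python
import numpy as np

def average_pool(embeddings):
    """Baseline template: unweighted mean of per-image embeddings."""
    return embeddings.mean(axis=0)

def norm_pool(embeddings):
    """Norm-weighted template: weight each embedding by its L2 norm.

    Assumes (as a sketch) that higher-norm embeddings correspond to
    higher-quality images, so they dominate the pooled template.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)  # (N, 1)
    return (embeddings * norms).sum(axis=0) / norms.sum()
```

With average pooling, a low-quality frame pulls the template toward itself as strongly as a sharp frontal image does; a quality-aware weighting damps that effect, which is consistent with the pooling choice mattering more as the target domain gets harder.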
Biometric recognition becomes increasingly challenging as we move away from the visible spectrum to infrared imagery, where domain discrepancies significantly impact identification performance. We show that body embeddings outperform face embeddings for cross-spectral person identification in the medium-wave infrared (MWIR) and long-wave infrared (LWIR) domains. Due to the lack of multi-domain datasets, previous research on cross-spectral body identification, also known as Visible-Infrared Person Re-Identification (VI-ReID), has primarily focused on individual infrared bands, such as near-infrared (NIR) or LWIR, separately. We address the multi-domain body recognition problem using the IJB-MDF dataset, which enables matching of SWIR, MWIR, and LWIR images against RGB (VIS) images. We leverage a vision transformer architecture to establish benchmark results on the IJB-MDF dataset and, through extensive experiments, provide valuable insights into the interrelation of infrared domains, the adaptability of VIS-pretrained models, the role of local semantic features in body embeddings, and effective training strategies for small datasets. Additionally, we show that fine-tuning a body model pretrained exclusively on VIS data with a simple combination of cross-entropy and triplet losses achieves state-of-the-art results on the LLCM dataset.
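The "simple combination of cross-entropy and triplet losses" can be sketched as below. The margin and loss weighting here are illustrative assumptions, not the dissertation's reported hyperparameters; the point is only the structure of the objective, an identity-classification term plus a metric-learning term.

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single sample (stable log-sum-exp)."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Margin-based triplet loss on Euclidean embedding distances."""
    d_ap = np.linalg.norm(anchor - positive)   # anchor-positive distance
    d_an = np.linalg.norm(anchor - negative)   # anchor-negative distance
    return max(0.0, d_ap - d_an + margin)

def combined_loss(logits, label, anchor, positive, negative,
                  margin=0.3, w_tri=1.0):
    """CE + triplet objective; margin and weight are assumptions."""
    return cross_entropy(logits, label) + w_tri * triplet_loss(
        anchor, positive, negative, margin)
```

The cross-entropy term supervises identity classification while the triplet term pulls same-identity embeddings together across spectra and pushes different identities apart, which is why this pairing is a common fine-tuning recipe in Re-ID.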
Finally, we integrate Side Information Embedding (SIE) into the ViT architecture and examine the impact of encoding domain and camera information to enhance cross-spectral matching. Surprisingly, our results show that encoding only the camera information, without explicitly incorporating domain information, achieves state-of-the-art performance on the LLCM dataset. While occlusion handling has been extensively studied in visible-spectrum person re-identification (Re-ID), occlusions in VI-ReID remain largely underexplored, primarily because existing VI-ReID datasets, such as LLCM, SYSU-MM01, and RegDB, predominantly feature full-body, unoccluded images. To address this gap, we analyze the impact of range-induced occlusions using the IJB-MDF dataset, which provides a diverse set of visible and infrared images captured at various distances, enabling cross-range, cross-spectral evaluations.
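Mechanically, SIE amounts to adding a learnable side-information vector to the token sequence before the transformer blocks. The sketch below shows the camera-only variant described above: each camera gets its own learnable embedding, and no separate domain (spectrum) embedding is added. The function name, table layout, and scale factor are assumptions for illustration.

```python
import numpy as np

def add_camera_sie(tokens, camera_table, camera_id, scale=1.0):
    """Add a per-camera Side Information Embedding to all tokens.

    tokens:       (num_tokens, dim) cls/patch token features
    camera_table: (num_cameras, dim) learnable camera embeddings
    camera_id:    index of the capturing camera
    scale:        SIE weighting factor (an assumed hyperparameter)

    Only camera identity is encoded; no explicit domain embedding is
    added, matching the camera-only configuration.
    """
    return tokens + scale * camera_table[camera_id]
```

Because every camera in a VI-ReID setup captures a single spectrum, a camera embedding implicitly carries domain information, which offers one plausible reading of why the camera-only encoding suffices.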