Electronic Theses and Dissertations

Harnessing visual context information to improve face identification accuracy and explainability

Date

2024-05-08

Author

Phan, Hai

Type of Degree

PhD Dissertation

Department

Computer Science and Software Engineering

Abstract

Face identification (FI) is ubiquitous and drives many high-stakes decisions made by law enforcement. A common FI approach compares two images by taking the cosine similarity between their image embeddings. Yet, such an approach suffers from poor out-of-distribution (OOD) generalization to new types of images (e.g., when a query face is masked, cropped, or rotated) that are not included in the training set or the gallery. Recently, interpretable deep metric learning with structural matching (e.g., DIML [101] and Vision Transformers [27]) has achieved strong results in popular computer vision problems such as image classification and image clustering. In this dissertation, we present simple yet efficient schemes that exploit structural similarity for interpretable face-matching algorithms. We propose the following two novel methods:

• DeepFace-EMD: a re-ranking approach that compares two faces using the Earth Mover's Distance on the deep, spatial features of image patches (see the first sketch below).

• Face-ViT: a novel architectural design that uses Vision Transformers (ViTs) for out-of-distribution (OOD) face identification and shows a significant improvement in inference speed (see the second sketch below). We feed the embeddings of both images, extracted by a CNN pre-trained with ArcFace [22], through the layers of a Transformer encoder and two linear layers, as in a ViT. We train the model with 2M pairs sampled from CASIA-WebFace [93].

Our extra comparison stage explicitly examines image similarity at a fine-grained level (e.g., eyes to eyes) and is more robust to OOD perturbations and occlusions than traditional FI. Interestingly, without fine-tuning the feature extractors, our methods consistently improve accuracy on all tested OOD queries (masked, cropped, rotated, and adversarial) while obtaining similar results on in-distribution images. Moreover, our model demonstrates significant interpretability through the visualization of cross-attention.
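The two-stage comparison behind DeepFace-EMD can be illustrated with a short, hedged sketch. The code below is not the dissertation's implementation: the function names (cosine_rank, patch_emd_distance, rerank) are hypothetical, the global and patch embeddings are assumed to come from a frozen, pre-trained extractor such as an ArcFace model, the uniform patch weights are the simplest possible choice, and a generic entropic (Sinkhorn) routine stands in for whichever EMD solver the method actually uses.

import numpy as np

def cosine_rank(query_vec, gallery_vecs):
    """Stage 1: rank gallery images by cosine similarity of global embeddings."""
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery_vecs / np.linalg.norm(gallery_vecs, axis=1, keepdims=True)
    return np.argsort(-(g @ q))

def sinkhorn_emd(cost, a, b, reg=0.1, n_iters=100):
    """Entropic-regularized optimal transport (Sinkhorn) as a stand-in EMD solver."""
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u + 1e-9)
        u = a / (K @ v + 1e-9)
    plan = u[:, None] * K * v[None, :]          # approximate transport plan
    return float((plan * cost).sum())

def patch_emd_distance(query_patches, gallery_patches):
    """EMD between L2-normalized spatial patch embeddings (N x D and M x D)."""
    qp = query_patches / np.linalg.norm(query_patches, axis=1, keepdims=True)
    gp = gallery_patches / np.linalg.norm(gallery_patches, axis=1, keepdims=True)
    cost = 1.0 - qp @ gp.T                      # cosine distance between patch pairs
    a = np.full(len(qp), 1.0 / len(qp))         # uniform patch weights (illustrative only)
    b = np.full(len(gp), 1.0 / len(gp))
    return sinkhorn_emd(cost, a, b)

def rerank(query_vec, query_patches, gallery_vecs, gallery_patches_list, top_k=100):
    """Stage 2: re-rank the top-k cosine candidates by patch-level EMD (lower is better)."""
    shortlist = cosine_rank(query_vec, gallery_vecs)[:top_k]
    emds = [patch_emd_distance(query_patches, gallery_patches_list[i]) for i in shortlist]
    return shortlist[np.argsort(emds)]

Because only a short list of candidates is re-ranked, the more expensive patch-level comparison adds little overhead to the standard cosine-similarity pipeline.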
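Similarly, the following is a minimal, hedged sketch of the Face-ViT idea as described above: spatial embeddings of both faces from an ArcFace-pre-trained CNN are processed jointly by Transformer encoder layers and two linear layers that score whether the pair shows the same identity. It uses standard PyTorch modules, but the layer sizes, the learnable summary token, and the binary verification head are illustrative assumptions rather than the dissertation's exact architecture.

import torch
import torch.nn as nn

class FaceViTSketch(nn.Module):
    def __init__(self, patch_dim=512, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(patch_dim, d_model)            # map CNN patch features to d_model
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))  # learnable summary token (assumption)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # two linear layers producing a same/different-identity score
        self.head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))

    def forward(self, patches_a, patches_b):
        # patches_a, patches_b: (B, N, patch_dim) spatial embeddings from a frozen,
        # ArcFace-pre-trained CNN (the extractor itself is not shown here)
        x = torch.cat([patches_a, patches_b], dim=1)          # concatenate both images' patches
        x = self.proj(x)
        cls = self.cls.expand(x.size(0), -1, -1)
        x = self.encoder(torch.cat([cls, x], dim=1))
        return self.head(x[:, 0])                             # logit: same identity or not

Example usage: score = FaceViTSketch()(torch.randn(2, 49, 512), torch.randn(2, 49, 512)). Because both images' patch tokens attend to each other inside the encoder, the cross-attention weights can be visualized to show which facial regions drive the decision, which is the interpretability property highlighted in the abstract.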