Harnessing visual context information to improve face identification accuracy and explainability
Metadata Field | Value | Language |
---|---|---|
dc.contributor.advisor | Nguyen, Anh | |
dc.contributor.author | Phan, Hai | |
dc.date.accessioned | 2024-05-08T17:44:41Z | |
dc.date.available | 2024-05-08T17:44:41Z | |
dc.date.issued | 2024-05-08 | |
dc.identifier.uri | https://etd.auburn.edu//handle/10415/9292 | |
dc.description.abstract | Face identification (FI) is ubiquitous and drives many high-stakes decisions made by law enforcement. A common FI approach compares two images by taking the cosine similarity between their image embeddings. Yet, such an approach suffers from poor out-of-distribution (OOD) generalization to new types of images (e.g., when a query face is masked, cropped, or rotated) not included in the training set or the gallery. Recently, interpretable deep metric learning with structural matching (e.g., DIML [101] and Vision Transformers [27]) has achieved strong results in popular computer vision problems such as image classification and image clustering. In this proposal, we present simple yet efficient schemes that exploit structural similarity for interpretable face matching. We propose the following two novel methods. • DeepFace-EMD: A re-ranking approach that compares two faces using the Earth Mover’s Distance on the deep, spatial features of image patches. • Face-ViT: A novel architectural design using Vision Transformers (ViTs) for OOD face identification, with a significant improvement in inference speed. We feed the embeddings of both images, extracted by a pre-trained ArcFace CNN [22], through the layers of a Transformer encoder and two linear layers as part of a ViT. We train the model with 2M pairs sampled from CASIA-WebFace [93]. Our extra comparison stage explicitly examines image similarity at a fine-grained level (e.g., eyes to eyes) and is more robust to OOD perturbations and occlusions than traditional FI. Interestingly, without fine-tuning the feature extractors, our method consistently improves accuracy on all tested OOD queries (masked, cropped, rotated, and adversarial) while obtaining similar results on in-distribution images. Moreover, our model demonstrates significant interpretability through the visualization of cross-attention. | en_US |
dc.subject | Computer Science and Software Engineering | en_US |
dc.title | Harnessing visual context information to improve face identification accuracy and explainability | en_US |
dc.type | PhD Dissertation | en_US |
dc.embargo.status | NOT_EMBARGOED | en_US |
dc.embargo.enddate | 2024-05-08 | en_US |
dc.contributor.committee | He, Pan | |
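
The abstract describes a two-stage comparison: a global cosine-similarity ranking followed by a re-ranking step based on the Earth Mover's Distance (EMD) over patch embeddings. The sketch below illustrates that idea only; the function names, the uniform patch weights, and the Sinkhorn approximation to the EMD are assumptions made for this example, not the dissertation's implementation.

```python
# Illustrative sketch only: names, uniform patch weights, and the Sinkhorn
# approximation to the Earth Mover's Distance (EMD) are assumptions for this
# example, not code from the dissertation.
import numpy as np

def cosine_similarity(e1, e2):
    """Stage-1 score: cosine similarity between two global face embeddings."""
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))

def cosine_cost(A, B):
    """Pairwise cosine-distance cost matrix between two sets of patch embeddings."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return 1.0 - A @ B.T  # shape: (num_patches_A, num_patches_B)

def approx_emd(cost, reg=0.05, n_iters=200):
    """Approximate EMD with uniform patch weights via Sinkhorn iterations."""
    na, nb = cost.shape
    a = np.full(na, 1.0 / na)            # uniform mass on query patches
    b = np.full(nb, 1.0 / nb)            # uniform mass on gallery patches
    K = np.exp(-cost / reg)
    u = np.ones(na)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]   # entropic transport plan
    return float(np.sum(plan * cost))    # transport cost (lower = more similar)

def rerank(query_patches, gallery_patches, shortlist):
    """Stage 2: re-order a cosine-similarity shortlist by patch-level EMD."""
    dists = [approx_emd(cosine_cost(query_patches, gallery_patches[i]))
             for i in shortlist]
    return [shortlist[i] for i in np.argsort(dists)]
```

In this reading of the abstract, cosine similarity over global embeddings produces a shortlist of gallery candidates, and the patch-level EMD score re-orders that shortlist, which is the re-ranking role attributed to DeepFace-EMD.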