This Is Auburn | Electronic Theses and Dissertations

Harnessing visual context information to improve face identification accuracy and explainability


Metadata

dc.contributor.advisor: Nguyen, Anh
dc.contributor.author: Phan, Hai
dc.date.accessioned: 2024-05-08T17:44:41Z
dc.date.available: 2024-05-08T17:44:41Z
dc.date.issued: 2024-05-08
dc.identifier.uri: https://etd.auburn.edu//handle/10415/9292
dc.description.abstract [en_US]:

Face identification (FI) is ubiquitous and drives many high-stakes decisions made by law enforcement. A common FI approach compares two images by taking the cosine similarity between their image embeddings. Yet, such an approach suffers from poor out-of-distribution (OOD) generalization to new types of images (e.g., when a query face is masked, cropped, or rotated) that are not included in the training set or the gallery. Recently, interpretable deep metric learning with structural matching (e.g., DIML [101] and Vision Transformers [27]) has achieved strong results in popular computer vision problems such as image classification and image clustering. In this dissertation, we present simple yet efficient schemes that exploit structural similarity for interpretable face matching algorithms. We propose the following two novel methods:

• DeepFace-EMD: A re-ranking approach that compares two faces using the Earth Mover's Distance on the deep, spatial features of image patches.

• Face-ViT: A novel architectural design using Vision Transformers (ViTs) for out-of-distribution (OOD) face identification that also delivers a significant improvement in inference speed. We feed the embeddings of both images, extracted by a pre-trained ArcFace CNN [22], through the layers of a Transformer encoder and two linear layers, as in a ViT. We train the model with 2M pairs sampled from CASIA-WebFace [93].

Our extra comparison stage explicitly examines image similarity at a fine-grained level (e.g., eyes to eyes) and is more robust to OOD perturbations and occlusions than traditional FI. Interestingly, without finetuning the feature extractors, our method consistently improves accuracy on all tested OOD queries (masked, cropped, rotated, and adversarial) while obtaining similar results on in-distribution images. Moreover, our model demonstrates significant interpretability through the visualization of cross-attention.
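For concreteness, the baseline that the abstract critiques reduces to a nearest-neighbor search under cosine similarity over precomputed embeddings. A minimal sketch, assuming embeddings are already extracted; the names (`identify`, `gallery`) are illustrative, not from the dissertation:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(a, b) = a.b / (|a||b|); 1.0 means identical direction.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(query: np.ndarray, gallery: dict[str, np.ndarray]) -> str:
    # Rank every gallery identity by similarity to the query embedding
    # and return the best match (the standard first-stage FI step).
    return max(gallery, key=lambda name: cosine_similarity(query, gallery[name]))
```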
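DeepFace-EMD's patch-level comparison can be pictured as an optimal-transport re-ranking step on top of that baseline. The sketch below assumes the POT library's `ot.emd2` solver and uses uniform patch weights as a simplification of the method's actual patch weighting; all function names are illustrative:

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install POT)

def emd_distance(patches_q: np.ndarray, patches_g: np.ndarray) -> float:
    # patches_*: (P, D) L2-normalized spatial features from the face CNN.
    # Ground cost between two patches is 1 - cosine similarity.
    cost = 1.0 - patches_q @ patches_g.T
    w = np.full(len(patches_q), 1.0 / len(patches_q))  # uniform patch weights
    return float(ot.emd2(w, w, cost))  # exact Earth Mover's Distance

def rerank(query_patches, shortlist, patch_feats, k=100):
    # Stage 1 produced `shortlist` (ids ranked by cosine similarity);
    # stage 2 re-ranks the top k by patch-level EMD (ascending).
    top = shortlist[:k]
    return sorted(top, key=lambda i: emd_distance(query_patches, patch_feats[i]))
```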
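The Face-ViT description above (pre-trained ArcFace features for both images, a Transformer encoder, and two linear layers trained on image pairs) suggests a pair-classification model. A minimal PyTorch sketch under those assumptions; the [CLS]-token design, layer counts, and dimensions are illustrative guesses rather than the dissertation's exact architecture:

```python
import torch
import torch.nn as nn

class FaceViT(nn.Module):
    """Pair matcher: decides whether two faces show the same identity.

    Inputs are sequences of spatial feature tokens assumed to come from
    a frozen, pre-trained ArcFace CNN; all sizes here are illustrative.
    """

    def __init__(self, dim: int = 512, n_layers: int = 4, n_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))    # ViT-style [CLS] token
        self.head = nn.Sequential(                          # the "two linear layers"
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, tokens_a: torch.Tensor, tokens_b: torch.Tensor) -> torch.Tensor:
        # tokens_a, tokens_b: (batch, n_patches, dim) ArcFace feature tokens.
        cls = self.cls.expand(tokens_a.size(0), -1, -1)
        x = torch.cat([cls, tokens_a, tokens_b], dim=1)    # one joint sequence
        x = self.encoder(x)  # attention lets patches of one face attend to the other
        return self.head(x[:, 0]).squeeze(-1)              # same/different logit
```

Trained with a binary cross-entropy loss on same/different pairs (the abstract's 2M pairs from CASIA-WebFace), such a logit could then re-score query-gallery pairs.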
dc.subject: Computer Science and Software Engineering [en_US]
dc.title: Harnessing visual context information to improve face identification accuracy and explainability [en_US]
dc.type: PhD Dissertation [en_US]
dc.embargo.status: NOT_EMBARGOED [en_US]
dc.embargo.enddate: 2024-05-08 [en_US]
dc.contributor.committee: He, Pan
