Evaluating Deep Learning-Based Speaker Verification Systems: A Comparative Study Across Open-Source and Forensic Datasets
| Authors | |
|---|---|
| Publication date | 2026 |
| Host editors | |
| Book title | Analysis of Images, Social Networks and Texts |
| Book subtitle | 12th International Conference, AIST 2024, Bishkek, Kyrgyzstan, October 17–19, 2024: Revised Selected Papers |
| ISBN | |
| ISBN (electronic) | |
| Series | Communications in Computer and Information Science |
| Event | 12th International Conference on Analysis of Images, Social Networks and Texts |
| Pages (from-to) | 153–163 |
| Publisher | Cham: Springer |
| Organisations | |
| Abstract | Speaker verification (SV) is the task of determining whether the speech in two audio signals originates from the same speaker or from different speakers. Current state-of-the-art SV systems are based on deep neural networks, predominantly trained on the VoxCeleb dataset, which may lead to varying performance when the models are used for inference on real-world data. To investigate these possible variations, three established SV models, namely ECAPA-TDNN, ResNet and WavLM, are evaluated on the UCLA Variability, CommonVoice, FRIDA and Wyred datasets. The ECAPA-TDNN and ResNet models perform slightly worse than in the VoxCeleb evaluation, while the WavLM model performs significantly worse. The ResNet model shows the best performance on all four datasets. After evaluation, the ResNet model is improved by fine-tuning it on the UCLA dataset and, further, by creating a Deep Weight Space Ensemble (WSE) of the pre-trained and fine-tuned models. Of the pre-trained, fine-tuned and WSE models, the WSE model has the best overall performance, attaining the best scores on the UCLA test set, while its scores on the other three datasets decrease less than those of the fine-tuned model. This indicates that fine-tuning combined with WSE can alleviate the loss in model performance on real-world data. |
| Document type | Conference contribution |
| Language | English |
| Published at | https://doi.org/10.1007/978-3-031-97019-1_12 |
| Permalink to this page | |
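The Deep Weight Space Ensemble mentioned in the abstract is commonly realized as a linear interpolation between the parameters of the pre-trained and the fine-tuned model. A minimal sketch, assuming per-layer scalar weights and a hypothetical mixing coefficient `alpha` (the paper's exact interpolation scheme is not specified here):

```python
def weight_space_ensemble(pretrained, finetuned, alpha=0.5):
    """Linearly interpolate two models' parameters, layer by layer.

    alpha = 0.0 recovers the pre-trained weights,
    alpha = 1.0 recovers the fine-tuned weights,
    intermediate values blend the two in weight space.
    """
    return {
        name: (1.0 - alpha) * pretrained[name] + alpha * finetuned[name]
        for name in pretrained
    }


# Toy example with one scalar "weight" per layer; real models would
# interpolate full parameter tensors (e.g. entries of a state dict).
pretrained = {"layer1": 0.2, "layer2": -1.0}
finetuned = {"layer1": 0.6, "layer2": 0.0}
wse = weight_space_ensemble(pretrained, finetuned, alpha=0.5)
# wse["layer1"] ≈ 0.4, wse["layer2"] ≈ -0.5
```

With tensor-valued parameters the same elementwise formula applies; `alpha` trades off performance on the fine-tuning data (here the UCLA set) against retention of the pre-trained model's behavior on the other datasets.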