Evaluating Deep Learning-Based Speaker Verification Systems: A Comparative Study Across Open-Source and Forensic Datasets

Authors
Publication date 2026
Host editors
  • A. Panchenko
  • D. Gubanov
  • M. Khachay
  • A. Kuznetsov
  • N. Loukachevitch
  • I. Nikishina
  • M. Panov
  • P.M. Pardalos
  • A.V. Savchenko
  • E. Tsymbalov
  • E. Tutubalina
  • A. Kasieva
  • D.I. Ignatov
Book title Analysis of Images, Social Networks and Texts
Book subtitle 12th International Conference, AIST 2024, Bishkek, Kyrgyzstan, October 17–19, 2024: Revised Selected Papers
ISBN
  • 9783031970184
ISBN (electronic)
  • 9783031970191
Series Communications in Computer and Information Science
Event 12th International Conference on Analysis of Images, Social Networks and Texts
Pages (from-to) 153-163
Publisher Cham: Springer
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
Speaker verification (SV) is the task of determining whether the speech in two audio signals originates from the same speaker or from different speakers. Current state-of-the-art SV systems are based on deep neural networks, predominantly trained on the VoxCeleb dataset. This may lead to varying SV performance when the models are used for inference on real-world data. To investigate these possible variations in performance, three established SV models, namely ECAPA-TDNN, ResNet and WavLM, are evaluated on the UCLA Variability, CommonVoice, FRIDA and Wyred datasets. The ECAPA-TDNN and ResNet models perform slightly worse than in the VoxCeleb evaluation, while the WavLM model performs significantly worse. The ResNet model shows the best performance on all four datasets. After evaluation, the ResNet model is improved by fine-tuning it on the UCLA dataset and, further, by creating a Deep Weight Space Ensemble (WSE) model between the pre-trained and fine-tuned models. Of the pre-trained, fine-tuned and WSE models, the WSE model has the best overall performance, attaining the best scores on the UCLA test set, while its scores on the other three datasets decrease less than those of the fine-tuned model. This indicates that fine-tuning combined with WSE can alleviate the loss in model performance on real-world data.
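The Weight Space Ensemble mentioned in the abstract linearly interpolates the parameters of a pre-trained and a fine-tuned model of identical architecture. A minimal sketch of that idea, assuming model parameters are represented as dictionaries of flat float lists rather than actual framework tensors; the function name `weight_space_ensemble` and the mixing coefficient `alpha` are illustrative choices, not names from the paper:

```python
def weight_space_ensemble(pretrained, finetuned, alpha=0.5):
    """Interpolate two sets of model weights parameter-by-parameter.

    pretrained, finetuned: dicts mapping parameter name -> list of floats,
    assumed to share the same keys and shapes (same architecture).
    alpha: interpolation weight toward the fine-tuned model
    (alpha=0 recovers the pre-trained model, alpha=1 the fine-tuned one).
    """
    return {
        name: [(1.0 - alpha) * p + alpha * f
               for p, f in zip(weights, finetuned[name])]
        for name, weights in pretrained.items()
    }

# Toy usage: ensemble a single two-parameter "layer"
pre = {"layer1": [0.0, 2.0]}
ft = {"layer1": [1.0, 4.0]}
print(weight_space_ensemble(pre, ft, alpha=0.5))  # {'layer1': [0.5, 3.0]}
```

In a real system the same interpolation would be applied over the framework's parameter containers (e.g. a PyTorch `state_dict`), which is what makes WSE a cheap way to trade off in-domain gains from fine-tuning against retained out-of-domain performance.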
Document type Conference contribution
Language English
Published at https://doi.org/10.1007/978-3-031-97019-1_12