Evaluating Deep Learning-Based Speaker Verification Systems: A Comparative Study Across Open-Source and Forensic Datasets
| Authors | |
|---|---|
| Publication date | 2026 |
| Host editors | |
| Book title | Analysis of Images, Social Networks and Texts |
| Book subtitle | 12th International Conference, AIST 2024, Bishkek, Kyrgyzstan, October 17–19, 2024: Revised Selected Papers |
| ISBN | |
| ISBN (electronic) | |
| Series | Communications in Computer and Information Science |
| Event | 12th International Conference on Analysis of Images, Social Networks and Texts |
| Pages (from-to) | 153–163 |
| Publisher | Cham: Springer |
| Organisations | |
| Abstract | Speaker verification (SV) is the task of determining whether the speech in two audio signals originates from the same speaker or from different speakers. Current state-of-the-art SV systems are based on deep neural networks, predominantly trained on the VoxCeleb dataset, which may lead to varying performance when the models are used for inference on real-world data. To investigate these possible variations, three established SV models, namely ECAPA-TDNN, ResNet and WavLM, are evaluated on the UCLA Variability, CommonVoice, FRIDA and Wyred datasets. The ECAPA-TDNN and ResNet models perform slightly worse than in the VoxCeleb evaluation, while the WavLM model performs significantly worse. The ResNet model shows the best performance on all four datasets. After evaluation, the ResNet model is improved by fine-tuning it on the UCLA dataset and, further, by creating a Deep Weight Space Ensemble (WSE) of the pre-trained and fine-tuned models. Of the pre-trained, fine-tuned and WSE models, the WSE model has the best overall performance, attaining the best scores on the UCLA test set, while its scores on the other three datasets decrease less than those of the fine-tuned model. This indicates that fine-tuning combined with WSE can alleviate the loss in model performance on real-world data. |
| Document type | Conference contribution |
| Language | English |
| Published at | https://doi.org/10.1007/978-3-031-97019-1_12 |
| Permalink to this page | |
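The Deep Weight Space Ensemble mentioned in the abstract is commonly realized as a linear interpolation between the parameters of the pre-trained and the fine-tuned model. A minimal sketch, assuming per-layer scalar weights and a hypothetical mixing coefficient `alpha` (the paper's exact interpolation scheme is not specified here):

```python
def weight_space_ensemble(pretrained, finetuned, alpha=0.5):
    """Linearly interpolate two models' parameters, layer by layer.

    alpha = 0.0 recovers the pre-trained weights,
    alpha = 1.0 recovers the fine-tuned weights,
    intermediate values blend the two in weight space.
    """
    return {
        name: (1.0 - alpha) * pretrained[name] + alpha * finetuned[name]
        for name in pretrained
    }


# Toy example with one scalar "weight" per layer; real models would
# interpolate full parameter tensors (e.g. entries of a state dict).
pretrained = {"layer1": 0.2, "layer2": -1.0}
finetuned = {"layer1": 0.6, "layer2": 0.0}
wse = weight_space_ensemble(pretrained, finetuned, alpha=0.5)
# wse["layer1"] ≈ 0.4, wse["layer2"] ≈ -0.5
```

With tensor-valued parameters the same elementwise formula applies; `alpha` trades off performance on the fine-tuning data (here the UCLA set) against retention of the pre-trained model's behavior on the other datasets.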