Negative Sampling Techniques for Dense Passage Retrieval in a Multilingual Setting

Open Access
Authors
Publication date 2024
Book title SIGIR '24
Book subtitle Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval : July 14-18, 2024, Washington, DC, USA
ISBN (electronic)
  • 9798400704314
Event 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024
Pages (from-to) 575-584
Publisher New York, NY: Association for Computing Machinery
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
The bi-encoder transformer architecture has become popular in open-domain retrieval, surpassing traditional sparse retrieval methods. Using hard negatives during training can improve the effectiveness of dense retrievers, and various techniques have been proposed to generate these hard negatives. We investigate the effectiveness of multiple negative sampling methods based on lexical methods (BM25), clustering, and periodically updated dense indices. We examine techniques that were introduced for finding hard negatives in a monolingual setting and reproduce them in a multilingual setting. We discover a gap amongst these techniques that we fill by proposing a novel clustered training method. Specifically, we focus on monolingual retrieval using multilingual dense retrievers across a broad set of diverse languages. We find that negative sampling based on BM25 negatives is surprisingly effective in an in-distribution setting, but this finding does not generalize to out-of-distribution and zero-shot settings, where the newly proposed method achieves the best results. We conclude with recommendations on which negative sampling methods may be the most effective given different multilingual retrieval scenarios.
Document type Conference contribution
Language English
Published at https://doi.org/10.1145/3626772.3657854