Negative Sampling Techniques for Dense Passage Retrieval in a Multilingual Setting
| Authors | |
|---|---|
| Publication date | 2024 |
| Book title | SIGIR '24 |
| Book subtitle | Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval : July 14-18, 2024, Washington, DC, USA |
| ISBN (electronic) | |
| Event | 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024 |
| Pages (from-to) | 575-584 |
| Publisher | New York, NY: Association for Computing Machinery |
| Organisations | |
| Abstract | The bi-encoder transformer architecture has become popular in open-domain retrieval, surpassing traditional sparse retrieval methods. Using hard negatives during training can improve the effectiveness of dense retrievers, and various techniques have been proposed to generate these hard negatives. We investigate the effectiveness of multiple negative sampling methods based on lexical methods (BM25), clustering, and periodically updated dense indices. We examine techniques that were introduced for finding hard negatives in a monolingual setting and reproduce them in a multilingual setting. We discover a gap amongst these techniques that we fill by proposing a novel clustered training method. Specifically, we focus on monolingual retrieval using multilingual dense retrievers across a broad set of diverse languages. We find that negative sampling based on BM25 negatives is surprisingly effective in an in-distribution setting, but this finding does not generalize to out-of-distribution and zero-shot settings, where the newly proposed method achieves the best results. We conclude with recommendations on which negative sampling methods may be the most effective given different multilingual retrieval scenarios. |
| Document type | Conference contribution |
| Language | English |
| Published at | https://doi.org/10.1145/3626772.3657854 |
| Downloads | 3626772.3657854 (Final published version) |
