A Test Collection of Synthetic Documents for Training Rankers: ChatGPT vs. Human Experts

Open Access
Publication date 2023
Book title CIKM '23
Book subtitle Proceedings of the 32nd ACM International Conference on Information and Knowledge Management: October 21-25, 2023, Birmingham, England
ISBN (electronic)
  • 9798400701245
Event 32nd ACM International Conference on Information and Knowledge Management, CIKM 2023
Pages (from-to) 5311-5315
Number of pages 5
Publisher New York, NY: Association for Computing Machinery
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract

We investigate the usefulness of generative large language models (LLMs) in generating training data for cross-encoder re-rankers in a novel direction: generating synthetic documents instead of synthetic queries. We introduce a new dataset, ChatGPT-RetrievalQA, and compare the effectiveness of strong models fine-tuned on both LLM-generated and human-generated data. We build ChatGPT-RetrievalQA based on an existing dataset, the human ChatGPT comparison corpus (HC3), consisting of multiple public question collections featuring both human- and ChatGPT-generated responses. We fine-tune a range of cross-encoder re-rankers on either human-generated or ChatGPT-generated data. Our evaluation on MS MARCO DEV, TREC DL'19, and TREC DL'20 demonstrates that cross-encoder re-ranking models trained on LLM-generated responses are significantly more effective for out-of-domain re-ranking than those trained on human responses. For in-domain re-ranking, however, the human-trained re-rankers outperform the LLM-trained re-rankers. Our novel findings suggest that generative LLMs have high potential in generating training data for neural retrieval models and can be used to augment training data, especially in domains with less labeled data. ChatGPT-RetrievalQA presents various opportunities for analyzing and improving rankers with both human- and LLM-generated data. Our data, code, and model checkpoints are publicly available.
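The core of the dataset construction described above is pairing the same questions with either human-written or ChatGPT-generated responses, yielding two parallel training sets for the re-rankers. A minimal sketch of that pairing step is shown below, assuming HC3-style records with `question`, `human_answers`, and `chatgpt_answers` fields (as in the public HC3 release); the actual ChatGPT-RetrievalQA pipeline additionally derives negatives and relevance judgments, which this illustration omits.

```python
# Sketch: build (query, positive passage) pairs for cross-encoder fine-tuning
# from HC3-style records, selecting either human or ChatGPT answers as positives.
# The pairing logic is an illustrative assumption, not the authors' exact code.

def build_training_pairs(records, source):
    """Return (question, answer) pairs using the chosen answer source."""
    key = {"human": "human_answers", "chatgpt": "chatgpt_answers"}[source]
    pairs = []
    for rec in records:
        for answer in rec.get(key, []):
            pairs.append((rec["question"], answer))
    return pairs

# Toy record in the HC3 layout (content invented for illustration).
records = [
    {
        "question": "Why is the sky blue?",
        "human_answers": ["Rayleigh scattering favours shorter wavelengths."],
        "chatgpt_answers": ["Air molecules scatter blue light more strongly."],
    }
]

human_pairs = build_training_pairs(records, "human")
llm_pairs = build_training_pairs(records, "chatgpt")
```

Fine-tuning one cross-encoder on `human_pairs` and another on `llm_pairs` gives the two model families the paper compares on MS MARCO DEV and TREC DL'19/'20.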

Document type Conference contribution
Language English
Published at https://doi.org/10.1145/3583780.3615111
Other links
  • https://github.com/arian-askari/ChatGPT-RetrievalQA
  • https://www.scopus.com/pages/publications/85178122401