Context Embeddings for Efficient Answer Generation in Retrieval-Augmented Generation

Open Access
Authors
  • D. Rau
  • S. Wang
  • H. Déjean
  • S. Clinchant
Publication date 2025
Book title WSDM '25
Book subtitle Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining: March 10-14, 2025, Hannover, Germany
ISBN (electronic)
  • 9798400713293
Event 18th ACM International Conference on Web Search and Data Mining
Pages (from-to) 493-502
Number of pages 10
Publisher New York, NY: Association for Computing Machinery
Organisations
  • Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract
Retrieval-Augmented Generation (RAG) overcomes the limited knowledge of LLMs by extending the input with external information. As a consequence, the contextual inputs to the model become much longer, which slows down decoding and increases the time a user has to wait for an answer. We address this challenge by presenting COCOM, an effective context compression method that reduces long contexts to only a handful of Context Embeddings, speeding up generation time by a large margin. Our method allows for different compression rates, trading off decoding time for answer quality. Compared to earlier methods, COCOM handles multiple contexts more effectively, significantly reducing decoding time for long inputs. Our method demonstrates an inference speed-up of up to 5.69x while achieving higher performance than existing efficient context compression methods.
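
The sketch below is a toy illustration, not the authors' implementation, of the idea the abstract describes: a long retrieved passage is compressed into a small, fixed number of context embeddings, and the generator decodes from those embeddings plus the question instead of the full token sequence. The cross-attention compressor, all class and variable names, and the dimensions are assumptions made for this example only.

import torch
import torch.nn as nn

class ContextCompressor(nn.Module):
    # Hypothetical compressor: a handful of learned query vectors cross-attend
    # over the token embeddings of a retrieved passage and return one summary
    # vector per query. Compression rate = ctx_len / num_embeddings.
    def __init__(self, d_model: int = 512, num_embeddings: int = 4, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_embeddings, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, context_tokens: torch.Tensor) -> torch.Tensor:
        # context_tokens: (batch, ctx_len, d_model) token embeddings of one passage
        batch = context_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        compressed, _ = self.attn(q, context_tokens, context_tokens)
        return compressed  # (batch, num_embeddings, d_model)

if __name__ == "__main__":
    d_model, ctx_len, num_ctx_emb = 512, 1024, 4
    compressor = ContextCompressor(d_model, num_ctx_emb)
    passage = torch.randn(2, ctx_len, d_model)   # two retrieved passages
    question = torch.randn(2, 16, d_model)       # question token embeddings
    ctx_emb = compressor(passage)                # (2, 4, 512): 256x fewer positions
    # The generator LLM would decode from [ctx_emb ; question] rather than the
    # full 1024-token passage, shrinking the input it must attend over.
    decoder_input = torch.cat([ctx_emb, question], dim=1)
    print(decoder_input.shape)                   # torch.Size([2, 20, 512])

Choosing a smaller num_embeddings gives a higher compression rate and faster decoding at a potential cost in answer quality, which is the trade-off the abstract refers to.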
Document type Conference contribution
Language English
Published at https://doi.org/10.1145/3701551.3703527