Context Embeddings for Efficient Answer Generation in Retrieval-Augmented Generation
| Authors | |
|---|---|
| Publication date | 2025 |
| Book title | WSDM '25 |
| Book subtitle | Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining : March 10-14, 2025, Hannover, Germany |
| ISBN (electronic) | |
| Event | 18th ACM International Conference on Web Search and Data Mining |
| Pages (from-to) | 493-502 |
| Number of pages | 10 |
| Publisher | New York, NY: Association for Computing Machinery |
| Organisations | |
| Abstract | Retrieval-Augmented Generation (RAG) overcomes the limited knowledge of LLMs by extending the input with external information. As a consequence, the contextual inputs to the model become much longer, slowing down decoding and increasing the time a user has to wait for an answer. We address this challenge by presenting COCOM, an effective context compression method that reduces long contexts to only a handful of Context Embeddings, speeding up generation time by a large margin. Our method allows for different compression rates, trading off decoding time against answer quality. Compared to earlier methods, COCOM handles multiple contexts more effectively, significantly reducing decoding time for long inputs. Our method demonstrates an inference speed-up of up to 5.69 times while achieving higher performance than existing efficient context compression methods. |
| Document type | Conference contribution |
| Language | English |
| Published at | https://doi.org/10.1145/3701551.3703527 |
| Downloads | 3701551.3703527 (Final published version) |
| Permalink to this page | |
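To make the abstract's core idea concrete: COCOM learns to compress a retrieved context into a small, fixed number of embedding vectors that the LLM consumes in place of the full token sequence. The sketch below is NOT the paper's learned compressor; it is only an illustrative NumPy stand-in (chunked mean-pooling, a hypothetical `compress_context` helper) showing the dimensional effect of reducing a long context to a handful of context embeddings.

```python
import numpy as np

def compress_context(token_embeddings: np.ndarray, num_ctx_embeddings: int) -> np.ndarray:
    """Toy compressor: pool a long context (n_tokens x dim) into a few vectors.

    COCOM uses a learned compressor trained jointly with the LLM; mean-pooling
    here merely illustrates the shape change, not the actual method.
    """
    # Split the token sequence into num_ctx_embeddings chunks and mean-pool each.
    chunks = np.array_split(token_embeddings, num_ctx_embeddings, axis=0)
    return np.stack([chunk.mean(axis=0) for chunk in chunks])

# A 512-token retrieved context with 64-dim embeddings, compressed to 4 vectors:
# the decoder now attends over 4 context embeddings instead of 512 tokens,
# which is where the decoding speed-up comes from.
ctx = np.random.rand(512, 64)
compressed = compress_context(ctx, 4)
print(compressed.shape)  # (4, 64)
```

Varying `num_ctx_embeddings` corresponds to the compression rates the abstract mentions: fewer context embeddings mean faster decoding but less of the retrieved information is preserved.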
