From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions
| Authors | |
|---|---|
| Publication date | 2025 |
| Host editors | |
| Book title | The 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025): proceedings of the conference |
| Book subtitle | ACL 2025: July 27-August 1, 2025 |
| ISBN (electronic) | |
| Event | 63rd Annual Meeting of the Association for Computational Linguistics |
| Volume | |
| Issue number | 1 |
| Pages (from-to) | 19609-19642 |
| Publisher | Kerrville, TX: Association for Computational Linguistics |
| Organisations | |
| Abstract | Large Language Models (LLMs) are increasingly used in working environments for a wide range of tasks, excelling at solving individual problems in isolation. However, are they also able to effectively collaborate over long-term interactions? To investigate this, we introduce MemoryCode, a synthetic multi-session dataset designed to test LLMs’ ability to track and execute simple coding instructions amid irrelevant information, simulating a realistic setting. While all the models we tested handle isolated instructions well, even the performance of state-of-the-art models like GPT-4o deteriorates when instructions are spread across sessions. Our analysis suggests this is due to their failure to retrieve and integrate information over long interaction chains. Our results highlight a fundamental limitation of current LLMs, restricting their ability to collaborate effectively in long interactions. |
| Document type | Conference contribution |
| Language | English |
| Published at | https://doi.org/10.18653/v1/2025.acl-long.964 |
| Downloads | 2025.acl-long.964 (Final published version) |
| Permalink to this page | |
