From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions
| Authors | |
|---|---|
| Publication date | 2025 |
| Host editors | |
| Book title | The 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025): proceedings of the conference |
| Book subtitle | ACL 2025: July 27-August 1, 2025 |
| ISBN (electronic) | |
| Event | 63rd Annual Meeting of the Association for Computational Linguistics |
| Volume | |
| Issue number | 1 |
| Pages (from-to) | 19609-19642 |
| Publisher | Kerrville, TX: Association for Computational Linguistics |
| Organisations | |
| Abstract | Large Language Models (LLMs) are increasingly used in working environments for a wide range of tasks, excelling at solving individual problems in isolation. However, are they also able to effectively collaborate over long-term interactions? To investigate this, we introduce MemoryCode, a synthetic multi-session dataset designed to test LLMs’ ability to track and execute simple coding instructions amid irrelevant information, simulating a realistic setting. While all the models we tested handle isolated instructions well, even the performance of state-of-the-art models like GPT-4o deteriorates when instructions are spread across sessions. Our analysis suggests this is due to their failure to retrieve and integrate information over long interaction chains. Our results highlight a fundamental limitation of current LLMs, restricting their ability to collaborate effectively in long interactions. |
| Document type | Conference contribution |
| Language | English |
| Published at | https://doi.org/10.18653/v1/2025.acl-long.964 |
| Downloads | 2025.acl-long.964 (Final published version) |
| Permalink to this page | |
