Assessing Reliability in AI-Powered Learning Systems with A/A Tests
| Authors | |
|---|---|
| Publication date | 2025 |
| Book title | L@S '25 |
| Book subtitle | Proceedings of the Twelfth ACM Conference on Learning @ Scale : July 21-23, 2025, Palermo, Italy |
| ISBN (electronic) | |
| Event | 12th ACM Conference on Learning @ Scale, L@S 2025 |
| Pages (from-to) | 13-23 |
| Number of pages | 11 |
| Publisher | New York, NY: Association for Computing Machinery |
| Organisations | |
| Abstract | The rapid evolution of Artificial Intelligence (AI) has expanded access to large-scale online adaptive learning systems. Such AI-powered systems strive to deliver personalized learning experiences, often by means of advanced algorithms that continuously model learner behavior. Ensuring the reliability of these systems is fundamental; otherwise, their ability to optimize individual learning paths and inform decision-making is undermined. For such systems to be trusted, learners with identical profiles should follow highly similar learning trajectories. But how can we evaluate the reliability of these dynamic learning environments, especially in systems that are continuously developed and updated? This paper demonstrates the effectiveness of A/A testing (large-scale double-blind experiments with identical conditions) in systematically evaluating the reliability of AI-powered learning environments. We illustrate this by assessing the reliability of student model parameters in a large-scale online arithmetic learning platform that is driven by a well-studied and powerful explainable AI algorithm. We duplicated the item bank of a newly developed game and randomly assigned 50% of the players to one of two identical versions, which were launched simultaneously in the live environment. We then analyzed the reliability of item difficulty convergence, the stability of student ability estimates in the new game, their relationship to ability estimates from other arithmetic games, and patterns in student errors. Our results indicate that the student model parameters are stable across the two variants, highlighting A/A testing as a valuable tool for assessing the reliability of large-scale AI-powered learning systems. We discuss its advantages and suggest future directions for adapting the approach, while considering its relevance in dynamic learning environments. |
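The abstract's randomized 50/50 split between two identical variants can be sketched as deterministic hash-based assignment, a common way to run such experiments. This is an illustrative sketch, not the paper's implementation; the function and experiment names are assumptions.

```python
import hashlib

def assign_variant(player_id: str, experiment: str = "aa-test") -> str:
    """Deterministically assign a player to arm A or B (50/50 split).

    Hashing the (experiment, player) pair means identical players always
    land in the same arm, which is what an A/A test requires. The names
    here are hypothetical, not from the paper.
    """
    digest = hashlib.sha256(f"{experiment}:{player_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# Assign a cohort of players and check that the split is roughly balanced.
players = [f"player-{i}" for i in range(10_000)]
arms = [assign_variant(p) for p in players]
share_a = arms.count("A") / len(arms)
print(f"share assigned to A: {share_a:.3f}")
```

In an A/A test both arms serve the same item bank, so any systematic difference in estimated item difficulties or student abilities between the arms signals unreliability rather than a treatment effect.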
| Document type | Conference contribution |
| Note | With supplementary video |
| Language | English |
| Published at | https://doi.org/10.1145/3698205.3729553 |
| Other links | https://www.scopus.com/pages/publications/105013071212 |
| Downloads | 3698205.3729553 (Final published version) |
| Supplementary materials | |
| Permalink to this page | |
