V-MAX: tempered optimism for better PAC reinforcement learning

Authors
Publication date 2012
Book title AAMAS 2012: Proceedings of the Eleventh International Joint Conference on Autonomous Agents and Multi-Agent Systems
ISBN
  • 9780981738116
Event The Eleventh International Joint Conference on Autonomous Agents and Multi-Agent Systems
Pages (from-to) 375-382
Publisher AAMAS
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
Recent advances in reinforcement learning have yielded several PAC-MDP algorithms that, using the principle of optimism in the face of uncertainty, are guaranteed to act near-optimally with high probability on all but a polynomial number of samples. Unfortunately, many of these algorithms, such as R-MAX, perform poorly in practice because their initial exploration in each state, before the associated model parameters have been learned with confidence, is random. Others, such as Model-Based Interval Estimation (MBIE), have weaker sample complexity bounds and require careful parameter tuning. This paper proposes a new PAC-MDP algorithm called V-MAX designed to address these problems. By restricting its optimism to future visits, V-MAX can exploit its experience early in learning and thus obtain more cumulative reward than R-MAX. Furthermore, doing so does not compromise the quality of exploration, as we prove bounds on the sample complexity of V-MAX that are identical to those of R-MAX. Finally, we present empirical results in two domains demonstrating that V-MAX can substantially outperform R-MAX and match or outperform MBIE while being easier to tune, as its performance is invariant to conservative choices of its primary parameter.
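The contrast the abstract draws can be sketched in a few lines. The snippet below is an illustrative simplification, not the paper's actual update rule: `rmax_value` mimics R-MAX's all-or-nothing optimism (a fully optimistic value until a state-action pair has been sampled `m` times), while `vmax_value` shows a V-MAX-style blend that is optimistic only about the `m - n` visits not yet observed, so the `n` samples already gathered are exploited immediately. The names `m`, `v_max`, and the example returns are hypothetical.

```python
def rmax_value(returns, m, v_max):
    """R-MAX-style: fully optimistic until a state-action has m samples."""
    n = len(returns)
    if n < m:
        return v_max                 # below the known-ness threshold: ignore data
    return sum(returns) / n          # afterwards: plain empirical estimate

def vmax_value(returns, m, v_max):
    """V-MAX-style sketch: optimism restricted to future (unseen) visits."""
    n = len(returns)
    if n >= m:
        return sum(returns) / n
    # Observed returns count at face value; only the m - n remaining
    # visits are assumed to achieve the optimistic ceiling v_max.
    return (sum(returns) + (m - n) * v_max) / m

returns = [1.0, 2.0]   # two observed discounted returns (made-up numbers)
m, v_max = 5, 10.0     # known-ness threshold and optimistic value ceiling
print(rmax_value(returns, m, v_max))   # 10.0: early experience is discarded
print(vmax_value(returns, m, v_max))   # 6.6: the two samples temper the optimism
```

Both estimates remain optimistic before the threshold is reached (6.6 still exceeds the empirical mean of 1.5), which is how, per the abstract, V-MAX can exploit early experience without weakening its exploration guarantees.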
Document type Conference contribution
Language English