V-MAX: tempered optimism for better PAC reinforcement learning
| Field | Value |
|---|---|
| Authors | |
| Publication date | 2012 |
| Book title | AAMAS 2012: Proceedings of the Eleventh International Joint Conference on Autonomous Agents and Multi-Agent Systems |
| ISBN | |
| Event | The Eleventh International Joint Conference on Autonomous Agents and Multi-Agent Systems |
| Pages (from-to) | 375-382 |
| Publisher | AAMAS |
| Organisations | |
| Abstract | Recent advances in reinforcement learning have yielded several PAC-MDP algorithms that, using the principle of optimism in the face of uncertainty, are guaranteed to act near-optimally with high probability on all but a polynomial number of samples. Unfortunately, many of these algorithms, such as R-MAX, perform poorly in practice because their initial exploration in each state, before the associated model parameters have been learned with confidence, is random. Others, such as Model-Based Interval Estimation (MBIE), have weaker sample complexity bounds and require careful parameter tuning. This paper proposes a new PAC-MDP algorithm called V-MAX designed to address these problems. By restricting its optimism to future visits, V-MAX can exploit its experience early in learning and thus obtain more cumulative reward than R-MAX. Furthermore, doing so does not compromise the quality of exploration, as we prove bounds on the sample complexity of V-MAX that are identical to those of R-MAX. Finally, we present empirical results in two domains demonstrating that V-MAX can substantially outperform R-MAX and match or outperform MBIE while being easier to tune, as its performance is invariant to conservative choices of its primary parameter. |
| Document type | Conference contribution |
| Language | English |
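To make the "optimism in the face of uncertainty" idea concrete, the sketch below shows the R-MAX-style scheme the abstract contrasts against: state-action pairs visited fewer than `m` times are treated as maximally rewarding (value `r_max / (1 - gamma)`), which drives the agent to explore them. This is an illustrative toy, not the authors' implementation; the function name, the tiny MDP, and all parameters (`m`, `r_max`, `gamma`) are made up here. V-MAX's refinement, per the abstract, is to restrict this optimism to future visits so that already-gathered experience can be exploited earlier.

```python
# Hedged sketch of R-MAX-style optimistic planning (not the paper's code).
# Under-sampled (s, a) pairs get the maximum possible value, so a greedy
# policy over Q is drawn toward unexplored parts of the MDP.
import numpy as np

def optimistic_values(counts, R, P, m, r_max, gamma, iters=200):
    """Value iteration with full optimism for "unknown" pairs.

    counts[s, a]  : visit counts per state-action pair
    R[s, a]       : estimated mean reward
    P[s, a, s']   : estimated transition probabilities
    m             : known-ness threshold (R-MAX's primary parameter)
    """
    v_max = r_max / (1.0 - gamma)      # optimistic value ceiling
    V = np.zeros(counts.shape[0])
    for _ in range(iters):
        Q = R + gamma * P @ V          # model-based Bellman backup
        Q[counts < m] = v_max          # R-MAX: unknown pairs look maximally good
        V = Q.max(axis=1)
    return V, Q

# Tiny 2-state, 2-action example where the whole model is "known":
counts = np.full((2, 2), 10)
R = np.array([[0.0, 1.0], [0.5, 0.0]])
P = np.zeros((2, 2, 2)); P[:, :, 0] = 1.0   # every action leads back to state 0
V, Q = optimistic_values(counts, R, P, m=5, r_max=1.0, gamma=0.9)
```

With all counts above `m`, no optimism is applied and the backup converges to the greedy values of the estimated model; dropping any count below `m` snaps that pair's Q-value to the ceiling, which is exactly what makes R-MAX's early behaviour exploratory (and, as the abstract notes, initially random within unknown states).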