Safe, efficient and robust reinforcement learning for ranking and diffusion models

Open Access
Award date 13-10-2025
ISBN
  • 9789493483033
Number of pages 117
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
This dissertation investigates how reinforcement learning (RL) methods can be designed to be safe, sample-efficient, and robust. Framed through the unifying perspective of contextual-bandit RL, the work addresses two major application domains: ranking and recommendation, and text-to-image diffusion models.
The first part of the thesis develops theory and algorithms for safe deployment in ranking systems. An exposure-based generalisation bound is derived, leading to a counterfactual risk-minimisation objective whose solution is guaranteed not to underperform the logging policy, even with sparse feedback. This guarantee is extended to doubly robust estimators, enabling safety even under adversarial or misspecified user models and offering practitioners explicit control over permissible utility loss.
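
As an illustration of the kind of guarantee involved, a high-confidence counterfactual objective can be written as below; this is a generic sketch, not the thesis's exact exposure-based bound, and the confidence term \mathrm{CB}_{\delta} is a placeholder symbol introduced here:

    \pi^{+} = \arg\max_{\pi}\ \hat{U}(\pi) - \lambda\,\mathrm{CB}_{\delta}(\pi, \pi_0),
    \qquad
    \Pr\big[\, U(\pi^{+}) \ge U(\pi_0) - \epsilon \,\big] \ge 1 - \delta,

where \pi_0 is the logging policy and \hat{U} a counterfactual utility estimate; maximising the penalised estimate rather than the raw one is what rules out policies whose apparent gains are statistical artefacts of sparse feedback.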
The second part turns to single-action bandits, where various off-policy estimators are unified within a baseline-correction framework. A closed-form optimal baseline is proposed and shown to minimise both evaluation and policy-gradient variance, thereby improving off-policy learning reliability.
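
In the single-action setting, the baseline-corrected family referred to here can be sketched as follows (an illustrative form with importance weights w_i = \pi(a_i \mid x_i)/\pi_0(a_i \mid x_i); the closed form for \beta^{*} is the standard variance minimiser of this particular estimator, stated as an assumption rather than the thesis's exact result):

    \hat{V}_{\beta}(\pi) = \frac{1}{N} \sum_{i=1}^{N} \big[\, w_i\,(r_i - \beta) + \beta \,\big],
    \qquad
    \beta^{*} = \arg\min_{\beta}\ \mathrm{Var}\big[\hat{V}_{\beta}(\pi)\big] = \frac{\mathrm{Cov}(w r,\; w)}{\mathrm{Var}(w)}.

Setting \beta = 0 recovers plain inverse-propensity scoring, and other fixed choices of \beta recover familiar control-variate estimators, which is what makes a single baseline parameter sufficient to unify them.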
The final part examines efficiency–effectiveness trade-offs in generative RL. A systematic study of Proximal Policy Optimisation (PPO) and REINFORCE motivates the Leave-One-Out PPO (LOOP) algorithm, which combines multiple diffusion trajectories with a REINFORCE-style baseline inside PPO’s clipped objective. LOOP achieves PPO-level sample efficiency while producing generations that align more faithfully with textual attributes.
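
A minimal sketch of the LOOP idea in PyTorch, assuming K trajectories are sampled per prompt; the function name loop_surrogate, the tensor shapes, and the default clip range are illustrative assumptions, not the authors' implementation:

    import torch

    def loop_surrogate(log_probs, old_log_probs, rewards, clip_eps=0.2):
        # log_probs / old_log_probs: (K,) trajectory log-likelihoods under the
        # current and behaviour policies; rewards: (K,) scalar rewards for K
        # diffusion trajectories generated from the same prompt.
        k = rewards.shape[0]
        # REINFORCE-style leave-one-out baseline: for each trajectory, the
        # mean reward of the other K-1 trajectories from the same prompt.
        baseline = (rewards.sum() - rewards) / (k - 1)
        advantages = rewards - baseline
        # PPO's clipped surrogate applied to the trajectory importance ratio.
        ratio = torch.exp(log_probs - old_log_probs)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        return -torch.min(unclipped, clipped).mean()

The leave-one-out baseline plays the role of PPO's learned value function, so variance reduction comes from the extra trajectories rather than from training a critic.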
Overall, the dissertation demonstrates that (i) ranking policies can be deployed safely without strong assumptions, (ii) variance reduction in bandits can be unified through a single baseline, and (iii) simple RL modifications can efficiently fine-tune diffusion models. These results establish a principled foundation for safe and data-efficient RL in both information access and generative AI.
Document type PhD thesis
Language English