Safe, efficient and robust reinforcement learning for ranking and diffusion models

Open Access
Award date 13-10-2025
ISBN
  • 9789493483033
Number of pages 117
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
This dissertation investigates how reinforcement learning (RL) methods can be designed to be safe, sample-efficient, and robust. Framed through the unifying perspective of contextual-bandit RL, the work addresses two major application domains: ranking and recommendation, and text-to-image diffusion models.
The first part of the thesis develops theory and algorithms for safe deployment in ranking systems. An exposure-based generalisation bound is derived, leading to a counterfactual risk-minimisation objective whose solution is guaranteed not to underperform the logging policy, even with sparse feedback. This guarantee is extended to doubly robust estimators, enabling safety even under adversarial or misspecified user models and offering practitioners explicit control over permissible utility loss.
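
As an illustration of the kind of guarantee involved, a high-confidence counterfactual objective can be written as below; this is a generic sketch, not the thesis's exact exposure-based bound, and the confidence term \mathrm{CB}_{\delta} is a placeholder symbol introduced here:

    \pi^{+} = \arg\max_{\pi}\ \hat{U}(\pi) - \lambda\,\mathrm{CB}_{\delta}(\pi, \pi_0),
    \qquad
    \Pr\big[\, U(\pi^{+}) \ge U(\pi_0) - \epsilon \,\big] \ge 1 - \delta,

where \pi_0 is the logging policy and \hat{U} a counterfactual utility estimate; maximising the penalised estimate rather than the raw one is what rules out policies whose apparent gains are statistical artefacts of sparse feedback.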
The second part turns to single-action bandits, where various off-policy estimators are unified within a baseline-correction framework. A closed-form optimal baseline is proposed and shown to minimise both evaluation and policy-gradient variance, thereby improving off-policy learning reliability.
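
In the single-action setting, the baseline-corrected family referred to here can be sketched as follows (an illustrative form with importance weights w_i = \pi(a_i \mid x_i)/\pi_0(a_i \mid x_i); the closed form for \beta^{*} is the standard variance minimiser of this particular estimator, stated as an assumption rather than the thesis's exact result):

    \hat{V}_{\beta}(\pi) = \frac{1}{N} \sum_{i=1}^{N} \big[\, w_i\,(r_i - \beta) + \beta \,\big],
    \qquad
    \beta^{*} = \arg\min_{\beta}\ \mathrm{Var}\big[\hat{V}_{\beta}(\pi)\big] = \frac{\mathrm{Cov}(w r,\; w)}{\mathrm{Var}(w)}.

Setting \beta = 0 recovers plain inverse-propensity scoring, and other fixed choices of \beta recover familiar control-variate estimators, which is what makes a single baseline parameter sufficient to unify them.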
The final part examines efficiency–effectiveness trade-offs in generative RL. A systematic study of Proximal Policy Optimisation (PPO) and REINFORCE motivates the Leave-One-Out PPO (LOOP) algorithm, which combines multiple diffusion trajectories with a REINFORCE-style baseline inside PPO’s clipped objective. LOOP achieves PPO-level sample efficiency while producing generations that align more faithfully with textual attributes.
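
A minimal sketch of the LOOP idea in PyTorch, assuming K trajectories are sampled per prompt; the function name loop_surrogate, the tensor shapes, and the default clip range are illustrative assumptions, not the authors' implementation:

    import torch

    def loop_surrogate(log_probs, old_log_probs, rewards, clip_eps=0.2):
        # log_probs / old_log_probs: (K,) trajectory log-likelihoods under the
        # current and behaviour policies; rewards: (K,) scalar rewards for K
        # diffusion trajectories generated from the same prompt.
        k = rewards.shape[0]
        # REINFORCE-style leave-one-out baseline: for each trajectory, the
        # mean reward of the other K-1 trajectories from the same prompt.
        baseline = (rewards.sum() - rewards) / (k - 1)
        advantages = rewards - baseline
        # PPO's clipped surrogate applied to the trajectory importance ratio.
        ratio = torch.exp(log_probs - old_log_probs)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        return -torch.min(unclipped, clipped).mean()

The leave-one-out baseline plays the role of PPO's learned value function, so variance reduction comes from the extra trajectories rather than from training a critic.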
Overall, the dissertation demonstrates that (i) ranking policies can be deployed safely without strong assumptions, (ii) variance reduction in bandits can be unified through a single baseline, and (iii) simple RL modifications can efficiently fine-tune diffusion models. These results establish a principled foundation for safe and data-efficient RL in both information access and generative AI.
Document type PhD thesis
Language English