Mathematical developments in abstract information theory and safe reward learning
| Authors | |
|---|---|
| Supervisors | |
| Cosupervisors | |
| Award date | 02-02-2026 |
| ISBN | |
| Number of pages | 426 |
| Organisations | |
| Abstract | Part I of this two-part thesis develops generalizations of Shannon's information theory, with the goal of finding the broadest abstract setting in which its fundamental results can be proved. Chapter 1 replaces Shannon entropy with general functions on commutative, idempotent monoids that satisfy a chain rule (illustrated in the Shannon case below). This leads to a natural generalization of Hu's theorem and of the classical Venn-diagram visualizations of entropy, mutual information, and interaction information, which we apply to Kullback-Leibler divergence, Kolmogorov complexity, and other settings. Chapter 2 develops a theory of abstract Markov random fields in terms of their effects on information diagrams. In a special case, we interpret the second law of thermodynamics using a Kullback-Leibler diagram. Part II develops theoretical foundations for improving the safety of AI systems, with the aim of decreasing the risks posed by future advanced AI. The contributions focus on the safe optimization and learning of reward models. Chapter 3 proves that even if a learned reward model has low error on its training data, the distribution shift caused by policy optimization can lead to significant regret (a toy numerical sketch appears below). Chapter 4 theoretically analyzes the learning of reward models from feedback provided by humans who only partially observe the AI's behavior; we show that partial observability can cause the AI to engage in deceptive and overjustifying behavior, even in the absence of any additional approximation error. Building on this work, Chapter 5 proposes modeling humans' imperfect beliefs about the AI's behavior. We find theoretical conditions under which such belief models allow us to infer the true reward function, and we outline how to accomplish this in practice. |
| Document type | PhD thesis |
| Language | English |
| Downloads | |
| Permalink to this page | |
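
As a concrete anchor for the chain rule and the Venn-diagram picture mentioned in the abstract, here is a minimal sketch of the classical Shannon special case. The thesis works in a far more general monoid setting; this illustration only shows the standard identities being generalized, not the thesis's own formalism:

```latex
% Classical Shannon identities that Chapter 1 abstracts away from.
% Chain rule: joint entropy splits into marginal plus conditional entropy.
\begin{align*}
  H(X, Y) &= H(X) + H(Y \mid X) \\
  % Inclusion-exclusion identity behind the Venn-diagram visualization;
  % Hu's theorem makes this set-theoretic analogy precise.
  I(X; Y) &= H(X) + H(Y) - H(X, Y)
\end{align*}
```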

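The claim attributed to Chapter 3, that low training error does not preclude high regret after policy optimization, can be illustrated with a deliberately simple numerical toy model. The example below is hypothetical: the reward values, training distribution, and variable names are invented for illustration and are not taken from the thesis.

```python
# Hypothetical toy example: a learned reward model with low error on the
# training distribution can still induce high regret once a policy
# optimizes against it, because optimization shifts probability mass
# toward exactly the inputs where the model is wrong.
import numpy as np

true_reward    = np.array([1.0, 0.0])   # action 0 is genuinely good
learned_reward = np.array([1.0, 10.0])  # model overestimates action 1

# The training distribution rarely visits action 1, so training error is tiny.
train_dist  = np.array([0.99, 0.01])
train_error = np.sum(train_dist * np.abs(learned_reward - true_reward))
print(f"expected training error: {train_error:.2f}")    # 0.10

# A policy optimized against the learned reward picks action 1...
optimized_action = int(np.argmax(learned_reward))        # action 1
best_action      = int(np.argmax(true_reward))           # action 0

# ...so the regret under the true reward is large despite low training error.
regret = true_reward[best_action] - true_reward[optimized_action]
print(f"regret of the optimized policy: {regret:.2f}")   # 1.00
```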