Aligning language models with human understanding and behavior

Open Access
Authors
Supervisors
Co-supervisors
  • Z. Ren
Award date 03-07-2025
ISBN
  • 9789465221694
Number of pages 114
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
Language models (LMs) have achieved impressive progress in natural language processing, yet they remain misaligned with human understanding and behavior, limiting their effectiveness in real-world applications. This thesis addresses these challenges by investigating LM alignment from two perspectives: aligning model understanding with that of humans, and aligning model behavior with that of humans. Specifically, we explore four key themes: (i) aligning understanding via debiased representation learning, (ii) aligning behavior via strong-to-weak learning, (iii) aligning behavior via weak-to-strong learning, and (iv) aligning behavior via test-time behavior reflection.
We begin by addressing representational alignment during fine-tuning, proposing a framework that reduces biased latent features and captures their dynamic influence, thereby improving out-of-distribution generalization. Then, in the strong-to-weak learning setting, we develop behavior alignment methods to improve completeness, factuality, and logical consistency in knowledge-intensive tasks, leveraging both fine-grained and coarse-grained knowledge signals. Next, we study the weak-to-strong alignment scenario, where stronger LMs must learn from weaker human supervision. To this end, we introduce an iterative preference optimization strategy that facilitates mutual learning between weak teachers and strong students. Finally, we focus on aligning behavior at inference time by mitigating cognitive biases in LM decision-making. We propose a method that follows three sequential steps (bias determination, bias analysis, and cognitive debiasing) to iteratively reduce potential cognitive biases in prompts.
Document type PhD thesis
Language English