Towards language models that benefit us all: Studies on stereotypes, robustness, and values

Open Access
Award date 29-09-2025
ISBN
  • 9789464738933
Number of pages 355
Organisations
  • Faculty of Humanities (FGw) - Amsterdam Institute for Humanities Research (AIHR)
  • Faculty of Science (FNWI)
  • Interfacultary Research - Institute for Logic, Language and Computation (ILLC)
Abstract
As Large Language Models (LLMs) have evolved from single-task solvers into general-purpose chat engines, demarcating their capabilities and harms poses a significant challenge. Systematic investigation of both is the cornerstone of well-informed policy and technological advancement. In this dissertation, we study stereotypes, robustness, and values in LLMs, drawing on insights from search engine studies, linguistics, formal semantics, logic, and philosophy. In Part One, we investigate stereotyping harms in Natural Language Processing systems, namely search autocomplete engines and LLMs, and find uneven safety behaviour across a diverse set of social groups in both cases. These findings lead us, in Part Two, to investigate variability in LLM behaviour more broadly, studying the robustness of LLM capabilities across tasks and for reasoning in particular. Based on our findings, we chart a path towards more holistic evaluation practices for the field of Natural Language Processing. In Part Three, we take steps towards aligning LLMs so that they represent a variety of social groups and speakers of different languages. First, we collect and annotate a multilingual dataset to assess LLM agreement with values across languages. Second, we develop a direct alignment approach that improves the robustness of alignment across demographics and languages.
Document type PhD thesis
Language English