Learning from context with multimodal foundation models
| Authors | |
|---|---|
| Supervisors | |
| Cosupervisors | |
| Award date | 21-11-2025 |
| ISBN | |
| Number of pages | 127 |
| Organisations | |
| Abstract | This thesis investigates how multimodal foundation models can learn from context to enhance understanding, generation, and alignment across vision and language. By leveraging contextual cues across modalities, we introduce methods that improve adaptability and performance in diverse multimodal settings. In the first chapter, we address learning from few-shot examples with frozen vision and language backbones. We introduce a meta-learning framework that bridges the vision and language domains to enable fast adaptation and knowledge transfer across multimodal few-shot tasks. The second chapter focuses on in-context image generation and presents Context Diffusion, a diffusion-based framework that learns directly from visual examples provided in context. Unlike prior approaches that rely heavily on textual prompts, Context Diffusion generates high-quality, contextually faithful images given visual, textual, or combined inputs. In the third chapter, we study contrastive vision-language models such as CLIP and their reliance on a fixed context length. We propose TULIP, a method that incorporates relative position encodings and distills knowledge from the original CLIP text encoder to enable processing of captions of arbitrary length. This leads to significant improvements in long-caption retrieval and image generation tasks. Finally, the last chapter explores long-caption generation, focusing on medical imaging reports. We introduce variational topic inference, a framework that captures sentence-level topic diversity and produces coherent, contextually grounded reports aligned with image semantics. Together, these contributions advance learning from context, enabling multimodal foundation models to better understand, generate, and communicate across modalities. |
| Document type | PhD thesis |
| Language | English |
