From Content-Based Image and Video Retrieval With Relevance Feedback to Multimodal Large Language Models and Beyond: A Journey Through Modalities and Semantic Levels

Stevan Rudinac

doi:https://doi.org/10.1145/3746263.3764110

Authors	Stevan Rudinac
Publication date	2025
Book title	MA-LLM '25
Book subtitle	Proceedings of the ACM MM 2025 Workshop on Multimedia Analytics with Multimodal Large Language Models: October 27-31, 2025, Dublin, Ireland
ISBN (electronic)	9798400720451
Event	ACM MM 2025 Workshop on Multimedia Analytics with Multimodal Large Language Models
Pages (from-to)	13
Number of pages	1
Publisher	New York, NY: Association for Computing Machinery
Organisations	Faculty of Economics and Business (FEB) - Amsterdam Business School Research Institute (ABS-RI)
Abstract	The multimedia community has come a long way from early content-based image and video retrieval systems, which often supported interactive search, exploration, and learning through relevance feedback and similar techniques, to modern foundation models such as multimodal large language models (MLLMs). Over that period, multimedia analysis techniques have improved dramatically, maturing from the low semantic levels of colours, textures, and shapes, through semantic concepts, actions, and events, to increasingly abstract and complex multimedia content understanding. With the advent of foundation models, traditional search and recommendation paradigms have increasingly been replaced by multimodal multi-turn conversational search. In this keynote, I will reflect on this journey by showcasing representative works, including those by our team, and argue that the trends surrounding modern MLLMs are the product of gradual evolution rather than revolution. In addition, I will illustrate how MLLMs can be applied to advance business and society by injecting domain knowledge from other disciplines and facilitating improved interaction with the user. Finally, I will share some thoughts on the potential of recent research developments that aim to improve performance while significantly reducing the size, complexity, and cost of MLLMs.
Document type	Conference contribution
Note	Keynote Talk
Language	English
Published at	https://doi.org/10.1145/3746263.3764110 (Final published version)
Downloads	3746263.3764110 (Final published version)