Video-efficient foundation models
| Authors | |
|---|---|
| Supervisors | |
| Cosupervisors | |
| Award date | 08-12-2023 |
| Number of pages | 136 |
| Organisations | |
| Abstract | This thesis aims to make video understanding video-efficient by addressing the research question "What enables video-efficient video foundation models?" Video-efficiency means developing video foundation models that are not only accurate but also label-efficient, i.e., they require fewer labels; domain-efficient, i.e., they apply to a variety of video learning scenarios; and data-efficient, i.e., they reduce the amount of video data needed for learning. The research question is addressed for both RGB and non-RGB video modalities. In Chapter 2, we focus on improving the label- and domain-efficiency of non-RGB action recognition and detection. Chapter 3 introduces a new self-supervised approach for learning feature representations of 3D-skeleton video sequences. In Chapter 4, we conduct a large-scale study of existing RGB-based self-supervised video models to assess their performance across different facets of video-efficiency. Chapter 5 presents a new method for video self-supervision that explicitly aims to learn motion-focused video representations. To summarize, this thesis presents several novel approaches to improve the video-efficiency of video foundation models. Our research highlights the importance of transferring knowledge between RGB and non-RGB video modalities, exploring self-supervision for non-RGB video modeling, analyzing self-supervised models beyond canonical setups, and carefully designing new self-supervised tasks to develop video foundation models that exhibit different facets of video-efficiency. We hope that our work will inspire further research and development in this area, leading to even more video-efficient foundation models. |
| Document type | PhD thesis |
| Language | English |
