SIGMA: Sinkhorn-Guided Masked Video Modeling

S. Salehi; M. Dorkenwald; F.M. Thoker; E. Gavves; C.G.M. Snoek; Y.M. Asano

doi:https://doi.org/10.1007/978-3-031-72691-0_17

Authors	S. Salehi; M. Dorkenwald; F.M. Thoker; E. Gavves; C.G.M. Snoek; Y.M. Asano
Publication date	2025
Host editors	A. Leonardis; E. Ricci; S. Roth; O. Russakovsky; T. Sattler; G. Varol
Book title	Computer Vision – ECCV 2024
Book subtitle	18th European Conference, Milan, Italy, September 29–October 4, 2024 : proceedings
ISBN	9783031726903
ISBN (electronic)	9783031726910
Series	Lecture Notes in Computer Science
Event	The 18th European Conference on Computer Vision ECCV 2024
Volume | Issue number	XXIV
Pages (from-to)	293-312
Publisher	Cham: Springer
Organisations	Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract	Video-based pretraining offers immense potential for learning strong visual representations on an unprecedented scale. Recently, masked video modeling methods have shown promising scalability, yet fall short in capturing higher-level semantics due to reconstructing predefined low-level targets such as pixels. To tackle this, we present Sinkhorn-guided Masked Video Modelling (SIGMA), a novel video pretraining method that jointly learns the video model in addition to a target feature space using a projection network. However, this simple modification means that the regular L2 reconstruction loss will lead to trivial solutions as both networks are jointly optimized. As a solution, we distribute features of space-time tubes evenly across a limited number of learnable clusters. By posing this as an optimal transport problem, we enforce high entropy in the generated features across the batch, infusing semantic and temporal meaning into the feature space. The resulting cluster assignments are used as targets for a symmetric prediction task where the video model predicts the cluster assignments of the projection network and vice versa. Experimental results on ten datasets across three benchmarks validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations, improving upon state-of-the-art methods. Our project website with code is available at: https://quva-lab.github.io/SIGMA.
Document type	Conference contribution
Language	English
Published at	https://doi.org/10.1007/978-3-031-72691-0_17 (Final published version)
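The abstract describes spreading features evenly over a limited set of clusters via optimal transport and then using the resulting assignments as targets in a symmetric prediction task. The following is a minimal sketch of that idea in plain Python, not the authors' implementation: it uses a generic Sinkhorn-Knopp normalization on feature-to-cluster similarity scores. The function names, the entropy parameter `epsilon=0.05`, the three iterations, and the softmax `temperature=0.1` are all illustrative assumptions rather than values taken from the paper.

```python
import math

def sinkhorn_assign(scores, epsilon=0.05, n_iters=3):
    """Soft cluster assignments with roughly balanced cluster usage.

    scores: list of rows, one per feature, holding similarity to each cluster.
    epsilon and n_iters are assumed values, not taken from the paper.
    """
    n, k = len(scores), len(scores[0])
    # Subtract the global maximum before exponentiating, for stability.
    top = max(max(row) for row in scores)
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: each cluster receives total mass 1/k.
        col = [sum(row[j] for row in q) for j in range(k)]
        q = [[row[j] / (col[j] * k) for j in range(k)] for row in q]
        # Row step: each feature carries total mass 1/n.
        q = [[v / (sum(row) * n) for v in row] for row in q]
    # Rescale so every row is a probability distribution over clusters.
    return [[v * n for v in row] for row in q]

def softmax(row, temperature=0.1):
    top = max(row)
    exps = [math.exp((s - top) / temperature) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, pred):
    return -sum(t * math.log(p) for t, p in zip(target, pred))

def symmetric_loss(scores_model, scores_proj):
    """Each branch predicts the balanced assignments of the other branch.

    In a real training loop the targets would be treated as constants
    (no gradient); this sketch only computes the loss value.
    """
    targets_model = sinkhorn_assign(scores_model)
    targets_proj = sinkhorn_assign(scores_proj)
    total = 0.0
    for i in range(len(scores_model)):
        total += cross_entropy(targets_proj[i], softmax(scores_model[i]))
        total += cross_entropy(targets_model[i], softmax(scores_proj[i]))
    return total / len(scores_model)

# Hypothetical scores: every feature prefers cluster 0 before balancing.
scores = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.7, 0.5]]
assignments = sinkhorn_assign(scores)
```

Although every row of `scores` favors cluster 0, the balancing step moves a substantial share of the batch to cluster 1, which is the mechanism that keeps jointly trained networks from collapsing onto a single trivial target.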