Teaching a New Dog Old Tricks: Contrastive Random Walks in Videos with Unsupervised Priors
| Authors | |
|---|---|
| Publication date | 2022 |
| Book title | ICMR '22: Proceedings of the 2022 International Conference on Multimedia Retrieval, June 27-30, 2022, Newark, NJ, USA |
| ISBN (electronic) | |
| Event | 2022 International Conference on Multimedia Retrieval |
| Pages (from-to) | 176-184 |
| Publisher | New York, NY: The Association for Computing Machinery |
| Organisations | |
| Abstract | This paper focuses on self-supervised representation learning in videos with guidance from multimodal priors. While the temporal dimension is commonly used as a supervision proxy for learning frame-level or clip-level representations, a number of recent works learn local representations in space and time through cycle-consistency. Given a starting patch, the contrastive goal is to track the patch through subsequent frames and then backtrack to the original frame, with the starting patch as the target. While effective for downstream tasks such as segmentation and body-joint propagation, affinities between patches must be learned from scratch. This setup not only requires many videos for self-supervised optimization, but also fails when using smaller patches and more connections between consecutive frames. On the other hand, multiple generic cues from multiple modalities provide valuable information about how patches should propagate in videos, from saliency and optical flow to photometric center biases. To that end, we introduce Guided Contrastive Random Walks. The main idea is to employ well-known multimodal priors to provide fixed prior affinities. We outline a general framework in which prior affinities are combined with learned affinities to guide the cycle-consistency objective. Empirically, we show that Guided Contrastive Random Walks yield better spatio-temporal representations on two downstream tasks. More importantly, when using smaller patches, and therefore more connections between patches, our approach improves further, while the unguided baseline can no longer learn meaningful representations. |
| Document type | Conference contribution |
| Note | With supplementary video |
| Language | English |
| Published at | https://doi.org/10.1145/3512527.3531376 |
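The core idea in the abstract — mixing fixed prior affinities (from cues such as optical flow or saliency) with learned patch affinities, then enforcing cycle-consistency over a forward-backward walk — can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the paper's implementation: the convex-combination mixing rule, the `alpha` parameter, and all function names here are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def guided_affinity(learned_logits, prior, alpha=0.5):
    """Combine learned patch-to-patch affinities with a fixed prior.

    learned_logits : (N, N) similarity logits between patches in
                     consecutive frames (learned by the encoder).
    prior          : (N, N) row-stochastic prior transition matrix,
                     e.g. derived from optical flow or saliency.
    alpha          : mixing weight (assumption: a convex combination;
                     the paper's exact combination rule may differ).
    """
    learned = softmax(learned_logits, axis=-1)
    combined = (1.0 - alpha) * learned + alpha * prior
    # Re-normalize rows so the result is a valid transition matrix.
    return combined / combined.sum(axis=-1, keepdims=True)


def cycle_consistency_loss(forward_affinities):
    """Walk forward through frames, then backward to the start.

    The round-trip transition matrix should be close to identity:
    each starting patch should return to itself. The loss is the
    cross-entropy of the diagonal (self-return) probabilities.
    """
    n = forward_affinities[0].shape[0]
    walk = np.eye(n)
    # Forward pass: chain the per-frame transition matrices.
    for affinity in forward_affinities:
        walk = walk @ affinity
    # Backward pass: transposed, row-renormalized transitions.
    for affinity in reversed(forward_affinities):
        back = affinity.T / affinity.T.sum(axis=-1, keepdims=True)
        walk = walk @ back
    return -np.mean(np.log(np.diag(walk) + 1e-8))
```

A guided walk would then build each frame-to-frame affinity with `guided_affinity` and minimize `cycle_consistency_loss` over a video clip; with `alpha > 0` the prior constrains the walk even when patches are small and connections are dense, which is the regime where the abstract reports the unguided baseline fails.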