Teaching a New Dog Old Tricks: Contrastive Random Walks in Videos with Unsupervised Priors
| Authors | |
|---|---|
| Publication date | 2022 |
| Book title | ICMR '22: Proceedings of the 2022 International Conference on Multimedia Retrieval, June 27-30, 2022, Newark, NJ, USA |
| ISBN (electronic) | |
| Event | 2022 International Conference on Multimedia Retrieval |
| Pages (from-to) | 176-184 |
| Publisher | New York, NY: The Association for Computing Machinery |
| Organisations | |
| Abstract | This paper focuses on self-supervised representation learning in videos with guidance from multimodal priors. While the temporal dimension is commonly used as a supervision proxy for learning frame-level or clip-level representations, a number of recent works learn local representations in space and time through cycle-consistency. Given a starting patch, the contrastive goal is to track the patch through subsequent frames and then backtrack to the original frame, with the starting patch as the target. While effective for downstream tasks such as segmentation and body-joint propagation, affinities between patches must be learned from scratch. This setup not only requires many videos for self-supervised optimization, but also fails when using smaller patches and more connections between consecutive frames. On the other hand, multiple generic cues from multiple modalities provide valuable information about how patches should propagate in videos, from saliency and optical flow to photometric center biases. To that end, we introduce Guided Contrastive Random Walks. The main idea is to employ well-known multimodal priors to provide fixed prior affinities. We outline a general framework in which prior affinities are combined with learned affinities to guide the cycle-consistency objective. Empirically, we show that Guided Contrastive Random Walks yield better spatio-temporal representations on two downstream tasks. More importantly, when using smaller patches, and therefore more connections between patches, our approach improves further, while the unguided baseline can no longer learn meaningful representations. |
| Document type | Conference contribution |
| Note | With supplementary video |
| Language | English |
| Published at | https://doi.org/10.1145/3512527.3531376 |
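The core idea in the abstract — mixing fixed prior affinities (from cues such as optical flow or saliency) with learned patch affinities, then enforcing cycle-consistency over a forward-backward walk — can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the paper's implementation: the convex-combination mixing rule, the `alpha` parameter, and all function names here are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def guided_affinity(learned_logits, prior, alpha=0.5):
    """Combine learned patch-to-patch affinities with a fixed prior.

    learned_logits : (N, N) similarity logits between patches in
                     consecutive frames (learned by the encoder).
    prior          : (N, N) row-stochastic prior transition matrix,
                     e.g. derived from optical flow or saliency.
    alpha          : mixing weight (assumption: a convex combination;
                     the paper's exact combination rule may differ).
    """
    learned = softmax(learned_logits, axis=-1)
    combined = (1.0 - alpha) * learned + alpha * prior
    # Re-normalize rows so the result is a valid transition matrix.
    return combined / combined.sum(axis=-1, keepdims=True)


def cycle_consistency_loss(forward_affinities):
    """Walk forward through frames, then backward to the start.

    The round-trip transition matrix should be close to identity:
    each starting patch should return to itself. The loss is the
    cross-entropy of the diagonal (self-return) probabilities.
    """
    n = forward_affinities[0].shape[0]
    walk = np.eye(n)
    # Forward pass: chain the per-frame transition matrices.
    for affinity in forward_affinities:
        walk = walk @ affinity
    # Backward pass: transposed, row-renormalized transitions.
    for affinity in reversed(forward_affinities):
        back = affinity.T / affinity.T.sum(axis=-1, keepdims=True)
        walk = walk @ back
    return -np.mean(np.log(np.diag(walk) + 1e-8))
```

A guided walk would then build each frame-to-frame affinity with `guided_affinity` and minimize `cycle_consistency_loss` over a video clip; with `alpha > 0` the prior constrains the walk even when patches are small and connections are dense, which is the regime where the abstract reports the unguided baseline fails.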