Motion-Augmented Self-Training for Video Recognition at Smaller Scale

K. Gavrilyuk; M. Jain; I. Karmanov; C.G.M. Snoek

doi:https://doi.org/10.1109/ICCV48922.2021.01026

Authors	K. Gavrilyuk, M. Jain, I. Karmanov, C.G.M. Snoek
Publication date	2021
Book title	2021 IEEE/CVF International Conference on Computer Vision
Book subtitle	proceedings : ICCV 2021 : 11-17 October 2021, virtual event
ISBN	9781665428132
ISBN (electronic)	9781665428125
Series	International Conference on Computer Vision
Event	2021 IEEE/CVF International Conference on Computer Vision
Pages (from-to)	10409-10418
Publisher	Los Alamitos, California: IEEE Computer Society
Organisations	Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract	The goal of this paper is to self-train a 3D convolutional neural network on an unlabeled video collection for deployment on small-scale video collections. As smaller video datasets benefit more from motion than appearance, we strive to train our network using optical flow, but avoid its computation during inference. We propose the first motion-augmented self-training regime, which we call MotionFit. We start with supervised training of a motion model on a small, labeled video collection. With the motion model we generate pseudo-labels for a large unlabeled video collection, which enables us to transfer knowledge by learning to predict these pseudo-labels with an appearance model. Moreover, we introduce a multi-clip loss as a simple yet efficient way to improve the quality of the pseudo-labeling, even without additional auxiliary tasks. We also take into consideration the temporal granularity of videos during self-training of the appearance model, which was missed in previous works. As a result, we obtain a strong motion-augmented representation model suited for video downstream tasks like action recognition and clip retrieval. On small-scale video datasets, MotionFit outperforms alternatives for knowledge transfer by 5%-8%, video-only self-supervision by 1%-7% and semi-supervised learning by 9%-18% using the same amount of class labels.
Document type	Conference contribution
Note	With supplementary material.
Language	English
Published at	https://doi.org/10.1109/ICCV48922.2021.01026 (Final published version)
Published at	https://openaccess.thecvf.com/content/ICCV2021/html/Gavrilyuk_Motion-Augmented_Self-Training_for_Video_Recognition_at_Smaller_Scale_ICCV_2021_paper.html (Accepted author manuscript)
Other links	https://www.proceedings.com/61354.html
Downloads	Gavrilyuk_Motion-Augmented_Self-Training_for_Video_Recognition_at_Smaller_Scale_ICCV_2021_paper (Accepted author manuscript); Motion-Augmented_Self-Training_for_Video_Recognition_at_Smaller_Scale (Final published version)
Supplementary materials	Gavrilyuk_Motion-Augmented_Self-Training_for_ICCV_2021_supplemental
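
The pseudo-labeling step described in the abstract (a motion model labels unlabeled videos, and an appearance model learns to predict those labels) can be illustrated with a minimal, generic self-training sketch. This is not the authors' implementation: the score lists stand in for the outputs of hypothetical motion and appearance networks, the function names are invented for illustration, pseudo-labels are assumed here to be a simple argmax over motion-model scores, and the multi-clip loss and temporal-granularity handling mentioned in the abstract are not modeled.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(motion_scores: Sequence[float]) -> int:
    """Assumption: the pseudo-label is the top-scoring class of the motion model."""
    return max(range(len(motion_scores)), key=lambda i: motion_scores[i])


def cross_entropy(appearance_scores: Sequence[float], label: int) -> float:
    """Loss of the appearance model's prediction against a pseudo-label."""
    return -math.log(softmax(appearance_scores)[label])


# Toy stand-ins for network outputs on two unlabeled videos (hypothetical values).
# The motion model would see optical flow; the appearance model would see RGB frames.
motion_outputs = [[2.0, 0.1, -1.0], [0.3, 0.2, 1.5]]
appearance_outputs = [[1.0, 0.5, 0.0], [0.0, 0.0, 0.0]]

# Step 1: the motion model assigns a pseudo-label to each unlabeled video.
labels = [pseudo_label(s) for s in motion_outputs]

# Step 2: the appearance model is scored against those pseudo-labels;
# in real training this loss would be minimized by gradient descent.
losses = [cross_entropy(a, y) for a, y in zip(appearance_outputs, labels)]
```

Because only the appearance model is kept after training, optical flow is needed to produce the pseudo-labels but not at inference time, which matches the motivation stated in the abstract.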