ActionBytes: Learning from Trimmed Videos to Localize Actions

M. Jain; A. Ghodrati; C.G.M. Snoek

doi:10.1109/CVPR42600.2020.00125

Authors	M. Jain; A. Ghodrati; C.G.M. Snoek
Publication date	2020
Book title	2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Book subtitle	proceedings : virtual, 14-19 June 2020
ISBN	9781728171692
ISBN (electronic)	9781728171685
Series	CVPR
Event	2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Pages (from-to)	1168-1177
Publisher	Los Alamitos, California: IEEE Computer Society
Organisations	Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract	This paper tackles the problem of localizing actions in long untrimmed videos. Unlike existing works, which all use annotated untrimmed videos during training, we learn only from short trimmed videos. This enables learning from large-scale datasets originally designed for action classification. We propose a method to train an action localization network that segments a video into interpretable fragments, which we call ActionBytes. Our method jointly learns to cluster ActionBytes and trains the localization network using the cluster assignments as pseudo-labels. By doing so, we train on short trimmed videos that become untrimmed for ActionBytes. In isolation, or when merged, the ActionBytes also serve as effective action proposals. Experiments demonstrate that our boundary-guided training generalizes to unknown action classes and localizes actions in long videos of Thumos14, MultiThumos, and ActivityNet1.2. Furthermore, we show the advantage of ActionBytes for zero-shot localization as well as for traditional weakly supervised localization, which trains on long videos, achieving state-of-the-art results.
Document type	Conference contribution
Language	English
Published at	https://doi.org/10.1109/CVPR42600.2020.00125
Downloads	09157526 (Final published version)
