Towards Open-Vocabulary Video Instance Segmentation

H. Wang; C. Yan; S. Wang; X. Jiang; X. Tang; Y. Hu; W. Xie; E. Gavves

doi:https://doi.org/10.48550/arXiv.2304.01715

Authors	H. Wang; C. Yan; S. Wang; X. Jiang; X. Tang; Y. Hu; W. Xie; E. Gavves
Publication date	2023
Book title	2023 IEEE/CVF International Conference on Computer Vision
Book subtitle	ICCV 2023 : Paris, France, 2-6 October 2023 : proceedings
ISBN	9798350307191
ISBN (electronic)	9798350307184
Event	2023 IEEE/CVF International Conference on Computer Vision (ICCV)
Pages (from-to)	4034-4043
Publisher	Los Alamitos, California: IEEE Computer Society
Organisations	Faculty of Economics and Business (FEB) - Amsterdam Business School Research Institute (ABS-RI); Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract	Video Instance Segmentation (VIS) aims to segment and categorize objects in videos from a closed set of training categories, and therefore lacks the ability to generalize to novel categories in real-world videos. To address this limitation, we make the following three contributions. First, we introduce the novel task of Open-Vocabulary Video Instance Segmentation, which aims to simultaneously segment, track, and classify objects in videos from open-set categories, including novel categories unseen during training. Second, to benchmark Open-Vocabulary VIS, we collect a Large-Vocabulary Video Instance Segmentation dataset (LV-VIS) that contains well-annotated objects from 1,196 diverse categories, surpassing the category size of existing datasets by more than one order of magnitude. Third, we propose an efficient Memory-Induced Transformer architecture, OV2Seg, the first to achieve Open-Vocabulary VIS in an end-to-end manner with near real-time inference speed. Extensive experiments on LV-VIS and four existing VIS datasets demonstrate the strong zero-shot generalization ability of OV2Seg on novel categories. The dataset and code are released at https://github.com/haochenheheda/LVVIS.
Document type	Conference contribution
Note	With supplemental file
Language	English
Published at	https://doi.org/10.48550/arXiv.2304.01715 (Accepted author manuscript); https://doi.org/10.1109/ICCV51070.2023.00375 (Final published version)
Published at	https://openaccess.thecvf.com/content/ICCV2023/html/Wang_Towards_Open-Vocabulary_Video_Instance_Segmentation_ICCV_2023_paper.html (Accepted author manuscript)
Other links	https://github.com/haochenheheda/LVVIS; https://www.proceedings.com/72328.html
Downloads	Wang_Towards_Open-Vocabulary_Video_Instance_Segmentation_ICCV_2023_paper (Accepted author manuscript)
Supplementary materials	Wang_Towards_Open-Vocabulary_Video_ICCV_2023_supplemental
