OV-VIS: Open-Vocabulary Video Instance Segmentation

H. Wang; C. Yan; K. Chen; X. Jiang; X. Tang; Y. Hu; G. Kang; W. Xie; E. Gavves

doi:https://doi.org/10.1007/s11263-024-02076-w

Authors	H. Wang C. Yan K. Chen X. Jiang X. Tang Y. Hu G. Kang W. Xie E. Gavves
Publication date	11-2024
Journal	International Journal of Computer Vision
Volume | Issue number	132 | 11
Pages (from-to)	5048-5065
Organisations	Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract	Conventionally, Video Instance Segmentation (VIS) aims to segment and categorize objects in videos from a closed set of training categories, and therefore lacks the ability to generalize to novel categories in real-world videos. To address this limitation, we make the following three contributions. First, we introduce the novel task of Open-Vocabulary Video Instance Segmentation (OV-VIS), which aims to simultaneously segment, track, and classify objects in videos from open-set categories, including novel categories unseen during training. Second, to benchmark OV-VIS, we collect a Large-Vocabulary Video Instance Segmentation dataset (LV-VIS) that contains well-annotated objects from 1196 diverse categories, exceeding the category count of existing datasets by more than an order of magnitude. Third, we propose a transformer-based OV-VIS model, OV2Seg+, which associates per-frame segmentation masks with a memory-induced transformer and classifies objects in videos with a voting module given language guidance. In addition, to monitor progress, we set up evaluation protocols for OV-VIS and propose a set of strong baseline models to facilitate future work. Extensive experiments on LV-VIS and four existing VIS datasets demonstrate the strong zero-shot generalization ability of OV2Seg+. The dataset and code are released at https://github.com/haochenheheda/LVVIS. The competition website is available at https://www.codabench.org/competitions/1748.
Document type	Article
Language	English
Published at	https://doi.org/10.1007/s11263-024-02076-w (Final published version)
Other links	https://github.com/haochenheheda/LVVIS
Downloads	OV-VIS (Final published version)