Object-Centric Diffusion for Efficient Video Editing

Authors
  • K. Kahatapitiya
  • A. Karjauv
  • D. Abati
  • F. Porikli
Publication date 2025
Host editors
  • A. Leonardis
  • E. Ricci
  • S. Roth
  • O. Russakovsky
  • T. Sattler
  • G. Varol
Book title Computer Vision – ECCV 2024
Book subtitle 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings
ISBN
  • 9783031729973
ISBN (electronic)
  • 9783031729980
Series Lecture Notes in Computer Science
Event The 18th European Conference on Computer Vision ECCV 2024
Volume LVII
Pages (from-to) 91–108
Publisher Cham: Springer
Organisations
  • Faculty of Science (FNWI) - Informatics Institute (IVI)
Abstract
Diffusion-based video editing has reached impressive quality and can transform the global style, local structure, or attributes of given video inputs, following textual edit prompts. However, such solutions typically incur heavy memory and computational costs to generate temporally coherent frames, either in the form of diffusion inversion and/or cross-frame attention. In this paper, we conduct an analysis of such inefficiencies, and suggest simple yet effective modifications that allow significant speed-ups whilst maintaining quality. Moreover, we introduce Object-Centric Diffusion to fix generation artifacts and further reduce latency by allocating more computation towards foreground edited regions, which are arguably more important for perceptual quality. We achieve this with two novel proposals: i) Object-Centric Sampling, which decouples the diffusion steps spent on salient and background regions, spending most on the former, and ii) Object-Centric Token Merging, which reduces the cost of cross-frame attention by fusing redundant tokens in unimportant background regions. Both techniques are readily applicable to a given video editing model without retraining, and can drastically reduce its memory and computational cost. We evaluate our proposals on inversion-based and control-signal-based editing pipelines, and show a latency reduction of up to 10× for comparable synthesis quality. Project page: qualcomm-ai-research.github.io/object-centric-diffusion.
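To give a concrete flavour of the second proposal, the sketch below illustrates the general idea behind Object-Centric Token Merging: background tokens are redundant across frames, so the most similar ones can be fused (here via a simple bipartite matching on cosine similarity, in the spirit of token-merging methods) while foreground tokens are kept intact. This is a minimal illustrative sketch, not the authors' implementation; the function name, the pairwise averaging, and the `merge_ratio` parameter are assumptions made for clarity.

```python
import numpy as np

def object_centric_token_merge(tokens, fg_mask, merge_ratio=0.5):
    """Illustrative sketch (not the paper's code): fuse the most redundant
    background tokens while leaving all foreground tokens untouched.

    tokens:  (N, D) array of token features
    fg_mask: (N,) boolean array, True for foreground (edited-object) tokens
    merge_ratio: fraction of background tokens to merge away (assumed knob)
    """
    fg = tokens[fg_mask]       # foreground tokens: never merged
    bg = tokens[~fg_mask]      # background tokens: merge candidates
    n_merge = int(len(bg) * merge_ratio)
    if n_merge == 0 or len(bg) < 2:
        return np.concatenate([fg, bg], axis=0)

    # Bipartite matching: split background tokens into alternating
    # "source" and "target" sets, match each source to its most
    # similar target by cosine similarity.
    src, dst = bg[0::2].copy(), bg[1::2].copy()
    normed = lambda x: x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    sim = normed(src) @ normed(dst).T      # (|src|, |dst|) cosine similarities
    best_dst = sim.argmax(axis=1)          # best target for each source
    best_sim = sim.max(axis=1)

    # Merge the n_merge most redundant sources into their targets (average).
    n_merge = min(n_merge, len(src))
    merge_ids = np.argsort(-best_sim)[:n_merge]
    keep_ids = np.setdiff1d(np.arange(len(src)), merge_ids)
    for i in merge_ids:
        j = best_dst[i]
        dst[j] = 0.5 * (dst[j] + src[i])   # fuse source into its target

    # Foreground is preserved exactly; background shrinks by n_merge tokens.
    return np.concatenate([fg, src[keep_ids], dst], axis=0)
```

Because merging only removes background tokens, the quadratic cost of cross-frame attention drops while the edited object keeps its full token budget.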
Document type Conference contribution
Note With supplementary material
Language English
Published at https://doi.org/10.1007/978-3-031-72998-0_6
Other links http://qualcomm-ai-research.github.io/object-centric-diffusion