LoVoRA: Text-guided and Mask-free Video Object Removal and Addition with Learnable Object-aware Localization

Zhihan Xiao¹, Lin Liu²✉†, Yixin Gao³, Xiaopeng Zhang², Haoxuan Che², Songping Mai¹✉, Qi Tian²

¹Tsinghua University, ²Huawei Inc., ³University of Science and Technology of China
†Project Leader, ✉Corresponding authors

Abstract

Text-guided video editing, particularly for object removal and addition, remains a challenging task due to the need for precise spatial and temporal consistency. Existing methods often rely on auxiliary masks or reference images for editing guidance, which limits their scalability and generalization. To address these issues, we propose LoVoRA, a novel framework for mask-free video object removal and addition using an object-aware localization mechanism. Our approach utilizes a unique dataset construction pipeline that integrates image-to-video translation, optical-flow-based mask propagation, and video inpainting, enabling temporally consistent edits. The core innovation of LoVoRA is its learnable object-aware localization mechanism, which provides dense spatio-temporal supervision for both object insertion and removal tasks. By leveraging a Diffusion Mask Predictor, LoVoRA achieves end-to-end video editing without requiring external control signals during inference. Extensive experiments and human evaluation demonstrate the effectiveness and high-quality performance of LoVoRA.

LoVoRA Dataset

Dataset Examples

Erase a pair of floral gardening gloves lying open on the sunny rooftop garden table.

Place a teddy bear on the lower bunk.

Replace the beanie with a classic cap.

Dataset Comparison

Our dataset emphasizes high-resolution, temporally consistent object manipulation with optical-flow-guided mask propagation. We also report VLM evaluation results obtained with MiniCPM-V2.6. PF denotes Prompt Following and EQ denotes Edit Quality.

Dataset	PF	EQ	Generation Basis
InsV2V	--	--	Modified Prompt-to-Prompt method
ICVE-SFT	--	--	Source object removal and inpainting
Senorita-2M	3.533	3.883	Source object removal and inpainting
InsViE-1M	3.133	3.667	Video inversion and reconstruction guidance
Ditto	4.417	4.733	Depth-guided generation from source video
Ours	4.375	4.850	Optical flow guided mask propagation

Object Removal

Remove the entire wooden boardwalk from the scene.

Remove the wooden stakes in the front of the horse.

Remove the fishing boat on the left.

Object Addition

Add a pair of realistic sunglasses to the man in the video.

Add a square red flag on the top of the boat.

Put a white helmet on the head of the hockey player.

Method Overview

Overview of the LoVoRA dataset construction pipeline. Starting from high-quality image editing pairs, we synthesize instruction-based video editing data through five stages: I2V translation, mask generation, optical flow estimation, mask propagation, and video inpainting.
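
The mask propagation stage can be illustrated with a minimal sketch. It assumes a dense forward flow field that gives a per-pixel (dx, dy) displacement between consecutive frames, and it warps a binary mask by nearest-neighbor forward splatting. The function names, the flow layout, and the splatting rule are illustrative assumptions, not the pipeline's actual implementation.

```python
from typing import List, Tuple

Mask = List[List[int]]                  # H x W binary mask (0 or 1)
Flow = List[List[Tuple[float, float]]]  # H x W per-pixel (dx, dy) displacement


def propagate_mask(mask: Mask, flow: Flow) -> Mask:
    """Forward-warp a binary mask by one frame using a dense flow field.

    Hypothetical helper: each masked pixel is moved along its flow vector
    and rounded to the nearest pixel; pixels leaving the frame are dropped.
    """
    height, width = len(mask), len(mask[0])
    warped = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            dx, dy = flow[y][x]
            nx, ny = int(round(x + dx)), int(round(y + dy))
            if 0 <= nx < width and 0 <= ny < height:
                warped[ny][nx] = 1
    return warped


def propagate_through_video(first_mask: Mask, flows: List[Flow]) -> List[Mask]:
    """Chain single-frame warps; flows[i] maps frame i to frame i + 1."""
    masks = [first_mask]
    for flow in flows:
        masks.append(propagate_mask(masks[-1], flow))
    return masks


# Toy example: a 2x2 object in a 4x4 frame moving one pixel to the right.
first = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
shift_right = [[(1.0, 0.0)] * 4 for _ in range(4)]
masks = propagate_through_video(first, [shift_right])
```

Forward splatting can leave holes where the flow diverges, so a practical pipeline would likely add hole filling or backward warping; the sketch omits this for brevity.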

Overview of the proposed LoVoRA framework. The input video is encoded by a spatio-temporal VAE to produce latents. Encoded latents are channel-concatenated with noisy target latents and processed by a DiT backbone to predict the rectified-flow velocity field. A Diffusion Mask Predictor reads selected DiT token features and predicts a spatio-temporal diff mask used during training.
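
The input construction described above can be sketched in a few lines. The sketch assumes the common rectified-flow convention in which t = 0 corresponds to pure noise and t = 1 to the clean target latent, so the velocity target is the difference between target and noise. It also assumes the source latents come first in the channel concatenation. The function name and the list-based latent layout are hypothetical; neither the time convention nor the concatenation order is stated in the overview.

```python
from typing import List, Tuple

# Latent layout (assumption): channels x flattened (time * height * width) positions.
Latent = List[List[float]]


def rectified_flow_example(
    source: Latent, target: Latent, noise: Latent, t: float
) -> Tuple[Latent, Latent]:
    """Build one training example for a rectified-flow video editor.

    Returns (model_input, velocity_target). Assumed convention:
    x_t = (1 - t) * noise + t * target, and velocity = target - noise.
    """
    # Linear interpolation between noise and the clean target latent.
    noisy_target = [
        [(1.0 - t) * n + t * x for n, x in zip(noise_ch, target_ch)]
        for noise_ch, target_ch in zip(noise, target)
    ]
    # The straight-line path has a constant velocity along t.
    velocity = [
        [x - n for n, x in zip(noise_ch, target_ch)]
        for noise_ch, target_ch in zip(noise, target)
    ]
    # Channel-wise concatenation: source channels, then noisy target channels.
    model_input = source + noisy_target
    return model_input, velocity


# Toy example with one channel and two positions.
source_latent = [[1.0, 2.0]]
target_latent = [[4.0, 8.0]]
noise_latent = [[0.0, 4.0]]
model_input, velocity = rectified_flow_example(
    source_latent, target_latent, noise_latent, t=0.5
)
```

In this sketch the backbone would receive `model_input` and be trained to regress `velocity`; the Diffusion Mask Predictor and its diff-mask supervision are not modeled here.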
