Method Overview
Overview of the LoVoRA dataset construction pipeline. Starting from high-quality image editing pairs, we synthesize instruction-based video editing data through five stages: I2V translation, mask generation, optical flow estimation, mask propagation, and video inpainting.
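To make the mask propagation stage concrete, the following is a minimal sketch of how a per-frame mask could be carried to the next frame using estimated optical flow. It is an illustration under stated assumptions, not the pipeline's actual implementation: the function name `propagate_mask`, the use of backward flow (from frame t+1 to frame t, in pixels, stored as `(dx, dy)`), and nearest-neighbor sampling are all hypothetical choices.

```python
import numpy as np

def propagate_mask(mask_prev, backward_flow):
    """Warp a binary mask from frame t to frame t+1 (hypothetical helper).

    mask_prev:     (H, W) array, mask at frame t.
    backward_flow: (H, W, 2) array; for each pixel of frame t+1 it gives the
                   (dx, dy) offset to the matching pixel in frame t (assumed).
    """
    h, w = mask_prev.shape
    ys, xs = np.mgrid[0:h, 0:w]

    # Location in frame t that each pixel of frame t+1 came from.
    src_x = np.rint(xs + backward_flow[..., 0]).astype(int)
    src_y = np.rint(ys + backward_flow[..., 1]).astype(int)

    # Pixels whose source falls outside the frame are left unmasked.
    inside = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)

    mask_next = np.zeros_like(mask_prev)
    mask_next[inside] = mask_prev[src_y[inside], src_x[inside]]
    return mask_next
```

Applying this frame by frame would spread a mask defined on the first frame across the clip; a real pipeline would likely add occlusion handling and sub-pixel interpolation, which are omitted here.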
Overview of the proposed LoVoRA framework. The input video is encoded by a spatio-temporal VAE to produce latents. Encoded latents are channel-concatenated with noisy target latents and processed by a DiT backbone to predict the rectified-flow velocity field. A Diffusion Mask Predictor reads selected DiT token features and predicts a spatio-temporal diff mask used during training.
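The conditioning and training target described in this caption can be sketched as follows. This is a simplified illustration, not the authors' code: the function name `make_training_example`, the `(C, T, H, W)` latent layout, and the interpolation convention (noise at t = 0, clean target at t = 1, so the velocity is target minus noise) are assumptions, since rectified-flow conventions vary between implementations.

```python
import numpy as np

def make_training_example(source_latents, target_latents, t, rng):
    """Build one DiT input and its velocity target (hypothetical helper).

    source_latents, target_latents: (C, T, H, W) VAE latents of the input
    and edited videos. t is a scalar in [0, 1].
    """
    noise = rng.standard_normal(target_latents.shape)

    # Straight-line path between noise and the clean target latents
    # (assumed convention: t = 0 is pure noise, t = 1 is clean data).
    noisy_target = (1.0 - t) * noise + t * target_latents

    # Along a straight path the velocity is constant.
    velocity_target = target_latents - noise

    # Channel-wise concatenation of the condition and the noisy target.
    model_input = np.concatenate([source_latents, noisy_target], axis=0)
    return model_input, velocity_target, noisy_target
```

With this convention, the backbone receives a tensor with twice the latent channel count and is trained to regress `velocity_target`; the mask predictor branch mentioned in the caption is not modeled in this sketch.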