# VOID: Video Object and Interaction Deletion
VOID removes objects from videos along with all interactions they induce on the scene – not just secondary effects like shadows and reflections, but physical interactions like objects falling when a person is removed. It is built on top of CogVideoX and fine-tuned for video inpainting with interaction-aware mask conditioning.

Example: if a person holding a guitar is removed, VOID also removes the person's effect on the guitar – causing it to fall naturally.
teaser-with-name.mp4
## Models
VOID uses two transformer checkpoints, trained sequentially. You can run inference with Pass 1 alone or chain both passes for higher temporal consistency.
| Model | Description | HuggingFace |
|---|---|---|
| VOID Pass 1 | Base inpainting model | Download |
| VOID Pass 2 | Warped-noise refinement model | Download |
Place checkpoints anywhere and pass the path via `--config.video_model.transformer_path` (Pass 1) or `--model_checkpoint` (Pass 2).
## Quick Start
The fastest way to try VOID is the included notebook – it handles setup, downloads the models, runs inference on a sample video, and displays the result.
Note: Requires a GPU with 40GB+ VRAM (e.g., A100).
For more control over the pipeline (custom videos, Pass 2 refinement, mask generation), see the full setup and instructions below.
## Setup
```bash
pip install -r requirements.txt
```
Stage 1 of the mask pipeline uses Gemini via the Google AI API. Set your API key:
```bash
export GEMINI_API_KEY=your_key_here
```

Also install SAM2 separately (required for mask generation):

```bash
git clone https://github.com/facebookresearch/sam2.git
cd sam2 && pip install -e .
```
Download the pretrained base inpainting model from HuggingFace:
```bash
hf download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP \
  --local-dir ./CogVideoX-Fun-V1.5-5b-InP
```

The inference and training scripts expect it at `./CogVideoX-Fun-V1.5-5b-InP` relative to the repo root by default.
If ffmpeg is not available on your system, you can use the binary bundled with imageio-ffmpeg:

```bash
ln -sf $(python -c "import imageio_ffmpeg; print(imageio_ffmpeg.get_ffmpeg_exe())") ~/.local/bin/ffmpeg
```
### Expected directory structure
After cloning the repo and downloading all assets, your directory should look like this:
```
VOID/
├── config/
├── datasets/
│   └── void_train_data.json
├── inference/
├── sample/                      # included sample sequences for inference
├── scripts/
├── videox_fun/
├── VLM-MASK-REASONER/
├── README.md
├── requirements.txt
│
├── CogVideoX-Fun-V1.5-5b-InP/   # hf download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP
├── void_pass1.safetensors       # download from huggingface.co/void-model (see Models above)
├── void_pass2.safetensors       # download from huggingface.co/void-model (see Models above)
├── training_data/               # generated via data_generation/ pipeline (see Training section)
└── data_generation/             # data generation code (HUMOTO + Kubric pipelines)
```
## Input Format
Each video sequence lives in its own folder under a root data directory:
```
data_rootdir/
└── my-video/
    ├── input_video.mp4   # source video
    ├── quadmask_0.mp4    # quadmask (4-value mask video, see below)
    └── prompt.json       # {"bg": "background description"}
```
The `prompt.json` contains a single `"bg"` key describing the scene after the object has been removed – i.e., what you want the background to look like. Do not describe the object being removed; describe what remains.
```json
{ "bg": "A table with a cup on it." }            // ✅ describes the clean background
{ "bg": "A person being removed from scene." }   // ❌ don't describe the removal
```

A few examples from the included samples:
| Sequence | Removed object | bg prompt |
|---|---|---|
| `lime` | the glass | "A lime falls on the table." |
| `moving_ball` | the rubber duckie | "A ball rolls off the table." |
| `pillow` | the kettlebell being placed on the pillow | "Two pillows are on the table." |
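Before running inference, it can help to sanity-check that each sequence folder matches the layout above. A minimal sketch using only the standard library (the helper name and exact checks are illustrative, not part of the repo):

```python
import json
from pathlib import Path

def check_sequence(seq_dir):
    """Sanity-check one sequence folder against the expected layout.

    Returns a list of problems; an empty list means the folder looks
    ready for inference.
    """
    seq_dir = Path(seq_dir)
    problems = []
    # The three files every sequence folder must contain.
    for name in ("input_video.mp4", "quadmask_0.mp4", "prompt.json"):
        if not (seq_dir / name).exists():
            problems.append(f"missing {name}")
    prompt_path = seq_dir / "prompt.json"
    if prompt_path.exists():
        prompt = json.loads(prompt_path.read_text())
        if "bg" not in prompt:
            problems.append('prompt.json has no "bg" key')
    return problems
```

Running it over every subfolder of `data_rootdir` before a long batch job catches typos in filenames early.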
The quadmask encodes four semantic regions per pixel:

| Value | Meaning |
|---|---|
| `0` | Primary object to remove |
| `63` | Overlap of primary + affected regions |
| `127` | Affected region (interactions: falling objects, displaced items, etc.) |
| `255` | Background (keep) |
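Since video codecs can perturb stored pixel values slightly, a reader of these masks may want to snap each pixel to the nearest canonical value before interpreting it. A minimal numpy sketch of decoding one quadmask frame (illustrative only; not the repo's own loader):

```python
import numpy as np

# Canonical quadmask values: primary, overlap, affected, background.
QUAD_VALUES = np.array([0, 63, 127, 255])

def decode_quadmask(frame):
    """Snap a (possibly codec-perturbed) grayscale frame to the nearest
    quadmask value and return one boolean map per semantic region."""
    frame = np.asarray(frame, dtype=np.int16)
    # Distance of every pixel to each canonical value -> index of nearest.
    idx = np.abs(frame[..., None] - QUAD_VALUES).argmin(axis=-1)
    snapped = QUAD_VALUES[idx]
    return {
        "primary": snapped == 0,
        "overlap": snapped == 63,
        "affected": snapped == 127,
        "background": snapped == 255,
    }
```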
## Pipeline

### Stage 1 – Generate Masks
The VLM-MASK-REASONER/ pipeline generates quadmasks from raw videos using SAM2 segmentation and a VLM (Gemini) for reasoning about interaction-affected regions.
#### Step 0 – Select points (GUI)
```bash
python VLM-MASK-REASONER/point_selector_gui.py
```
Load a JSON config listing your videos and instructions, then click on the objects to remove. Saves a `*_points.json` with the selected points.
Config format:
```json
{
  "videos": [
    {
      "video_path": "path/to/video.mp4",
      "output_dir": "path/to/output/folder",
      "instruction": "remove the person"
    }
  ]
}
```

#### Steps 1–4 – Run the full pipeline
After saving the points config, run all remaining stages automatically:
```bash
bash VLM-MASK-REASONER/run_pipeline.sh my_config_points.json
```
Optional flags:
```bash
bash VLM-MASK-REASONER/run_pipeline.sh my_config_points.json \
  --sam2-checkpoint path/to/sam2_hiera_large.pt \
  --device cuda
```

This runs the following stages in order:
| Stage | Script | Output |
|---|---|---|
| 1 – SAM2 segmentation | `stage1_sam2_segmentation.py` | `black_mask.mp4` |
| 2 – VLM analysis | `stage2_vlm_analysis.py` | `vlm_analysis.json` |
| 3 – Grey mask generation | `stage3a_generate_grey_masks_v2.py` | `grey_mask.mp4` |
| 4 – Combine into quadmask | `stage4_combine_masks.py` | `quadmask_0.mp4` |
The final `quadmask_0.mp4` in each video's `output_dir` is ready to use for inference.
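For intuition, the combination step can be sketched as follows: given a boolean primary-object ("black") mask and a boolean affected-region ("grey") mask, write the four canonical values in priority order. This is an illustrative reimplementation of the encoding described in the Input Format section, not the actual `stage4_combine_masks.py`:

```python
import numpy as np

def combine_to_quadmask(black_mask, grey_mask):
    """Combine a primary-object mask and an affected-region mask into the
    four-value quadmask encoding (0 / 63 / 127 / 255).

    black_mask, grey_mask: boolean arrays, True where the region applies.
    """
    black_mask = np.asarray(black_mask, dtype=bool)
    grey_mask = np.asarray(grey_mask, dtype=bool)
    quad = np.full(black_mask.shape, 255, dtype=np.uint8)  # background by default
    quad[grey_mask] = 127                  # affected region
    quad[black_mask] = 0                   # primary object (overrides affected)
    quad[black_mask & grey_mask] = 63      # overlap of primary + affected
    return quad
```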
### Stage 2 – Inference
VOID inference runs in two passes. Pass 1 is sufficient for most videos; Pass 2 adds a warped-noise refinement step for better temporal consistency on longer clips.
#### Pass 1 – Base inference
```bash
python inference/cogvideox_fun/predict_v2v.py \
  --config config/quadmask_cogvideox.py \
  --config.data.data_rootdir="path/to/data_rootdir" \
  --config.experiment.run_seqs="my-video" \
  --config.experiment.save_path="path/to/output" \
  --config.video_model.model_name="path/to/CogVideoX-Fun-V1.5-5b-InP" \
  --config.video_model.transformer_path="path/to/void_pass1.safetensors"
```

To run multiple sequences at once, pass a comma-separated list:

```bash
--config.experiment.run_seqs="video1,video2,video3"
```

Key config options:
| Flag | Default | Description |
|---|---|---|
| `--config.data.sample_size` | `384x672` | Output resolution (HxW) |
| `--config.data.max_video_length` | `197` | Max frames to process |
| `--config.video_model.temporal_window_size` | `85` | Temporal window for multidiffusion |
| `--config.video_model.num_inference_steps` | `50` | Denoising steps |
| `--config.video_model.guidance_scale` | `1.0` | Classifier-free guidance scale |
| `--config.system.gpu_memory_mode` | `model_cpu_offload_and_qfloat8` | Memory mode (`model_full_load`, `model_cpu_offload`, `sequential_cpu_offload`) |
The output is saved as `<save_path>/<sequence_name>.mp4`, along with a `*_tuple.mp4` side-by-side comparison.
#### Pass 2 – Warped noise refinement
Uses optical flow-warped latents from the Pass 1 output to initialize a second inference pass, improving temporal consistency.
Single video:

```bash
python inference/cogvideox_fun/inference_with_pass1_warped_noise.py \
  --video_name my-video \
  --data_rootdir path/to/data_rootdir \
  --pass1_dir path/to/pass1_outputs \
  --output_dir path/to/pass2_outputs \
  --model_checkpoint path/to/void_pass2.safetensors \
  --model_name path/to/CogVideoX-Fun-V1.5-5b-InP
```

Batch: edit the video list and paths in `inference/pass_2_refine.sh`, then run:

```bash
bash inference/pass_2_refine.sh
```
Key arguments:
| Argument | Default | Description |
|---|---|---|
| `--pass1_dir` | (required) | Directory containing Pass 1 output videos |
| `--output_dir` | `./inference_with_warped_noise` | Where to save Pass 2 results |
| `--warped_noise_cache_dir` | `./pass1_warped_noise_cache` | Cache for precomputed warped latents |
| `--temporal_window_size` | `85` | Temporal window size |
| `--height` / `--width` | `384` / `672` | Output resolution |
| `--guidance_scale` | `6.0` | CFG scale |
| `--num_inference_steps` | `50` | Denoising steps |
| `--use_quadmask` | `True` | Use quadmask conditioning |
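The idea behind warped-noise initialization can be sketched in a few lines: each output pixel of the noise map pulls its value from the location the optical flow points to, so the same noise "sticks" to the same content across frames. A simplified nearest-neighbor numpy sketch – the released code operates on latents and uses Go-with-the-Flow-style warping, so treat this as illustration only, with `warp_noise` a hypothetical name:

```python
import numpy as np

def warp_noise(noise, flow):
    """Backward-warp an (H, W) noise map by an optical flow field (H, W, 2).

    flow[y, x] = (dx, dy) points from this frame toward the reference frame,
    so each output pixel pulls noise from (x + dx, y + dy), nearest-neighbor,
    clamped at the image border.
    """
    h, w = noise.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return noise[src_y, src_x]
```

Chaining this frame to frame is what keeps the initial noise temporally coherent across a clip.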
### Stage 3 – Manual Mask Refinement (Optional)
If the auto-generated quadmask does not accurately capture the object or its interaction region, use the included GUI editor to refine it before running inference.
```bash
python VLM-MASK-REASONER/edit_quadmask.py
```
Open a sequence folder containing `input_video.mp4` (or `rgb_full.mp4`) and `quadmask_0.mp4`. The editor shows the original video and the editable mask side by side.
Tools:
- Grid Toggle – click a grid cell to toggle the interaction region (`127` ↔ `255`)
- Grid Black Toggle – click a grid cell to toggle the primary object region (`0` ↔ `255`)
- Brush (Add / Erase) – freehand paint or erase mask regions at pixel level
- Copy from Previous Frame – propagate the black or grey mask from the previous frame

Keyboard shortcuts: ← / → navigate frames, Ctrl+Z / Ctrl+Y undo/redo.

Save overwrites `quadmask_0.mp4` in place. Rerun inference from Pass 1 after saving.
## Training

### Training Data Generation
Due to licensing constraints on the underlying datasets, we release the data generation code instead of the pre-built training data. The code produces paired counterfactual videos (with/without object, plus quad-masks) from two sources:
#### Source 1: HUMOTO (Human-Object Interaction)
Generates counterfactual videos from the HUMOTO motion capture dataset using Blender. A human (Remy/Sophie character) interacts with objects; removing the human causes objects to fall via physics simulation.
Prerequisites:
- HUMOTO dataset – request access from the authors at adobe-research/humoto. Once approved, download and place under `data_generation/humoto_release/`
- Blender – install Blender (tested with 3.x and 4.x). Also install `opencv-python-headless` in Blender's Python (see `data_generation/README.md`)
- Remy & Sophie characters – download from Mixamo (free Adobe account). Search for "Remy" and "Sophie", download each as FBX, and place at `data_generation/human_model/Remy_mixamo_bone.fbx` and `data_generation/human_model/Sophie_mixamo_bone.fbx`
- PBR textures (optional) – download texture packs from ambientCG or Poly Haven. Without textures, objects render with realistic solid colors as a fallback
Expected directory structure after setup:
```
data_generation/
├── humoto_release/
│   ├── humoto_0805/           # HUMOTO sequences (.pkl, .fbx, .yaml per sequence)
│   └── humoto_objects_0805/   # Object meshes (.obj, .fbx per object)
├── human_model/
│   ├── Remy_mixamo_bone.fbx   # ← download from Mixamo
│   ├── Sophie_mixamo_bone.fbx # ← download from Mixamo
│   ├── bone_names.py          # included
│   └── *.json                 # included (bone structure definitions)
├── textures/                  # ← optional, user-provided PBR textures
├── physics_config.json        # included (manual per-sequence physics settings)
├── render_paired_videos_blender_quadmask.py  # main renderer
├── convert_split_remy_sophie.sh              # character conversion script
└── ...
```
Pipeline:
```bash
cd data_generation

# 1. Convert HUMOTO sequences to Remy/Sophie characters
bash convert_split_remy_sophie.sh

# 2. Render paired videos (with human, without human, quad-mask)
blender --background --python render_paired_videos_blender_quadmask.py -- \
  -d ./humoto_release/humoto_0805 \
  -o ./output \
  -s <sequence_name> \
  -m ./humoto_release/humoto_objects_0805 \
  --use_characters --enable_physics --add_walls \
  --target_frames 60 --fps 12
```
A pre-configured `physics_config.json` is included, specifying which objects are static vs. dynamic per sequence. See `data_generation/README.md` for full details.
#### Source 2: Kubric (Object-Only Interaction)

Generates counterfactual videos using Kubric with Google Scanned Objects. Objects are launched at a target; removing them alters the target's physics trajectory. No external dataset download required – assets are fetched from Google Cloud Storage.
```bash
cd data_generation
pip install kubric pybullet imageio imageio-ffmpeg
python kubric_variable_objects.py --num_pairs 200 --resolution 384
```

### Training Data Format
Both pipelines output the same format expected by the training scripts:
```
training_data/
└── sequence_name/
    ├── rgb_full.mp4     # input video (with object)
    ├── rgb_removed.mp4  # target video (object removed, physics applied)
    ├── mask.mp4         # quad-mask (0/63/127/255)
    └── metadata.json
```
Point the training scripts at your generated data by updating `datasets/void_train_data.json`.
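As a convenience, the metadata file can be regenerated by scanning the output directory for complete sequences. The entry schema below is a guess for illustration – match it to the fields `datasets/void_train_data.json` actually uses in your checkout:

```python
import json
from pathlib import Path

def build_train_meta(training_root, out_path):
    """Scan training_data/ for complete sequences and write a metadata list.

    A sequence counts as complete when all four files from the training
    data format above are present. The per-entry keys here are hypothetical.
    """
    entries = []
    required = ("rgb_full.mp4", "rgb_removed.mp4", "mask.mp4", "metadata.json")
    for seq in sorted(Path(training_root).iterdir()):
        if seq.is_dir() and all((seq / f).exists() for f in required):
            entries.append({
                "video_path": str(seq / "rgb_full.mp4"),
                "target_path": str(seq / "rgb_removed.mp4"),
                "mask_path": str(seq / "mask.mp4"),
            })
    Path(out_path).write_text(json.dumps(entries, indent=2))
    return entries
```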
### Running Training
Training proceeds in two stages. Pass 1 is trained first, then Pass 2 fine-tunes from that checkpoint.
#### Pass 1 – Base inpainting model
Does not require warped noise. Trains the model to remove objects and their interactions from scratch.
```bash
bash scripts/cogvideox_fun/train_void.sh
```
Key arguments:
| Argument | Description |
|---|---|
| `--pretrained_model_name_or_path` | Path to base CogVideoX inpainting model |
| `--transformer_path` | Optional starting checkpoint |
| `--train_data_meta` | Path to dataset metadata JSON |
| `--train_mode="void"` | Enables VOID inpainting training mode |
| `--use_quadmask` | Trains with 4-value quadmask conditioning |
| `--use_vae_mask` | Encodes the mask through the VAE |
| `--output_dir` | Where to save checkpoints |
| `--num_train_epochs` | Number of epochs |
| `--checkpointing_steps` | Save a checkpoint every N steps |
| `--learning_rate` | Default `1e-5` |
#### Pass 2 – Warped noise refinement model

Continues training from a Pass 1 checkpoint with optical-flow-warped latent initialization, improving temporal consistency on longer videos. Requires precomputed warped noise for the training data.
```bash
bash scripts/cogvideox_fun/train_void_warped_noise.sh
```

Set `TRANSFORMER_PATH` to your Pass 1 checkpoint before running:

```bash
TRANSFORMER_PATH=path/to/pass1_checkpoint.safetensors bash scripts/cogvideox_fun/train_void_warped_noise.sh
```
Additional arguments specific to this stage:
| Argument | Description |
|---|---|
| `--use_warped_noise` | Enables warped latent initialization during training |
| `--warped_noise_degradation` | Noise blending factor (default `0.3`) |
| `--warped_noise_probability` | Fraction of steps using warped noise (default `1.0`) |
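One plausible reading of `--warped_noise_degradation` is a variance-preserving blend between the warped noise and fresh Gaussian noise – this is an assumption about the flag's semantics, sketched here only for intuition:

```python
import numpy as np

def degrade_warped_noise(warped, degradation=0.3, rng=None):
    """Blend warped noise with fresh Gaussian noise (hypothetical semantics).

    With both inputs ~ N(0, 1), weights sqrt(1 - d) and sqrt(d) keep the
    result at unit variance; d = 0 returns the warped noise unchanged and
    d = 1 discards it entirely.
    """
    rng = np.random.default_rng() if rng is None else rng
    fresh = rng.standard_normal(warped.shape)
    return np.sqrt(1.0 - degradation) * warped + np.sqrt(degradation) * fresh
```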
Training was run on 8Γ A100 80GB GPUs using DeepSpeed ZeRO stage 2.
## Community Adoption
We are excited to see the community build on VOID! Below we showcase selected demos, tools, and extensions. If you've built something using VOID, feel free to submit a PR to add it here.
### Demos & Projects
- Gradio Demo – @sam-motamed
  Interactive demo for trying VOID in the browser:
  https://huggingface.co/spaces/sam-motamed/VOID
## Acknowledgements

This implementation builds on code and models from aigc-apps/VideoX-Fun, Gen-Omnimatte, Go-with-the-Flow, Kubric, and HUMOTO. We thank the authors for sharing their code, the pretrained CogVideoX and Gen-Omnimatte inpainting models, and the optical flow warping utilities.
## Star History
## Citation
If you find our work useful, please consider citing:
https://arxiv.org/abs/2604.02296

```bibtex
@misc{motamed2026void,
  title={VOID: Video Object and Interaction Deletion},
  author={Saman Motamed and William Harvey and Benjamin Klein and Luc Van Gool and Zhuoning Yuan and Ta-Ying Cheng},
  year={2026},
  eprint={2604.02296},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.02296}
}
```
