GitHub - kyutai-labs/ovie: Official implementation and models for OVIE (One View Is Enough! Monocular Training for In-the-Wild Novel View Generation)


Monocular Training for In-the-Wild Novel View Generation

Project Page Paper Model License

This repository contains the official implementation and models for OVIE (One View Is Enough! Monocular Training for In-the-Wild Novel View Generation).

OVIE is a framework for monocular novel view synthesis that does not require multi-view image pairs for supervision. Instead, it is trained entirely on unpaired internet images.

OVIE teaser


๐Ÿ› ๏ธ Installation

We use uv by Astral to manage the Python environment and dependencies. It is a drastically faster drop-in replacement for standard Python packaging tools.

1. Install uv: For macOS/Linux:

curl -LsSf https://astral.sh/uv/install.sh | sh

(Alternatively, you can install it via macOS Homebrew: brew install uv, or refer to the official documentation for Windows instructions.)

2. Clone this repository and sync dependencies: Once uv is installed, clone the project and run uv sync. This will automatically resolve the required Python version (3.10.9) and install all dependencies from uv.lock.

git clone https://github.com/AdrienRR/ovie.git
cd ovie
uv sync

Prefix all commands with uv run to ensure they run inside the managed environment.


📥 Model Weights

Pretrained weights are hosted on the Hugging Face Hub at kyutai/ovie and are downloaded automatically when using from_pretrained (see Inference below).

For evaluation and training, local checkpoint files are also required. Download them from the Releases page and place them inside the assets/ folder:

  • ovie.pt – main checkpoint used for evaluation (contains EMA weights).
  • dino_vit_small_patch8_224.pth – used only for training; same checkpoint as in RAE.

OVIE/
├── assets/
│   ├── ovie.pt                        # evaluation
│   ├── dino_vit_small_patch8_224.pth  # training only
│   └── sample_image.jpg
├── configs/
│   └── config_ovie.yaml
├── models/
└── ...
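Since the two checkpoints are downloaded manually, a quick sanity check before launching evaluation or training can save a failed run. The helper below is a small sketch (not part of the repository) that assumes the assets/ layout shown above:

```python
from pathlib import Path

def missing_assets(root: str,
                   names=("ovie.pt", "dino_vit_small_patch8_224.pth")) -> list[str]:
    """Return the checkpoint filenames not present under root/assets/."""
    assets = Path(root) / "assets"
    return [n for n in names if not (assets / n).is_file()]

if __name__ == "__main__":
    missing = missing_assets(".")
    if missing:
        print(f"Missing checkpoints in assets/: {missing}")
```

Run it from the repository root; an empty list means both files are in place.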

🚀 Inference

We provide two Jupyter notebooks to get started quickly:

  • inference_huggingface.ipynb – weights downloaded automatically from kyutai/ovie
  • inference_local.ipynb – weights loaded from a local assets/ovie.pt checkpoint

Launch a notebook with:

uv run jupyter notebook inference_huggingface.ipynb

Loading from the Hugging Face Hub (recommended):

import torch
from models.models import OVIEModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = OVIEModel.from_pretrained("kyutai/ovie", revision="v1.0").to(device)
model.eval()
image_size = model.image_size  # 256, read from the saved config

Loading from a local checkpoint:

import yaml, torch
from models.models import OVIE_models

with open("./configs/config_ovie.yaml") as f:
    config = yaml.safe_load(f)

model_cfg = config["model"]
image_size = config["data"]["image_size"]

model = OVIE_models[model_cfg["model_type"]](
    image_size=image_size,
    vit_use_qknorm=model_cfg.get("use_qknorm", False),
    vit_use_swiglu=model_cfg.get("use_swiglu", True),
    vit_use_rope=model_cfg.get("use_rope", False),
    vit_use_rmsnorm=model_cfg.get("use_rmsnorm", True),
    vit_wo_shift=model_cfg.get("wo_shift", False),
    vit_use_checkpoint=model_cfg.get("use_checkpoint", False),
).to(device)

ckpt = torch.load("./assets/ovie.pt", map_location="cpu")
model.load_state_dict(ckpt["ema"])
model.eval()

Running inference:

from torchvision.transforms import ToTensor
from PIL import Image
from utils.pose_enc import extri_intri_to_pose_encoding

img_pil = Image.open("./assets/sample_image.jpg").convert("RGB").resize((image_size, image_size))
img_tensor = ToTensor()(img_pil).unsqueeze(0).to(device)

extrinsics = torch.tensor([[[1.0, 0.0, 0.0, -1.25],
                            [0.0, 1.0, 0.0,  0.5],
                            [0.0, 0.0, 1.0, -2.0]]], device=device)
dummy_intrinsics = torch.zeros(1, 1, 3, 3, device=device)

camera = extri_intri_to_pose_encoding(
    extrinsics=extrinsics.unsqueeze(0),
    intrinsics=dummy_intrinsics,
    image_size_hw=(image_size, image_size),
)
cam_token = camera[..., :7].squeeze(0)

with torch.no_grad():
    pred_tensor = model(x=img_tensor, cam_params=cam_token)
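To save or display the predicted view, the output tensor has to be converted back to an image. The snippet below is a sketch that assumes the model returns a (1, 3, H, W) float tensor with values in [0, 1] (check the notebooks for the exact output format); it works on a CPU NumPy array, so move the tensor off the GPU first:

```python
import numpy as np
from PIL import Image

def tensor_to_pil(pred) -> Image.Image:
    """Convert a (1, 3, H, W) float tensor/array in [0, 1] to a PIL image."""
    arr = np.asarray(pred, dtype=np.float32)
    arr = arr.squeeze(0)                 # (3, H, W)
    arr = np.transpose(arr, (1, 2, 0))   # (H, W, 3), channels last for PIL
    arr = np.clip(arr, 0.0, 1.0)         # guard against slight overshoot
    return Image.fromarray((arr * 255).round().astype(np.uint8))

# With the prediction from above:
# tensor_to_pil(pred_tensor.cpu().numpy()).save("novel_view.png")
```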

🧹 Data Preprocessing

Before training or evaluating on specific datasets, raw images must be preprocessed. We provide scripts for both in-the-wild training data and DL3DV evaluation data.

For in-the-wild training images:

uv run python data_preparation/preprocess_in_the_wild_images.py \
    --data_path /PATH/TO/RAW/DATASET \
    --output_path /PATH/TO/PREPROCESSED/DATASET

Then add the resulting output directories to the data_path lists in configs/config_ovie.yaml.
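For reference, a fragment of what this part of the config might look like, based only on the keys used elsewhere in this README (data.image_size and the data_path lists); consult configs/config_ovie.yaml for the authoritative schema, and note the paths here are placeholders:

```yaml
data:
  image_size: 256
  data_path:
    - /PATH/TO/PREPROCESSED/DATASET_A
    - /PATH/TO/PREPROCESSED/DATASET_B
```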

For DL3DV evaluation data:

uv run python data_preparation/format_dl3dv.py \
    --root_dir /PATH/TO/DL3DV \
    --output_dir /PATH/TO/PROCESSED/DL3DV

DL3DV can be downloaded from the official dataset repository.


๐Ÿ‹๏ธโ€โ™‚๏ธ Training

Once data is preprocessed and paths are set in the config, launch distributed training with torchrun:

uv run torchrun --nproc_per_node <number_of_gpus> train.py --config configs/config_ovie.yaml

📊 Evaluation

Use evaluate.py to run evaluation on benchmark datasets. It requires a local assets/ovie.pt checkpoint (see Model Weights).

Evaluating on Real Estate 10K (RE10K): The pre-processed RE10K dataset is available on Hugging Face: chenchenshi/re10k-sc.

uv run python evaluate.py \
    --dataset_path /PATH/TO/EVAL/DATASET \
    --config_path configs/config_ovie.yaml \
    --checkpoint_path assets/ovie.pt

🔧 Contributing

This project uses pre-commit hooks to enforce code style (ruff format + lint) and keep the lockfile in sync. CI runs the same checks on every push and pull request.

Install the hooks:

uv run pre-commit install

After this, ruff format, ruff check, and uv lock --check run automatically on every git commit. You can also run them manually across all files:

uv run pre-commit run --all-files

The uv.lock file is committed to the repository; do not remove it from version control.


๐Ÿค Acknowledgments and Citation

This project relies on fantastic open-source tools and models, including uv, the Hugging Face Hub, and the DINO ViT checkpoint from RAE.

If you find our work useful in your research, please consider citing:

@misc{ovie2026,
      title={One View Is Enough! Monocular Training for In-the-Wild Novel View Generation},
      author={Adrien Ramanana Rahary and Nicolas Dufour and Patrick Perez and David Picard},
      year={2026},
      eprint={2603.23488},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.23488},
}