Vision Banana | Google DeepMind


Overview

🏆

Vision Banana is a state-of-the-art unified model for both image understanding and generation.

🧠

Generative vision pretraining is an effective paradigm for visual understanding.

🔗

Image generation serves as a universal interface for diverse vision tasks.
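
To make the idea of image generation as an interface concrete, the sketch below shows one way a generated RGB image could be read back as a semantic label map: each pixel is assigned the class whose reference color is nearest. The palette, the class names, and the function names are hypothetical and chosen purely for illustration; this page does not describe how Vision Banana's outputs are actually decoded.

```python
# Illustrative sketch only: PALETTE, nearest_class, and decode_mask are
# assumptions made for this example, not part of Vision Banana.
from typing import List, Tuple

RGB = Tuple[int, int, int]

# Hypothetical class-to-color palette.
PALETTE = {
    "background": (0, 0, 0),
    "person": (255, 0, 0),
    "car": (0, 0, 255),
}


def nearest_class(pixel: RGB) -> str:
    """Return the palette class whose color is closest in squared RGB distance."""
    return min(
        PALETTE,
        key=lambda name: sum((p - c) ** 2 for p, c in zip(pixel, PALETTE[name])),
    )


def decode_mask(image: List[List[RGB]]) -> List[List[str]]:
    """Map every pixel of a generated RGB image to a class name."""
    return [[nearest_class(px) for px in row] for row in image]


# A tiny 2x2 stand-in for a generated mask image, with slightly noisy colors.
generated = [
    [(250, 5, 3), (2, 1, 0)],
    [(10, 4, 240), (255, 0, 0)],
]
labels = decode_mask(generated)
```

Nearest-color matching tolerates small deviations from the exact palette colors, which is why it is used here instead of an exact lookup.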

Vision Banana overview: from generative pretraining to vision understanding

Capabilities

Semantic Segmentation

[Image gallery: inputs with generated segmentation masks]

Instance Segmentation

[Image gallery: inputs with generated instance masks]

Referring Expression Segmentation

[Image gallery: inputs with generated masks of the referred objects]

Monocular Metric Depth Estimation

[Image gallery: inputs with generated depth maps]

Surface Normal Estimation

[Image gallery: inputs with generated surface normal maps]
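
A common convention for visualizing surface normals maps each vector component from [-1, 1] to an 8-bit color channel. Assuming a generated normal map follows that convention (this page does not state which encoding Vision Banana uses), a pixel could be converted back to a unit vector roughly as follows. The function name is hypothetical.

```python
# Illustrative sketch only: assumes the common encoding c = (n + 1) / 2 * 255,
# which is not confirmed for Vision Banana.
import math
from typing import Tuple


def rgb_to_normal(rgb: Tuple[int, int, int]) -> Tuple[float, float, float]:
    """Decode an 8-bit RGB pixel into a unit normal via n = 2 * c / 255 - 1."""
    x, y, z = (2.0 * c / 255.0 - 1.0 for c in rgb)
    length = math.sqrt(x * x + y * y + z * z)
    if length == 0.0:
        raise ValueError("pixel does not encode a direction")
    # Renormalize, since quantization to 8 bits breaks exact unit length.
    return (x / length, y / length, z / length)


# The typical "flat surface" color decodes to a vector almost exactly along +z.
normal = rgb_to_normal((128, 128, 255))
```

The renormalization step matters in practice because a generated image is quantized, and possibly noisy, so the decoded vector is rarely exactly unit length.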

Results

Vision Banana achieves state-of-the-art results in the zero-shot transfer setting across 2D and 3D vision tasks.

2D Understanding

Methods compared (chart): SegMan-L (Non Zero-Shot)  ·  APE-D  ·  OpenSeeD  ·  X-Decoder  ·  SAM 3  ·  Vision Banana 🍌

Methods compared (chart): SAM 3 (Non Zero-Shot)  ·  APE-D  ·  OWLv2  ·  Gemini 2.5  ·  Vision Banana 🍌  ·  DINO-X

* Evaluated on 500 randomly sampled queries.

Methods compared (chart): HyperSeg + Phi2 (Non Zero-Shot)  ·  X-SAM + Phi3 (Non Zero-Shot)  ·  HybridGL  ·  Kang + LLaVA  ·  SAM 3 + Gemini 2.5 Pro  ·  Vision Banana 🍌

Methods compared (chart): X-SAM + Phi3 3.8B (Non Zero-Shot)  ·  LISA-13B-LLAVA1.5 (Non Zero-Shot)  ·  SegZero  ·  RSVP + GPT-4o  ·  SAM 3 + Gemini 2.5 Pro  ·  Vision Banana 🍌 + Gemini 2.5 Pro

Methods paired with MLLMs for reasoning.

3D Understanding

Methods compared (chart): Depth Pro  ·  MoGe-2  ·  UniK3D  ·  Vision Banana 🍌

Vision Banana does not use camera intrinsics in training or inference.

Methods compared (chart): Marigold  ·  StableNormal  ·  DSINE  ·  Lotus-2  ·  Vision Banana 🍌

Contributors


Project Leads

Valentin Gabeur*  ·  Shangbang Long*  ·  Songyou Peng*

* Equal contribution

Core Contributors

Paul Voigtlaender  ·  Shuyang Sun  ·  Yanan Bao  ·  Karen Truong  ·  Zhicheng Wang  ·  Wenlei Zhou  ·  Jonathan T. Barron  ·  Kyle Genova  ·  Nithish Kannen  ·  Sherry Ben  ·  Yandong Li  ·  Mandy Guo  ·  Suhas Yogin

Project Advisors

Yiming Gu  ·  Huizhong Chen

Leadership Sponsors

Oliver Wang  ·  Saining Xie  ·  Howard Zhou  ·  Kaiming He  ·  Thomas Funkhouser  ·  Jean-Baptiste Alayrac  ·  Radu Soricut

Acknowledgements

We thank Xi Chen, Fei Xia, Kaushik Shivakumar, Abhishek Sinha, Phillip Lippe, Yilin Gao, Javier Rey, Sanghyun Woo, Renshen Wang, Wentao Yuan, Keran Rong, Rundi Wu, Manoj Kumar, Manli Shu, Francesco Piccinno, Ishita Dasgupta, Benigno Uria, Miki Rubinstein, Aäron van den Oord, and Jon Shlens for their helpful discussions, advice, and technical guidance.

BibTeX

@article{visionbanana2026,
  title={Image Generators are Generalist Vision Learners},
  author={Gabeur, Valentin and Long, Shangbang and Peng, Songyou and Voigtlaender, Paul and Sun, Shuyang and Bao, Yanan and Truong, Karen and Wang, Zhicheng and Zhou, Wenlei and Barron, Jonathan T and Genova, Kyle and Kannen, Nithish and Ben, Sherry and Li, Yandong and Guo, Mandy and Yogin, Suhas and Gu, Yiming and Chen, Huizhong and Wang, Oliver and Xie, Saining and Zhou, Howard and He, Kaiming and Funkhouser, Thomas and Alayrac, Jean-Baptiste and Soricut, Radu},
  journal={arXiv preprint arXiv:2604.20329},
  year={2026}
}