Vision Banana | Google DeepMind

Overview

Vision Banana is a state-of-the-art unified model for both image understanding and generation.

Generative vision pretraining is an effective paradigm for visual understanding.

Image generation serves as a universal interface for diverse vision tasks.

Vision Banana overview: from generative pretraining to vision understanding

Capabilities

Semantic Segmentation

Prompt: This image is a per-pixel class labeling of the input. The macaron cakes are represented by (255, 255, 0). The round plates are represented by (255, 192, 128). The slice cakes are depicted in (64, 192, 64). The flowers are shown in (128, 0, 64). The tongs are (255, 0, 192).

Prompt: Generate a visualization image of semantic segmentation, using this color mapping: {"cat ears": <255, 165, 0>, "exit sign": <0, 0, 255>, "background": <125, 0, 125>}

Prompt: Conduct per-class semantic segmentation for the given image. The sitting person are represented by (255, 255, 0). The standing and walking people are represented by (255, 192, 128). The ocean is depicted in (64, 192, 64). The street lights are in (128, 0, 64). The sky is in (255, 0, 192). The fence is in (0, 0, 255). The backpack is in (255, 0, 0).
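Because the prediction is itself an image, turning it back into class labels needs a decoding step. The page does not describe that step, so the following is only a minimal sketch under an assumption: each generated pixel is assigned to the class whose prompt color is nearest in RGB space. The function name `decode_semantic_mask` and the tiny 2x2 "generated" image are hypothetical; the palette is copied from the second prompt above.

```python
from typing import Dict, List, Tuple

RGB = Tuple[int, int, int]


def decode_semantic_mask(image: List[List[RGB]], palette: Dict[str, RGB]) -> List[List[str]]:
    """Label each pixel with the class whose palette color is nearest in RGB space."""

    def nearest(pixel: RGB) -> str:
        # Squared Euclidean distance is enough for picking the minimum.
        return min(
            palette,
            key=lambda name: sum((p - c) ** 2 for p, c in zip(pixel, palette[name])),
        )

    return [[nearest(pixel) for pixel in row] for row in image]


# Color mapping copied from the second prompt above.
palette = {"cat ears": (255, 165, 0), "exit sign": (0, 0, 255), "background": (125, 0, 125)}

# Hypothetical 2x2 generated image whose colors are slightly off-palette.
generated = [
    [(250, 160, 5), (3, 2, 251)],
    [(120, 4, 130), (125, 0, 125)],
]

labels = decode_semantic_mask(generated, palette)
print(labels)
```

Nearest-color matching tolerates small color drift in the generated image. A real pipeline would likely use an array library instead of nested lists; plain Python is used here only to keep the sketch dependency-free.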

Instance Segmentation

Prompt: Generate an instance segmentation visualization of this image. Each piece of garlic is colored differently.

Prompt: Generate an instance segmentation visualization of the input image. Segment all the price on the price tags, color them differently.

Prompt: Generate an instance segmentation visualization of this image. Each price tag is colored differently.

Prompt: This image shows segmentation masks for the basketballs from the input image. The background is set to #10aa05. Each basketball instance is represented by a solid circular mask, and a different color is used for each mask.
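When each instance is drawn in its own color on a known background, separate masks can be recovered by grouping pixels by color. The sketch below illustrates that idea under a strong assumption: colors in the generated image are exact, so equal colors mean the same instance. The helper names `hex_to_rgb` and `split_instances` and the toy image are hypothetical; only the background color `#10aa05` comes from the last prompt above.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

RGB = Tuple[int, int, int]
Coord = Tuple[int, int]


def hex_to_rgb(code: str) -> RGB:
    """Convert a '#rrggbb' string into an (r, g, b) tuple."""
    code = code.lstrip("#")
    return (int(code[0:2], 16), int(code[2:4], 16), int(code[4:6], 16))


def split_instances(image: List[List[RGB]], background: RGB) -> Dict[RGB, Set[Coord]]:
    """Group non-background pixel coordinates (row, column) by their exact color."""
    masks: Dict[RGB, Set[Coord]] = defaultdict(set)
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            if pixel != background:
                masks[pixel].add((y, x))
    return dict(masks)


# Background color taken from the last prompt above.
BG = hex_to_rgb("#10aa05")

# Hypothetical 2x3 generated image containing two instances.
generated = [
    [BG, (255, 0, 0), (255, 0, 0)],
    [BG, BG, (0, 0, 255)],
]

instances = split_instances(generated, BG)
print(len(instances), "instances found")
```

Generated images usually contain some color noise and anti-aliased edges, so a practical version would probably cluster similar colors and split each cluster into connected components rather than rely on exact equality.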

Referring Expression Segmentation

Prompt: This image shows segmentation masks from the given image. The background is black color. The chef's names in both Chinese and English are rendered as cyan color.

Prompt: A segmentation map image. The stretching cat is rendered in green, the cat that is cleaning itself is in cyan.

Prompt: This image shows segmentation masks from the given image. The background is black color. The game control device is represented by a solid yellow.

Prompt: A segmentation map image. The area that corresponds to the man in pink t shirt is rendered solid white; the other man is rendered in green.

Prompt: A segmentation map of the input image. The pig not in the glass is rendered cyan, and the pig in the glass reflection is rendered yellow.

Monocular Metric Depth Estimation

Prompt: Predict the metric depth of this scene as an image. Visualized in the rainbow colormap.

Prompt (used for two examples): Predict the metric depth of this scene as an image. Visualized in the rainbow (black-red-yellow-green-cyan-blue-violet-white) color palette.

Prompt: Generate a metric depth map of the provided image.

Prompt (used for three examples): Generate a metric depth map of the input image.
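A depth map rendered with a color palette has to be inverted to get numbers back. The page does not give the exact encoding, so the sketch below rests on several assumptions: the palette is piecewise linear through the colors named in the prompt, the anchors are evenly spaced, and the RGB value chosen for violet is a guess. It recovers only a normalized value in [0, 1]; converting that to metres depends on the depth-to-color mapping, which is not specified here. `ANCHORS`, `encode`, and `decode` are hypothetical names.

```python
from typing import List, Tuple

RGB = Tuple[float, float, float]

# Anchor colors in the order named in the prompt
# (black-red-yellow-green-cyan-blue-violet-white).
# The exact RGB values, violet in particular, are assumptions.
ANCHORS: List[RGB] = [
    (0, 0, 0), (255, 0, 0), (255, 255, 0), (0, 255, 0),
    (0, 255, 255), (0, 0, 255), (128, 0, 255), (255, 255, 255),
]


def encode(value: float) -> RGB:
    """Map a normalized value in [0, 1] to a color on the piecewise-linear palette."""
    pos = min(max(value, 0.0), 1.0) * (len(ANCHORS) - 1)
    i = min(int(pos), len(ANCHORS) - 2)
    t = pos - i
    a, b = ANCHORS[i], ANCHORS[i + 1]
    return (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]), a[2] + t * (b[2] - a[2]))


def decode(pixel: RGB) -> float:
    """Project a color onto the nearest palette segment and return its normalized value."""
    best_value, best_dist = 0.0, float("inf")
    for i in range(len(ANCHORS) - 1):
        a, b = ANCHORS[i], ANCHORS[i + 1]
        d = [b[k] - a[k] for k in range(3)]
        # Position of the orthogonal projection along the segment, clamped to [0, 1].
        t = sum((pixel[k] - a[k]) * d[k] for k in range(3)) / sum(x * x for x in d)
        t = min(max(t, 0.0), 1.0)
        dist = sum((pixel[k] - (a[k] + t * d[k])) ** 2 for k in range(3))
        if dist < best_dist:
            best_value, best_dist = (i + t) / (len(ANCHORS) - 1), dist
    return best_value


# Round trip on a hypothetical normalized depth value.
recovered = decode(encode(0.37))
print(round(recovered, 6))
```

Projecting onto the nearest segment, rather than requiring an exact palette match, keeps the decoding usable when generated colors drift slightly off the palette.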

Surface Normal Estimation

Prompt: Predict the surface normal of this scene.

Prompt (used for four examples): Generate a surface normal map of the input image.
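A surface normal map stores a direction per pixel as a color. The page does not state which encoding is used, so this sketch assumes the widely used convention in which each channel in [0, 255] maps linearly to a component in [-1, 1]. It also includes the angular error between two unit normals, a common way to compare normal maps. `decode_normal` and `angular_error_deg` are hypothetical helper names.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def decode_normal(rgb: Tuple[int, int, int]) -> Vec3:
    """Map an 8-bit RGB color to a unit normal, assuming n = 2 * rgb / 255 - 1."""
    x, y, z = (2.0 * c / 255.0 - 1.0 for c in rgb)
    # Fall back to 1.0 to avoid dividing by zero for a degenerate vector.
    length = math.sqrt(x * x + y * y + z * z) or 1.0
    return (x / length, y / length, z / length)


def angular_error_deg(a: Vec3, b: Vec3) -> float:
    """Angle in degrees between two unit vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))


# The bluish color typical of a surface facing the camera under this convention.
normal = decode_normal((128, 128, 255))
print(angular_error_deg(normal, (0.0, 0.0, 1.0)))
```

Renormalizing after decoding matters because 8-bit quantization means the decoded vector is rarely exactly unit length.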

Results

Vision Banana achieves state-of-the-art results in the zero-shot transfer setting across 2D and 3D vision tasks.

2D Understanding

Compared methods: SegMan-L (Non Zero-Shot), APE-D, OpenSeeD, X-Decoder, SAM 3, Vision Banana.

Compared methods: SAM 3 (Non Zero-Shot), APE-D, OWLv2, Gemini 2.5, Vision Banana, DINO-X.

* Evaluated on 500 randomly sampled queries.

Compared methods: HyperSeg + Phi2 (Non Zero-Shot), X-SAM + Phi3 (Non Zero-Shot), HybridGL, Kang + LLaVA, SAM 3 + Gemini 2.5 Pro, Vision Banana.

Compared methods: X-SAM + Phi3 3.8B (Non Zero-Shot), LISA-13B-LLAVA1.5 (Non Zero-Shot), SegZero, RSVP + GPT-4o, SAM 3 + Gemini 2.5 Pro, Vision Banana + Gemini 2.5 Pro.

Methods paired with MLLMs for reasoning.

3D Understanding

Compared methods: Depth Pro, MoGe-2, UniK3D, Vision Banana.

Vision Banana does not use camera intrinsics in training or inference.

Compared methods: Marigold, StableNormal, DSINE, Lotus-2, Vision Banana.

Contributors

Project Leads

Valentin Gabeur* · Shangbang Long* · Songyou Peng*

* Equal contribution

Core Contributors

Paul Voigtlaender · Shuyang Sun · Yanan Bao · Karen Truong · Zhicheng Wang · Wenlei Zhou · Jonathan T. Barron · Kyle Genova · Nithish Kannen · Sherry Ben · Yandong Li · Mandy Guo · Suhas Yogin

Project Advisors

Yiming Gu · Huizhong Chen

Leadership Sponsors

Oliver Wang · Saining Xie · Howard Zhou · Kaiming He · Thomas Funkhouser · Jean-Baptiste Alayrac · Radu Soricut

Acknowledgements

We thank Xi Chen, Fei Xia, Kaushik Shivakumar, Abhishek Sinha, Phillip Lippe, Yilin Gao, Javier Rey, Sanghyun Woo, Renshen Wang, Wentao Yuan, Keran Rong, Rundi Wu, Manoj Kumar, Manli Shu, Francesco Piccinno, Ishita Dasgupta, Benigno Uria, Miki Rubinstein, Aäron van den Oord, and Jon Shlens for their helpful discussions, advice, and technical guidance.

BibTeX

@article{visionbanana2026,
  title={Image Generators are Generalist Vision Learners},
  author={Gabeur, Valentin and Long, Shangbang and Peng, Songyou and Voigtlaender, Paul and Sun, Shuyang and Bao, Yanan and Truong, Karen and Wang, Zhicheng and Zhou, Wenlei and Barron, Jonathan T and Genova, Kyle and Kannen, Nithish and Ben, Sherry and Li, Yandong and Guo, Mandy and Yogin, Suhas and Gu, Yiming and Chen, Huizhong and Wang, Oliver and Xie, Saining and Zhou, Howard and He, Kaiming and Funkhouser, Thomas and Alayrac, Jean-Baptiste and Soricut, Radu},
  journal={arXiv preprint arXiv:2604.20329},
  year={2026}
}