
RynnBrain: Open Embodied Foundation Models

💫 Project Page | Models & Bench 🤗 🤖 | 🚀 Demo | 📚 Cookbooks

📰 News

  • [2026.02.15] 🔥🔥 Release our Technical Report!!
  • [2026.02.09] 🔥🔥 Release our code and model checkpoints!!

Introduction

We present RynnBrain, an embodied foundation model grounded in physical reality. RynnBrain is available in two dense variants (2B and 8B) and one mixture-of-experts (MoE) model (30B-A3B). In addition, we release three post-trained models: RynnBrain-Plan (robot task planning), RynnBrain-Nav (vision-language navigation), and RynnBrain-CoP (chain-of-point reasoning).

🌟 Key Highlights

  • Comprehensive egocentric understanding: Excels in fine-grained video understanding and egocentric cognition, covering tasks such as embodied QA, counting, and OCR.
  • Diverse spatio-temporal localization: Possesses powerful localization capabilities across episodic memory, enabling precise identification of objects, target areas, and motion trajectories.
  • Physical-space reasoning: Employs an interleaved reasoning strategy that alternates between textual and spatial grounding, ensuring that its reasoning processes are firmly rooted in the physical environment.
  • Physics-aware precise planning: Integrates located affordances and object information into planning, enabling downstream VLA models to execute intricate tasks with fine-grained instructions.

Model Architecture

RynnBrain employs a unified encoder-decoder architecture (supporting both dense and MoE variants) to transform omni-vision inputs and textual instructions into multi-modal outputs, including spatial trajectories, physical pointing, and action planning. Through large-scale training on rich spatio-temporal, physical-space, and general-knowledge data, RynnBrain maintains robust general-purpose capabilities while specializing in diverse, fine-grained embodied reasoning and complex planning tasks.
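
For intuition, the sketch below shows the kinds of structured outputs this covers (pointing, trajectories, plan steps). It is a hypothetical illustration only: the field names and coordinate conventions are assumptions, and the cookbooks below define the actual output formats.

# Purely illustrative examples of the output modalities described above; the field
# names and coordinate conventions are assumptions, not the released output format.
pointing_output = {"object": "red mug", "point": [412, 288]}             # 2D pixel coordinate
trajectory_output = {"waypoints": [[100, 420], [180, 390], [260, 355]]}  # motion path in image space
plan_output = {
    "steps": [
        {"instruction": "move to the counter", "target_area": [380, 250, 460, 330]},  # x1, y1, x2, y2
        {"instruction": "grasp the mug handle", "affordance_point": [412, 288]},
    ]
}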

Performance

  • General Embodied Understanding
  • Robot Task Planning
  • Vision-Language Navigation

Model Zoo

| Model | Base Model | HuggingFace | ModelScope |
| --- | --- | --- | --- |
| RynnBrain-2B | Qwen3-VL-2B-Instruct | Link | Link |
| RynnBrain-8B | Qwen3-VL-8B-Instruct | Link | Link |
| RynnBrain-30B-A3B | Qwen3-VL-30B-A3B-Instruct | Link | Link |
| RynnBrain-CoP-8B | RynnBrain-8B | Link | Link |
| RynnBrain-Plan-8B | RynnBrain-8B | Link | Link |
| RynnBrain-Plan-30B-A3B | RynnBrain-30B-A3B | Link | Link |
| RynnBrain-Nav-8B | RynnBrain-8B | Link | Link |

Quick Start

Minimal dependencies:

pip install transformers==4.57.1

Run text generation:

from transformers import AutoModelForImageTextToText

# Pass the local path or hub id of a released RynnBrain checkpoint (see the Model Zoo above).
model = AutoModelForImageTextToText.from_pretrained("")
...
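
The snippet below expands this into a minimal end-to-end sketch of image-text inference. It assumes the released checkpoints expose the standard Qwen3-VL-style processor and chat template through transformers (the base models in the Model Zoo are Qwen3-VL-Instruct variants); the model path and image URL are placeholders to replace with your own.

from transformers import AutoModelForImageTextToText, AutoProcessor

model_path = "path/to/RynnBrain-8B"  # placeholder: local path or hub id of a released checkpoint

model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",  # requires the `accelerate` package
)
processor = AutoProcessor.from_pretrained(model_path)

# Build a single-turn multimodal conversation with one image and one question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/scene.jpg"},  # placeholder image
            {"type": "text", "text": "Which object on the table could be used to open the bottle?"},
        ],
    }
]

# Tokenize the conversation (including the image) with the model's chat template.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)[0]
print(answer)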

Cookbooks

Check out the cookbooks below, which showcase RynnBrain's capabilities in cognition, localization, reasoning, and planning.

| Category | Cookbook | Description |
| --- | --- | --- |
| Cognition | 1_spatial_understanding.ipynb | Demonstrates the model's spatial understanding of video scenes. |
| Cognition | 2_object_understanding.ipynb | Shows how the model understands object categories, attributes, relations, and counting. |
| Cognition | 3_ocr.ipynb | Examples of optical character recognition and text understanding in videos. |
| Location | 4_object_location.ipynb | Locates specific objects with bounding boxes in an image or video based on instructions. |
| Location | 5_area_location.ipynb | Identifies and marks specified regions with points in an image or video. |
| Location | 6_affordance_location.ipynb | Finds areas or objects with specific affordances in an image or video. |
| Location | 7_trajectory_location.ipynb | Infers and annotates trajectories or motion paths in an image or video. |
| Location | 8_grasp_pose.ipynb | Demonstrates the model's ability to predict robotic grasp poses from images. |
| Reasoning | 9_thinking_with_time_space.ipynb | Explores an interleaved reasoning mechanism that alternates between textual reasoning and spatial grounding. |
| Planning | 10_manipulate_planning.ipynb | Performs multi-step task decomposition and action planning from goals and scenes. |
| Planning | 11_visual_language_navigation.ipynb | Combines visual observations and language instructions to perform navigation and path planning. |

Training

Pretraining & Evaluation

Please refer to RynnScale for details of pretraining and evaluation.

Finetuning

  • Reasoning: RynnBrain introduces an interleaved reasoning approach that alternates between textual reasoning and spatial grounding directly within egocentric video streams. This paradigm bridges the cognitive gap between language and the physical world, ensuring that the reasoning process is anchored in reality.

  • Navigation: We train a vision-language navigation model on top of the RynnBrain base model. Empirically, fine-tuning on RynnBrain yields better navigation performance than fine-tuning on other foundation models.

  • Planning: RynnBrain integrates the locations of affordances, areas, and objects directly into its planning outputs. As a result, even highly intricate, fine-grained tasks can be handled by our hierarchical RynnBrain-VLA system architecture, sketched below.
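
The hierarchical RynnBrain-VLA setup mentioned in the planning item can be pictured as a simple planner-executor loop. The sketch below is illustrative only; plan_with_rynnbrain and execute_with_vla are hypothetical placeholders for the planner and a downstream VLA policy, not released APIs.

# A minimal, purely illustrative sketch of the hierarchical planner-executor loop.
# `plan_with_rynnbrain` and `execute_with_vla` are hypothetical placeholders.

def plan_with_rynnbrain(goal, observation):
    """Return grounded plan steps, e.g. {"instruction": ..., "affordance_point": [x, y]}."""
    raise NotImplementedError  # query the planning model here

def execute_with_vla(step, observation):
    """Condition a low-level VLA policy on the instruction and grounded coordinates."""
    raise NotImplementedError  # call the VLA model here

def run_task(goal, get_observation):
    observation = get_observation()
    for step in plan_with_rynnbrain(goal, observation):
        execute_with_vla(step, observation)
        observation = get_observation()  # re-observe after each executed step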

RynnBrain-Bench

We introduce RynnBrain-Bench, a multi-dimensional benchmark for embodied understanding that evaluates models across four key dimensions: object cognition, spatial cognition, grounding, and pointing. It emphasizes fine-grained understanding and spatio-temporal localization across episodic video sequences.

For details, please refer to RynnBrain-Bench.

💡 Some other multimodal-LLM projects from our team may interest you ✨.

RynnEC: Bringing MLLMs into Embodied World
Ronghao Dang*, Yuqian Yuan*, Yunxuan Mao*, Kehan Li*, Jiangpin Liu, Zhikai Wang, Fan Wang, Deli Zhao, Xin Li
GitHub | arXiv

RynnScale
RynnScale Team
GitHub

RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation
Yuming Jiang, Siteng Huang, Shengke Xue, Yaxi Zhao, Jun Cen, Sicong Leng, Kehan Li, Jiayan Guo, Kexiang Wang, Mingxiu Chen, Fan Wang, Deli Zhao, Xin Li
GitHub | arXiv

RynnVLA-002: A Unified Vision-Language-Action and World Model
Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, Fan Wang, Deli Zhao, Hao Chen
GitHub | arXiv

RynnRCP: Open Robotics Context Protocol and RobotMotion
RynnBot Team
GitHub

RynnMotion: All-In-One Toolkit for Fast Robot Prototyping and Heterogeneous Teleoperation
RynnBot Team
GitHub

Acknowledgement

Our RynnBrain is built on top of Qwen3-VL. We also learned a lot from the implementation of RynnEC and VideoRefer. If your work is used in RynnBrain but not mentioned in either this repo or the technical report, feel free to let us know ❤️.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.