
RynnBrain: Open Embodied Foundation Models

💫 Project Page | Models & Bench 🤗 🤖 | 🚀 Demo | 📚 Cookbooks

📰 News

  • [2026.02.15] 🔥🔥 Release our Technical Report!!
  • [2026.02.09] 🔥🔥 Release our code and model checkpoints!!

Introduction

We present RynnBrain, an embodied foundation model grounded in physical reality. RynnBrain is available in two dense variants (2B and 8B) and one mixture-of-experts (MoE) model (30B-A3B). In addition, we release three post-trained models: RynnBrain-Plan (robot task planning), RynnBrain-Nav (vision-language navigation), and RynnBrain-CoP (chain-of-point reasoning).

🌟 Key Highlights

  • Comprehensive egocentric understanding: Excels in fine-grained video understanding and egocentric cognition, covering tasks such as embodied QA, counting, and OCR.
  • Diverse spatio-temporal localization: Possesses powerful localization capabilities across episodic memory, enabling precise identification of objects, target areas, and motion trajectories.
  • Physical-space reasoning: Employs an interleaved reasoning strategy that alternates between textual and spatial grounding, ensuring that its reasoning processes are firmly rooted in the physical environment.
  • Physics-aware precise planning: Integrates located affordances and object information into planning, enabling downstream VLA models to execute intricate tasks with fine-grained instructions.

Model Architecture

RynnBrain employs a unified encoder-decoder architecture (supporting both dense and MoE variants) to transform omni-vision inputs and textual instructions into multi-modal outputs, including spatial trajectories, physical pointing, and action planning. Through large-scale training on rich spatio-temporal, physical-space, and general-knowledge data, RynnBrain maintains robust general-purpose capabilities while specializing in diverse, fine-grained embodied reasoning and complex planning tasks.
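
For intuition, the sketch below shows the kinds of structured outputs this covers (pointing, trajectories, plan steps). It is a hypothetical illustration only: the field names and coordinate conventions are assumptions, and the cookbooks below define the actual output formats.

# Purely illustrative examples of the output modalities described above; the field
# names and coordinate conventions are assumptions, not the released output format.
pointing_output = {"object": "red mug", "point": [412, 288]}             # 2D pixel coordinate
trajectory_output = {"waypoints": [[100, 420], [180, 390], [260, 355]]}  # motion path in image space
plan_output = {
    "steps": [
        {"instruction": "move to the counter", "target_area": [380, 250, 460, 330]},  # x1, y1, x2, y2
        {"instruction": "grasp the mug handle", "affordance_point": [412, 288]},
    ]
}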

Performance

  • General Embodied Understanding
  • Robot Task Planning
  • Vision-Language Navigation

Model Zoo

| Model | Base Model | HuggingFace | ModelScope |
| --- | --- | --- | --- |
| RynnBrain-2B | Qwen3-VL-2B-Instruct | Link | Link |
| RynnBrain-8B | Qwen3-VL-8B-Instruct | Link | Link |
| RynnBrain-30B-A3B | Qwen3-VL-30B-A3B-Instruct | Link | Link |
| RynnBrain-CoP-8B | RynnBrain-8B | Link | Link |
| RynnBrain-Plan-8B | RynnBrain-8B | Link | Link |
| RynnBrain-Plan-30B-A3B | RynnBrain-30B-A3B | Link | Link |
| RynnBrain-Nav-8B | RynnBrain-8B | Link | Link |

Quick Start

Minimal dependencies:

pip install transformers==4.57.1

Run text generation:

from transformers import AutoModelForImageTextToText

# Pass the local path or hub id of a released RynnBrain checkpoint (see the Model Zoo above).
model = AutoModelForImageTextToText.from_pretrained("")
...
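
The snippet below expands this into a minimal end-to-end sketch of image-text inference. It assumes the released checkpoints expose the standard Qwen3-VL-style processor and chat template through transformers (the base models in the Model Zoo are Qwen3-VL-Instruct variants); the model path and image URL are placeholders to replace with your own.

from transformers import AutoModelForImageTextToText, AutoProcessor

model_path = "path/to/RynnBrain-8B"  # placeholder: local path or hub id of a released checkpoint

model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",  # requires the `accelerate` package
)
processor = AutoProcessor.from_pretrained(model_path)

# Build a single-turn multimodal conversation with one image and one question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/scene.jpg"},  # placeholder image
            {"type": "text", "text": "Which object on the table could be used to open the bottle?"},
        ],
    }
]

# Tokenize the conversation (including the image) with the model's chat template.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)[0]
print(answer)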

Cookbooks

Check out the cookbooks below, which showcase RynnBrain's capabilities in cognition, localization, reasoning, and planning.

| Category | Cookbook | Description |
| --- | --- | --- |
| Cognition | 1_spatial_understanding.ipynb | Demonstrates the model's spatial understanding of video scenes. |
| Cognition | 2_object_understanding.ipynb | Shows how the model understands object categories, attributes, relations, and counting. |
| Cognition | 3_ocr.ipynb | Examples of optical character recognition and text understanding in videos. |
| Location | 4_object_location.ipynb | Locates specific objects with bounding boxes in an image or video based on instructions. |
| Location | 5_area_location.ipynb | Identifies and marks specified regions with points in an image or video. |
| Location | 6_affordance_location.ipynb | Finds areas or objects with specific affordances in an image or video. |
| Location | 7_trajectory_location.ipynb | Infers and annotates trajectories or motion paths in an image or video. |
| Location | 8_grasp_pose.ipynb | Demonstrates the model's ability to predict robotic grasp poses from images. |
| Reasoning | 9_thinking_with_time_space.ipynb | Explores an interleaved reasoning mechanism that alternates between textual reasoning and spatial grounding. |
| Planning | 10_manipulate_planning.ipynb | Performs multi-step task decomposition and action planning from goals and scenes. |
| Planning | 11_visual_language_navigation.ipynb | Combines visual observations and language instructions to perform navigation and path planning. |

Training

Pretraining & Evaluation

Please refer to RynnScale for details of pretraining and evaluation.

Finetuning

  • Reasoning: RynnBrain introduces an interleaved reasoning approach that alternates between textual reasoning and spatial grounding directly within egocentric video streams. This paradigm bridges the cognitive gap between language and the physical world, ensuring that the reasoning process is anchored in reality.

  • Navigation: We train a vision-language navigation model on top of the RynnBrain base model. Empirically, fine-tuning on RynnBrain yields better navigation performance than fine-tuning on other foundation models.

  • Planning: RynnBrain integrates the locations of affordances, areas, and objects directly into its planning outputs. As a result, even highly intricate, fine-grained tasks can be handled by our hierarchical RynnBrain-VLA system architecture, sketched below.
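
The hierarchical RynnBrain-VLA setup mentioned in the planning item can be pictured as a simple planner-executor loop. The sketch below is illustrative only; plan_with_rynnbrain and execute_with_vla are hypothetical placeholders for the planner and a downstream VLA policy, not released APIs.

# A minimal, purely illustrative sketch of the hierarchical planner-executor loop.
# `plan_with_rynnbrain` and `execute_with_vla` are hypothetical placeholders.

def plan_with_rynnbrain(goal, observation):
    """Return grounded plan steps, e.g. {"instruction": ..., "affordance_point": [x, y]}."""
    raise NotImplementedError  # query the planning model here

def execute_with_vla(step, observation):
    """Condition a low-level VLA policy on the instruction and grounded coordinates."""
    raise NotImplementedError  # call the VLA model here

def run_task(goal, get_observation):
    observation = get_observation()
    for step in plan_with_rynnbrain(goal, observation):
        execute_with_vla(step, observation)
        observation = get_observation()  # re-observe after each executed step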

RynnBrain-Bench

We introduce RynnBrain-Bench, a multi-dimensional benchmark for embodied understanding that evaluates models across four key dimensions: object cognition, spatial cognition, grounding, and pointing. It emphasizes fine-grained understanding and spatio-temporal localization across episodic video sequences.

For details, please refer to RynnBrain-Bench.

💡 Some other multimodal-LLM projects from our team may interest you ✨.

RynnEC: Bringing MLLMs into Embodied World
Ronghao Dang*, Yuqian Yuan*, Yunxuan Mao*, Kehan Li*, Jiangpin Liu, Zhikai Wang, Fan Wang, Deli Zhao, Xin Li
GitHub | arXiv

RynnScale
RynnScale Team
GitHub

RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation
Yuming Jiang, Siteng Huang, Shengke Xue, Yaxi Zhao, Jun Cen, Sicong Leng, Kehan Li, Jiayan Guo, Kexiang Wang, Mingxiu Chen, Fan Wang, Deli Zhao, Xin Li
GitHub | arXiv

RynnVLA-002: A Unified Vision-Language-Action and World Model
Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, Fan Wang, Deli Zhao, Hao Chen
GitHub | arXiv

RynnRCP: Open Robotics Context Protocol and RobotMotion
RynnBot Team
GitHub

RynnMotion: All-In-One Toolkit for Fast Robot Prototyping and Heterogeneous Teleoperation
RynnBot Team
GitHub

Acknowledgement

Our RynnBrain is built on top of Qwen3-VL. We also learned a lot from the implementation of RynnEC and VideoRefer. If your work is used in RynnBrain but not mentioned in either this repo or the technical report, feel free to let us know ❤️.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.