Cosmos 3

Omnimodal World Models for Physical AI

Multiple modalities, many applications.
One single model.

Cosmos 3 connects understanding, generation, simulation, and action through a shared omnimodal world model that moves fluidly across text, images, video, audio, and actions.

Language

Image

Video

Audio

Action

Explore how Cosmos 3 couples different modalities with each capability.

Vision-Language Reasoning

Reason through the physical world.

Cosmos 3 grounds language in images and video, reading spatial relationships, temporal cues, object states, and actions as shared context for deeper physical reasoning.

I am decelerating and keeping my lane as I approach an intersection with traffic lights and other vehicles. The presence of traffic lights and vehicles ahead necessitates that I slow down to ensure safety and compliance with traffic rules. The lane markings indicate a straight path, and I am maintaining my lane position.

<think>

I will move my gripper from its current position at [490, 419] to the red flower at [390, 700] to grasp it. After securely picking up the flower, I will lift it and move it to the red bottle at [710, 605], positioning the gripper above the bottle’s opening at [710, 500] so I can place the flower inside. This trajectory gives me a direct and efficient path from the flower to the target container while avoiding obstacles on the wooden table.

The trajectory is: [490, 419], [388, 672], [411, 411], [690, 364], [690, 364]

</think>

  1. (490, 419) Start trajectory
  2. (388, 672) Move to flower
  3. (411, 411) Lift flower
  4. (690, 364) Move above bottle
  5. (690, 364) Place flower
  • (0.3, 3.4): "A humanoid robot with a sleek white and black design stands beside a red popcorn dispenser filled with golden popcorn. The robot uses its right arm to pick up a green paper cup from the table in front of it, preparing to fill it."
  • (3.4, 14.8): "The robot holds the green cup steady with its left arm while using its right arm to maneuver a metal scoop into the popcorn dispenser. It scoops popcorn twice, carefully transferring each portion into the cup and ensuring the cup is adequately filled."
  • (14.8, 18.7): "After filling the cup, the robot places it back on the table and returns the scoop to its original position inside the dispenser. It then retracts both arms slightly, completing the task with precision and efficiency."
  1. IMPOSSIBLE: (495, 92, 708, 272)
  2. IMPOSSIBLE: (497, 267, 712, 462)
  3. IMPOSSIBLE: (499, 462, 727, 662)
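
The timestamped captions above can be treated as (start, end) intervals and queried by time. The following is a minimal sketch, not part of Cosmos 3: the names `Segment`, `caption_at`, and `is_contiguous` are hypothetical, the time unit is assumed to be seconds, and the caption text is shortened for brevity.

```python
from typing import List, NamedTuple, Optional


class Segment(NamedTuple):
    start: float  # assumed to be seconds
    end: float
    caption: str


# Time spans copied from the captions above; caption text shortened here.
SEGMENTS = [
    Segment(0.3, 3.4, "picks up a green paper cup"),
    Segment(3.4, 14.8, "scoops popcorn into the cup"),
    Segment(14.8, 18.7, "returns the cup and the scoop"),
]


def caption_at(segments: List[Segment], t: float) -> Optional[str]:
    """Return the caption whose half-open span [start, end) contains t."""
    for seg in segments:
        if seg.start <= t < seg.end:
            return seg.caption
    return None


def is_contiguous(segments: List[Segment]) -> bool:
    """True when each span starts exactly where the previous one ends."""
    return all(a.end == b.start for a, b in zip(segments, segments[1:]))
```

Using half-open spans means a boundary time such as 3.4 belongs to exactly one segment, which avoids returning two captions for the same instant.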

Image Generation

Create scenes with physical detail.

Cosmos 3 turns language descriptions into realistic images, preserving object detail, spatial layout, and physical cues for downstream generation.

Audio-Visual Generation

Generate worlds with sight and sound.

Cosmos 3 creates physically aware video from text, images, or clips, and can pair motion with audio that follows visible events, source movement, and scene context.

Robot Policy

Turn perception into action.

Cosmos 3 adapts the same omnimodal backbone to follow instructions and translate visual context into purposeful planning and manipulation.

Forward Dynamics

Simulate the future world.

Cosmos 3 conditions on observations and controls to roll out future videos, helping agents preview outcomes for planning, evaluation, and data generation.
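
The idea of rolling a state forward under a sequence of controls can be illustrated with a toy loop. This sketch is an assumption-laden illustration and not the Cosmos 3 interface: `rollout`, `toy_step`, and the 2D `State` and `Control` types are hypothetical, and the stand-in dynamics simply add the control to the state, whereas the model described above predicts video frames.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, float]    # hypothetical 2D state, e.g. a gripper position
Control = Tuple[float, float]  # hypothetical control, e.g. a displacement


def rollout(
    state: State,
    controls: Sequence[Control],
    step: Callable[[State, Control], State],
) -> List[State]:
    """Apply `step` once per control and return every visited state."""
    states = [state]
    for control in controls:
        state = step(state, control)
        states.append(state)
    return states


def toy_step(state: State, control: Control) -> State:
    """Stand-in dynamics: the control is added to the state."""
    return (state[0] + control[0], state[1] + control[1])


future = rollout((0.0, 0.0), [(1.0, 0.0), (1.0, 2.0)], toy_step)
```

Passing the step function as an argument keeps the loop independent of the dynamics, so a learned predictor could in principle replace `toy_step` without changing `rollout`.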

Input camera

Input hand pose

Inverse Dynamics

Infer the action behind change.

Cosmos 3 reads observed states and recovers the trajectory or control that explains how the scene changed, turning video evidence into actions.
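
In its simplest form, recovering the action behind a change means differencing consecutive states. The sketch below is a toy illustration under that assumption, not how Cosmos 3 works internally: `infer_displacements` is a hypothetical helper, and the waypoints are the flower-to-bottle trajectory shown earlier on this page.

```python
from typing import List, Tuple

Point = Tuple[int, int]

# Waypoints copied from the flower-to-bottle trajectory shown earlier.
WAYPOINTS: List[Point] = [(490, 419), (388, 672), (411, 411), (690, 364), (690, 364)]


def infer_displacements(states: List[Point]) -> List[Point]:
    """Return the (dx, dy) step between each pair of consecutive states."""
    return [
        (x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(states, states[1:])
    ]


actions = infer_displacements(WAYPOINTS)
```

A sequence of N states yields N - 1 displacements, and the final (0, 0) step here reflects the repeated last waypoint.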

Input video

Reasoning + Generation

Plan first, then bring it to life.

Cosmos 3 can reason over an input scene, produce an actionable trajectory, and use that structured intent to generate the resulting physical interaction.

You are given the task "pick the Cosmos 3 Omni block from the bottom drawer and place it on the counter". First, identify all relevant objects and free spaces in the scene, and then specify the 2D trajectory your end-effector should follow in pixel space. Return the trajectory as (x, y) coordinates. Answer the question using the following format: <think> Your reasoning. </think> Write the final 2D end-effector trajectory immediately after the </think> tag.

<think>
I can see the Cosmos 3 Omni block inside the bottom drawer, which is the object I need to pick up. My gripper is positioned to reach into the drawer and grasp the block. The bottom drawer provides the storage space where the block is located, and I need to access it to complete the task. Once I retrieve the block, I will place it on the counter surface, which is the designated target area. The open space inside the bottom drawer allows me to reach and grasp the block, while the clear surface area on the counter near the drawer opening provides the necessary space to place the block after retrieval. The other objects on the counter are not directly involved in the task but help define the workspace layout.

Based on this analysis, I need to locate the robot gripper, the Cosmos 3 Omni block, the bottom drawer, the counter surface, the yellow-and-black forklift, the white SUV, the white truck, and the robot with the "Physical AI Builder" sign. Their coordinates are as follows: robot gripper (514, 769), Cosmos 3 Omni block (471, 780), bottom drawer (400, 760), counter surface (460, 310), yellow-and-black forklift (235, 304), white SUV (605, 105), white truck (720, 163), robot with "Physical AI Builder" sign (790, 254). The open space inside the bottom drawer is at (380, 760), and the clear surface area on the counter near the drawer opening is at (400, 760). These free-space locations are the relevant points for reaching and placing the block.

I will move my gripper from its current position at (514, 769) to the Cosmos 3 Omni block at (471, 780) to grasp it. After securing the block, I will lift it and move it upward and slightly forward to clear the bottom drawer, which is located at (400, 760). Then I will carry the block to the counter surface at (460, 310), making sure to place it on the clear area near the drawer opening at (400, 760) to complete the task.

The planned trajectory is as follows: [514, 769], [507, 783], [500, 471], [464, 278], [460, 275]

</think>

  1. (514, 769) Move the gripper to the Cosmos 3 Omni block
  2. (507, 783) Grasp the Cosmos 3 Omni block
  3. (500, 471) Lift the block out of the bottom drawer
  4. (464, 278) Move the block toward the counter surface
  5. (460, 275) Place the block on the counter surface
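
The prompt above asks for reasoning inside <think> tags followed by the final trajectory. A response in that shape could be split with a small parser such as the one below. This is a sketch under stated assumptions: `parse_response` and `PAIR` are hypothetical names, coordinates are assumed to be non-negative integers, and the sample string is a shortened stand-in rather than verbatim model output.

```python
import re
from typing import List, Tuple

# Matches an integer pair written as (x, y) or [x, y].
PAIR = re.compile(r"[\[(]\s*(\d+)\s*,\s*(\d+)\s*[\])]")


def parse_response(text: str) -> Tuple[str, List[Tuple[int, int]]]:
    """Split a response into its reasoning and the trajectory after </think>."""
    match = re.search(r"<think>(.*?)</think>(.*)", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("response has no <think>...</think> block")
    reasoning = match.group(1).strip()
    answer = match.group(2)
    trajectory = [(int(x), int(y)) for x, y in PAIR.findall(answer)]
    return reasoning, trajectory


sample = (
    "<think>The block is in the bottom drawer.</think>\n"
    "(514, 769), (507, 783), (500, 471), (464, 278), (460, 275)"
)
reasoning, trajectory = parse_response(sample)
```

Only the text after the closing tag is scanned for waypoints, so coordinates mentioned while reasoning (object locations, free space) are not mistaken for the final path.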

Top open foundation for Physical AI.

Cosmos 3 brings leading reasoning, generation, and action performance into open models researchers and builders can inspect, adapt, and deploy.

Reasoning

Leading Open Reasoner for Physical AI

Cosmos 3 ranks #1 among open models on Robotics, Smart Space, and Driving benchmark averages, showing strong physical-world understanding.

Generation

Leading Open Generator for Physical AI

Cosmos 3 ranks #1 among open models for text-to-image, image-to-video, and robot policy across R-Bench, Artificial Analysis, RoboLab, and RoboArena benchmarks.

Learn more and get started with Cosmos 3.