Introducing Odyssey-2 Max: Scaled World Simulation

World models are a new form of multimodal intelligence, distinct from language, image, and video models. They learn to reason about how the world evolves by training directly on visual observations of real-world action, rather than on its compressed reflection in text. We believe this is true multimodal intelligence.

The defining capability of world models is their ability to simulate open-ended futures via continuous, interactive rollouts that evolve with actions in real time. Bidirectional video models like Sora, Veo, Kling, and Runway cannot do this. They generate past, present, and future jointly from a prompt fixed in advance—a structure that rules out real-time interaction, since future frames would have to condition on actions the user has not yet taken. A world model must instead be causal, predicting each state from prior states and actions. This autoregressive formulation is the foundation of the Odyssey-2 series.
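The causal structure described above can be illustrated with a minimal sketch. This is not Odyssey's implementation: `toy_step`, the `(position, velocity)` state, and the scalar acceleration action are hypothetical stand-ins for a learned model, a video frame, and a user input. The point is only the shape of the loop, where each state is predicted from prior states and the current action, and is emitted before the next action is known.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

State = Tuple[float, float]  # hypothetical toy state: (position, velocity)
Action = float               # hypothetical toy action: acceleration


def toy_step(history: List[State], action: Action, dt: float = 0.1) -> State:
    """Stand-in for a learned model: next state from prior states and one action."""
    pos, vel = history[-1]          # a real model could attend to the full history
    new_vel = vel + action * dt
    new_pos = pos + new_vel * dt
    return (new_pos, new_vel)


def rollout(
    step: Callable[[List[State], Action], State],
    initial: State,
    actions: Iterable[Action],
) -> Iterator[State]:
    """Causal autoregressive rollout: consume one action, emit one state."""
    history = [initial]
    for action in actions:          # actions may arrive lazily, e.g. from user input
        nxt = step(history, action) # depends only on past states and this action
        history.append(nxt)         # the prediction becomes context for the next step
        yield nxt


if __name__ == "__main__":
    for state in rollout(toy_step, (0.0, 0.0), [1.0, 1.0, 0.0]):
        print(state)
```

Because `rollout` is a generator over an action stream, the first emitted state is identical whatever actions come later. A model that generates all frames jointly cannot offer that guarantee, which is why it cannot respond to inputs in real time.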

Rollout coherence requires world models to learn physics: the model must remain stable as it rolls forward step by step, or it will drift or collapse. This pressure forces the model to internalize how objects move, interact, and change, yielding an implicit simulation of physical processes as a consequence of next-state prediction. As these models scale, the quality of this simulation increases, enabling applications in science, robotics, gaming, defense, and healthcare.
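To see why step-by-step rollouts put pressure on a model, consider a toy sketch. The scalar linear system and the gain values here are assumptions chosen for illustration, not anything taken from Odyssey-2. The "model" has a small, constant one-step error, but because it consumes its own predictions, that error compounds over the rollout.

```python
from typing import List


def one_step_errors(true_gain: float, model_gain: float, x0: float, steps: int) -> List[float]:
    """Error when the model always predicts from the TRUE previous state."""
    x_true = x0
    errors = []
    for _ in range(steps):
        prediction = model_gain * x_true      # model sees ground truth as input
        x_true = true_gain * x_true
        errors.append(abs(prediction - x_true))
    return errors


def rollout_errors(true_gain: float, model_gain: float, x0: float, steps: int) -> List[float]:
    """Error when the model predicts from ITS OWN previous output."""
    x_true = x_model = x0
    errors = []
    for _ in range(steps):
        x_true = true_gain * x_true
        x_model = model_gain * x_model        # model consumes its own prediction
        errors.append(abs(x_model - x_true))
    return errors


if __name__ == "__main__":
    # Hypothetical values: true dynamics hold x constant, the model is off by 1% per step.
    single = one_step_errors(1.0, 1.01, 1.0, 100)
    rolled = rollout_errors(1.0, 1.01, 1.0, 100)
    print(f"one-step error at step 100: {single[-1]:.4f}")
    print(f"rollout error at step 100:  {rolled[-1]:.4f}")
```

In this toy setting the one-step error stays flat while the rollout error grows with every step, so a model that looks accurate one frame at a time can still drift badly over a long rollout.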

Odyssey-2 Max achieves the highest physics score among evaluated world models, all while running in real time. To evaluate the physical accuracy of world models, we follow common practice and use VBench 2, a benchmark designed to assess the faithfulness of generated video. Specifically, its physics sub-score assesses how accurately a model handles mechanics, thermotics, materials, and multi-view consistency. We also evaluate on the physics modelling subset of the commonly used Physical AI benchmark.