VideoGameBench


Here are some examples of VLM agents playing games on VideoGameBench. These clips demonstrate both the capabilities and current limitations of state-of-the-art vision-language models when faced with real-time gaming challenges.

Example 1. Gemini 2.5 Pro plays Kirby's Dream Land in real time, successfully navigating the initial level and reaching the first mini-boss encounter. The agent demonstrates basic platforming and enemy interaction.

Example 2. Gemini 2.5 Pro plays Civilization I in real time, showing weak strategic planning and resource management. The agent makes suboptimal military decisions, resulting in a rapid defeat against Napoleon's forces.

Example 3. Gemini 2.5 Pro explores The Legend of Zelda: Link's Awakening, wandering aimlessly while searching for Link's sword. This demonstrates challenges in objective-oriented navigation and game state understanding.

Example 4. Claude Sonnet 3.7 attempts The Incredible Machine, a puzzle game requiring precise object placement and physics understanding. The agent exhibits difficulties with accurate cursor control and spatial reasoning.

Example 5. GPT-4o plays Pokémon Crystal, successfully selecting Cyndaquil as its starter Pokémon but subsequently losing track of the primary objective. This highlights issues with long-term memory and goal persistence in complex RPG environments.

Example 6. Our VG-Agent (using GPT-4o) plays Doom II (easiest difficulty) on VideoGameBench Lite, where the environment pauses while the agent thinks. The agent demonstrates basic combat abilities and navigation but struggles with complex strategic decisions.
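
To make the "Lite" setting in Example 6 concrete, here is a minimal sketch of the agent loop, assuming a hypothetical Emulator and VLMAgent interface (these names and methods are illustrative placeholders, not the actual VideoGameBench code or API). In the real-time setting the game keeps advancing while the model runs inference; in the Lite setting the emulator is paused until the agent returns an action.

```python
import time


class Emulator:
    """Placeholder for a game emulator that can be paused and resumed."""

    def __init__(self) -> None:
        self.paused = False

    def screenshot(self) -> bytes:
        # In a real setup this would return the current frame.
        return b""

    def press(self, button: str) -> None:
        # In a real setup this would send the chosen input to the game.
        pass

    def pause(self) -> None:
        self.paused = True

    def resume(self) -> None:
        self.paused = False


class VLMAgent:
    """Placeholder for a vision-language model that maps a frame to an action."""

    def decide(self, frame: bytes) -> str:
        time.sleep(1.0)  # stands in for model inference latency
        return "A"


def run_realtime(emulator: Emulator, agent: VLMAgent, steps: int) -> None:
    # Real-time setting: the game keeps running while the model thinks,
    # so slow inference translates into missed frames and late inputs.
    for _ in range(steps):
        action = agent.decide(emulator.screenshot())
        emulator.press(action)


def run_lite(emulator: Emulator, agent: VLMAgent, steps: int) -> None:
    # Lite setting: the emulator is paused during inference, removing the
    # reaction-time penalty and isolating decision quality.
    for _ in range(steps):
        frame = emulator.screenshot()
        emulator.pause()
        action = agent.decide(frame)
        emulator.resume()
        emulator.press(action)
```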

As demonstrated in these examples, current state-of-the-art VLMs exhibit varying degrees of competency across different game genres. While they can perform basic interactions such as movement, menu navigation, and simple combat, they consistently struggle with higher-order cognitive tasks including strategic planning, spatial reasoning, objective maintenance, and adaptive problem-solving. These limitations underscore the substantial gap between current AI capabilities and human-level performance in complex interactive environments.