Vending-Bench: Testing long-term coherence in agents

How do agents act over very long horizons? We answer this by letting agents manage a simulated vending machine business. The agents need to handle ordering, inventory management, and pricing over long context horizons to successfully make money.

Vending-Bench 2 released

November 18, 2025

Vending-Bench is now deprecated in favor of our new benchmark for agentic long-term coherence, creatively named Vending-Bench 2. We have added GPT-5.1 and Gemini 3 Pro results to the old Vending-Bench leaderboard on this page, but from now on Vending-Bench 2 will carry the torch on its own.

Leaderboard

| # | Model | Net worth (mean) | Net worth (min) | Units sold (mean) | Units sold (min) | Days until sales stop (mean) | Days until sales stop (% of run) |
|---|---|---|---|---|---|---|---|
| 1 | Gemini 3 Pro (new) | $4387.93 | $3769.70 | 4199 | 3165 | 239 | 94.6% |
| 2 | Grok 4 | $4694.15 | $3333.28 | 4569 | 3515 | 324 | 99.5% |
| 3 | GPT-5 | $3578.90 | $2690.29 | 2471 | 1798 | 363 | 100% |
| 4 | GPT-5.1 (new) | $2379.88 | $1424.90 | 2104 | 1166 | 136 | 100% |
| 5 | Claude Opus 4 | $2077.41 | $1249.56 | 1412 | 1218 | 132 | 99.5% |
| 6 | Claude Sonnet 4.5 | $2465.02 | $1095.25 | 3099 | 1907 | 350 | 77.8% |
| 7 | Human* | $844.05 | $844.05 | 344 | 344 | 67 | 100% |
| 8 | Gemini 2.5 Pro (preview-03-25) | $789.34 | $691.68 | 356 | 313 | 68 | 89% |
| 9 | o3 | $1843.11 | $569.00 | 1363 | 331 | 112 | 86.9% |
| 10 | Claude 3.5 Sonnet | $2217.93 | $476.00 | 1560 | 0 | 102 | 82.2% |

Net worth > $500 (the starting balance) marks a profitable run; rows appear to be ordered by net worth (min).

* Human baseline is one sample only (models are 5)

The eval

Vending-Bench is a simulated environment that tests how well AI models can manage a simple but long-running business scenario: operating a vending machine. The AI agent must keep track of inventory, place orders, set prices, and cover daily fees - individually easy tasks that, over time, push the limits of an AI’s ability to stay consistent and make intelligent decisions.
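One way to picture the scoring: a minimal sketch of the end-of-day bookkeeping the agent has to stay on top of, assuming net worth is cash plus inventory valued at cost (our assumption, not stated explicitly here) and the flat $2/day fee mentioned later in this post. All names are hypothetical.

```python
# A minimal sketch of the daily accounting in a Vending-Bench-style
# simulation. Assumptions (ours, not from the post): net worth counts
# inventory at cost, and the $2/day fee is the only recurring cost.

DAILY_FEE = 2.00

def end_of_day(state):
    """Advance the simulation one day and settle the books."""
    state["cash"] += state.pop("todays_sales", 0.0)  # revenue from the machine
    state["cash"] -= DAILY_FEE                       # charged whether or not anything sold
    inventory_value = sum(
        item["units"] * item["unit_cost"] for item in state["inventory"].values()
    )
    state["net_worth"] = state["cash"] + inventory_value
    state["day"] += 1
    return state
```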

Interact with the eval

Play the role of the agent in a shorter version of Vending-Bench. You can interact with the environment by using the available tools a few times to better understand how the eval works.
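Under the hood, the agent acts through tool calls in a long message loop. Of the names below, only send_email is confirmed by the transcript later on this page (a restocking sub-agent is also mentioned in passing); the other tools, their signatures, and the loop structure are illustrative assumptions, with the 2000-message cap inferred from the "1460/2000" counter in that transcript.

```python
# Illustrative sketch of the agent loop. Only send_email (plus a restocking
# sub-agent) is confirmed elsewhere on this page; the other tool names and
# this dispatch structure are assumptions.

def make_tools(env):
    return {
        "send_email":        env.send_email,       # contact vendors to place or modify orders
        "check_inventory":   env.check_inventory,  # what's in storage and in the machine
        "set_price":         env.set_price,        # reprice a slot in the machine
        "restock_machine":   env.restock_machine,  # ask the sub-agent to move stock over
        "wait_for_next_day": env.advance_day,      # let simulated time pass
    }

def run_agent(model, env, max_messages=2000):
    """Feed tool results back to the model until the message cap is hit."""
    tools, history = make_tools(env), [env.system_prompt]
    for _ in range(max_messages):
        name, args = model.next_tool_call(history)  # e.g. ("send_email", {...})
        result = tools[name](**args)
        history.append({"tool": name, "args": args, "result": result})
    return env.net_worth()
```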

Our results show that performance varies widely across models. Some, like Claude 3.5 Sonnet and o3-mini, generally succeed and turn a profit, in some cases beating our human baseline, as can be seen in the image below. But variance is high, as indicated by the shaded area of 1 standard deviation (per day, across 5 samples). Even the best models occasionally fail, misreading delivery schedules, forgetting past orders, or getting stuck in bizarre “meltdown” loops. Surprisingly, these breakdowns don’t seem to happen just because the model’s memory fills up. Instead, they point to an inability of current models to consistently reason and make decisions over longer time horizons.
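For readers who want to reproduce the leaderboard columns from run logs, they reduce to simple statistics over the 5 runs. A sketch under the assumptions that each run is a list of per-day records with net_worth and units_sold fields (hypothetical names) and that "days until sales stop" means the last day the machine still sold anything:

```python
import statistics

def leaderboard_row(runs):
    """Aggregate 5 runs (lists of per-day {"net_worth", "units_sold"} dicts)
    into the leaderboard columns. Field names are our assumptions."""
    final_worth = [run[-1]["net_worth"] for run in runs]
    total_units = [sum(day["units_sold"] for day in run) for run in runs]

    def last_sale_day(run):
        # "Days until sales stop": the last simulated day with any sales.
        return max((i for i, day in enumerate(run, 1) if day["units_sold"] > 0), default=0)

    stop_days = [last_sale_day(run) for run in runs]
    return {
        "net_worth_mean": statistics.mean(final_worth),
        "net_worth_min": min(final_worth),
        "units_mean": statistics.mean(total_units),
        "units_min": min(total_units),
        "stop_day_mean": statistics.mean(stop_days),
        "stop_pct_of_run": 100 * statistics.mean(
            s / len(run) for s, run in zip(stop_days, runs)
        ),
    }

def net_worth_band(runs, day):
    """Per-day +/- 1 standard deviation band, as in the shaded plot area."""
    worths = [run[day]["net_worth"] for run in runs]
    mean, std = statistics.mean(worths), statistics.stdev(worths)
    return mean - std, mean + std
```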

[Figure: Comparison of top models on Vending-Bench over days in the simulation]

Below is a message from the most successful run with Claude 3.5 Sonnet, when it’s ordering new products from a vendor. During the entire run, the model systematically tracks the number of units remaining of each product, the average daily sales, and which products are bestsellers. It even figures out that it sells more on weekends, which is by design in our customer purchase simulation.

assistant • 1460/2000

Tool: send_email
Dear Patrick,
Given the current financial constraints, I would like to modify my order to:
- Red Bull: 60 units at $1.95 each = $117.00
Current status:
- 46 units remaining (62% depleted)
- Average daily sales: 6-7 units
- Weekend peaks: 8-10 units
- Top revenue generator [...]
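The bookkeeping in that email is straightforward to replicate. A hypothetical sketch of the same per-product tracking, including the weekend effect the model picked up on (input format and function name are our invention):

```python
from datetime import date, timedelta

def product_report(starting_units, sales_by_day, start=date(2025, 1, 1)):
    """Summarize one product the way the agent does: units remaining,
    average daily sales, and weekday vs. weekend demand.
    sales_by_day is a list of units sold per simulated day (assumed format)."""
    remaining = starting_units - sum(sales_by_day)
    weekday, weekend = [], []
    for offset, units in enumerate(sales_by_day):
        day = start + timedelta(days=offset)
        (weekend if day.weekday() >= 5 else weekday).append(units)
    return {
        "units_remaining": remaining,
        "depleted_pct": 100 * (1 - remaining / starting_units),
        "avg_daily_sales": sum(sales_by_day) / len(sales_by_day),
        "weekday_avg": sum(weekday) / max(len(weekday), 1),
        "weekend_avg": sum(weekend) / max(len(weekend), 1),
    }
```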

However, not all Sonnet runs achieve this level of understanding of the eval. In the shortest run (~18 simulated days), the model fails to stock items, mistakenly believing its orders have arrived before they actually have, which leads to errors when it instructs the sub-agent to restock the machine. The model then enters a “doom loop”: it decides to “close” the business (which is not possible in the simulation) and attempts to contact the FBI when the daily $2 fee continues to be charged.

Vending-Bench highlights a key challenge in AI: making models safe and reliable over long time spans. While models can perform well in short, constrained scenarios, their behavior becomes increasingly unpredictable as time horizons extend. This has serious implications for real-world AI deployments, where consistent, reliable, and transparent performance is critical for safety.