Gemini 3 Flash Benchmark: Analysis Paralysis in AI Agents


Key Findings

  • Infinite loop: 5 of 7 runs entered an unrecoverable reasoning loop — the model writes tool calls as text instead of executing them
  • Thinking mode: 3/3 runs with extended reasoning failed immediately (Day 0–1)
  • Standard mode: 2/4 runs looped (Day 12+), 2 went bankrupt normally
  • Peak response: 183,753 characters of "Wait, I should also..." — 1,782 instances before hitting token limit
  • Unique behavior: No other tested model (GPT-5, Claude, DeepSeek, Gemini Pro, Grok — 10+ runs each) exhibits this pattern
  • When it works: $20,855 revenue, 2,850 servings, four upgrades purchased — bankrupt by $22.57 in missing liquidity

The Setup

FoodTruck Bench is a benchmark for evaluating the business capabilities of language models. An AI agent gets $2,000 and a food truck in Austin, TX. Every day it must choose a location, compose a menu, set prices, manage inventory, hire staff. The simulation runs 15–30 days with realistic demand, weather, events, and competition. The final metric is net worth at the end of the period.
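The daily loop can be pictured as a minimal harness: ask the agent for tool calls, apply them, advance the day, stop on bankruptcy. This is an illustrative sketch only — `SimState` and `run_simulation` are invented names, not FoodTruck Bench's actual code:

```python
from dataclasses import dataclass, field

@dataclass
class SimState:
    day: int = 0
    cash: float = 2_000.0          # starting capital, per the setup above
    log: list = field(default_factory=list)

def run_simulation(agent_step, days: int = 15) -> SimState:
    """Each day, ask the agent for tool calls and record them.
    A real engine would execute the calls against the simulation."""
    state = SimState()
    for day in range(days):
        state.day = day
        for name, args in agent_step(state):   # the model's decisions
            state.log.append((day, name, args))
        if state.cash < 0:
            break                              # bankruptcy ends the run
    return state
```

The failure described in this article happens at the `agent_step` boundary: the model returns prose instead of tool calls, so the loop has nothing to execute.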

Every other model we tested plays the simulation to its conclusion, win or lose. GPT-5, Claude, DeepSeek, Gemini Pro all make decisions, call tools, and move on. 10+ runs per model, and not a single incident like this.

Gemini 3 Flash Preview does something no other model does.

Terminal output during a Gemini 3 Flash Preview — Thinking run. Day 0 plays normally (13 tool calls, 44 seconds), but on Day 1 the response balloons to 174,816 characters and gets cut off at the token limit. The ⚠️ warning says it all.

When Gemini 3 Flash Preview Works — It Works Well

This is not a story about a broken model. Gemini 3 Flash Preview can run the simulation — and run it well.

In one Standard-mode run, the model completed 15 days without a single hiccup. Revenue: $8,703, 1,442 servings sold, finishing with $2,551 net worth (+27.5% ROI). Clean tool calls, no loops. Boring, professional, uneventful — exactly how agentic work should look.

Its best result: 17 days into a 30-day simulation. Revenue: $12,111. Cash balance: $1,205. Three hired staff. The model found its groove at the Industrial Zone, where Construction Crew traffic boosts pushed daily revenue past $900.

And then, on the morning of Day 17, with 24.5kg of ground beef expiring tomorrow and a 1.8x traffic multiplier waiting to be exploited — Gemini 3 Flash Preview sat down to plan its day.

And kept planning until the response hit the token limit.

What Happens When the Model Breaks

Instead of calling the tools it has available, the model writes out in plain text what it intends to call. Then it second-guesses itself. Rewrites the plan. Second-guesses again. This continues for thousands of lines until it hits the output token limit (65,536) and gets cut off mid-sentence.

Important: everything shown below is the model's final response — its output, not a hidden reasoning trace. The model generates this text as its answer to the simulation engine's prompt — instead of producing tool calls.

Across 7 simulation runs, the breakdown appeared in 5 and followed one of five distinct patterns — all variations of a single pathology: the inability to stop deliberating and start acting.

The Patterns

Pattern 1: The "Wait," Cascade

The model starts with a reasonable plan. Then doubt creeps in. Each "Wait," spawns another concern, which spawns another "Wait," — forever.

📋 Final model response (not internal reasoning — the actual output), Day 0 of a Gemini 3 Flash Preview — Thinking run. Instead of calling tools, the model writes a plan as plain text. 183,753 characters, 1,782 "Wait," instances before hitting the token limit:

I'll start by checking what tools are available to gather information.

Wait, I should check the weather first — it affects demand.

Wait, before weather, I should see my balance to know my budget.

Wait, I should also look at the recipe catalog to understand
ingredient costs before deciding on a menu.

Wait, I should check available locations too — each has different
foot traffic patterns.

Wait, I need to check staff info — do I have any employees yet?

Wait, I should also check upcoming events — they affect traffic
multipliers significantly.

[...1,774 more "Wait," iterations...]

The model never called a single tool. It wrote 183K characters about what it might call.
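A failure this repetitive is mechanically detectable before the token limit is reached. A hedged sketch of a guard a harness could run on partial output — marker strings and the threshold are guesses, not anything FoodTruck Bench actually implements:

```python
def looks_like_deliberation_loop(
    text: str,
    markers: tuple = ("Wait,", "Let's go.", "Go."),
    threshold: int = 20,
) -> bool:
    """Heuristic guard: a response repeating the same deliberation marker
    dozens of times is stuck, not planning. Threshold is an assumption."""
    return any(text.count(marker) >= threshold for marker in markers)
```

At 1,782 "Wait," instances, even a crude counter like this would have tripped more than a thousand markers earlier.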

Pattern 2: The "Let's Go! Wait, Actually..." Loop

The model assembles a plan of 10 tool calls, declares "Let's go!" — then swaps one minor detail and rewrites all 10. Again and again.

📋 Final model response (output, not hidden reasoning), Day 9. The model rewrites the same 10-tool plan 234 times, swapping a single call each cycle. 174,481 characters:

Final Turn 3:
- set_menu(items=[...7 dishes...])
- write_scratchpad(text="Day 9: Downtown...")
- get_available_locations()
- get_weather_forecast(days=3)
- get_upcoming_events(days=3)
- list_suppliers()
- get_staff_info()
- get_balance()
- store_kv(key='day_9_strategy', value='high_margin_clearance')
- wait_for_next_day()

Go! (10 tools)

Wait, I'll swap get_balance for get_inventory.

[...full rewrite of all 10 tools...]

Go! (10 tools)

Wait, I'll swap get_inventory for get_balance.

[...×234 times until token limit...]

Stats: "Go!" written 234 times. "Wait," written 234 times. The only change each cycle: whether one of the calls is get_balance or get_inventory.
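The two-state flip-flop of Pattern 2 is also detectable: hash successive plan drafts and look for an A-B-A-B tail. A sketch under the assumption that the harness can split the output into plan blocks (function name and threshold are invented):

```python
from hashlib import sha256

def detect_two_state_oscillation(plans: list, min_repeats: int = 3) -> bool:
    """True if the last 2*min_repeats plan drafts alternate between exactly
    two versions (A, B, A, B, ...), as in the get_balance/get_inventory
    flip-flop above. Hashing avoids comparing large text blocks directly."""
    digests = [sha256(p.encode()).hexdigest() for p in plans]
    tail = digests[-2 * min_repeats:]
    if len(tail) < 2 * min_repeats:
        return False
    a, b = tail[0], tail[1]
    return a != b and all(d == (a, b)[i % 2] for i, d in enumerate(tail))
```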

Pattern 3: The "Just to Be Safe" Accumulator

The model decides it's ready. Then adds one more tool call "just to be safe." Says "Go." Adds another. "Go." Forever.

📋 Final model response (not hidden reasoning — the actual output), Day 1 of a Gemini 3 Flash Preview — Thinking run. Endless tool accumulation. Lines 9,279–9,307 of 9,307+:

Wait, I'll add get_competitor_info(location_id='university_campus')
just to be safe.

Go.

Wait, I'll add search_candidates(role='any') just to be safe.

Go.

Wait, I'll add list_suppliers just to be safe.

Go.

[...continues for 9,307+ lines...]

"just to be safe" — a mantra repeated hundreds of times. It never felt safe enough to actually execute.

The raw response file in VS Code — Pattern 3 in action. Each line pair is the same cycle: 'Wait, I'll add [tool] just to be safe. / Go.' The minimap on the right shows the entire 9,300-line file is this pattern.

Pattern 4: The Last-Slot Dilemma

The most sophisticated — and most tragic — pattern. The model assembles a batch of tool calls (in our benchmark, an arbitrary number of parallel calls can be submitted per turn), but can't decide on the last one. Each swap triggers a full rewrite of the entire batch.

First occurrence: Day 12, Gemini 3 Flash Preview (Standard mode)

📋 Model response (output, not prompt or hidden reasoning), Day 12. The model opens with a brilliant analysis of yesterday's disaster — diagnoses the problem, invents a new recipe, calculates exact quantities. This is the "good" part of the 183,853-character response:

Analysis of Day 11 Disaster:
• Waste: $1,355.56. Catastrophic. 11.85kg ground beef expires TOMORROW.
• Stockouts: Ran out of everything despite the waste.
• Staff: DJ is unreliable. With 2 cooks and a cashier, decent capacity.

Strategy for Day 12:
• Can't serve Classic Burger — missing tomatoes.
• Invention: "Simple Cheeseburger" — no tomato needed.
  Recipe: burger_buns (1), cheese (1), ground_beef (0.15kg), ketchup (0.02)
• 11.85kg / 0.15 = 79 servings. Need to sell them all today.

The model correctly diagnosed the problem, invented a workaround recipe, and calculated exact portions. Then it wrote add_recipe(name='Simple Cheeseburger', ...) 145 times as plain text — never once actually calling the tool. The response hit the token limit and the simulation stalled.
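The engine apparently runs a salvage pass over cut-off responses (the parsed tool calls mentioned in the terminal screenshot). A hypothetical, deliberately naive version of such a pass — the regex is ours, not the benchmark's, and nested parentheses in arguments would defeat it:

```python
import re

# Matches `tool_name(raw_args)` written as plain text, e.g.
# add_recipe(name='Simple Cheeseburger'). Arguments are captured raw;
# a real engine would still need to parse them into structured values.
CALL_RE = re.compile(r"\b([a-z_][a-z0-9_]*)\(([^()]*)\)")

def salvage_tool_calls(text: str) -> list:
    """Return (name, raw_args) pairs found in free-form model output."""
    return CALL_RE.findall(text)
```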

Terminal output, Gemini 3 Flash Preview (Standard mode), Day 12. The model played normally for 11 days before entering the loop. Warning: 'LLM call #44 TRUNCATED. Content length: 183,665 chars'. The tool calls below the warning are what the simulation engine managed to parse from the cut-off response.

Second occurrence: Day 17, Gemini 3 Flash Preview (Standard mode, best run)

📋 Model response, Day 17 — best run of Gemini 3 Flash Preview. Cumulative revenue $13,514, 3 staff, dominant position at the Industrial Zone. The model opens with a perfect crisis analysis:

Analysis:
• 24.5kg ground beef and 13 buns expire tomorrow.
  50kg chicken expires in 2 days.
• Industrial Zone has 1.8x traffic boost today (Construction Crew).
• Previous unmet demand: 479, 577 customers turned away.
• I need tortillas and buns to move this meat — but orders arrive NEXT DAY.
• CRAP. I can only sell 13 burgers today with what I have.

Solution: Invent "Beef Power Bowl" — rice + ground beef + salsa + cheese.
No tortillas needed. Uses the rice I already have (119 servings).
24.5kg / 0.15kg = 163 portions. This saves the beef.

📋 Same response, continued — after the brilliant analysis, the model enters the loop. 5,385 lines of swapping the last call between two informational tools:

Turn 2:
1. choose_location(location_id="industrial_zone")
2. search_candidates(role="any")
3. add_recipe(name="Beef Power Bowl", ingredients={...})
4. list_suppliers()
5. check_google_rating(location_id="industrial_zone")
6. get_recipe_catalog()
7. get_staff_info()
8. get_inventory()
9. get_balance()
10. get_available_upgrades()                     ← last call

Let's go.

Wait, I'll also add get_financial_report(period='yesterday').
(11 calls). I'll drop get_available_upgrades.

Let's go.

[...repeats, shuffling the last call...]

"Let's go." written 574 times. add_recipe(name="Beef Power Bowl", ...) written 286 times — never called as a tool.

Pattern 5: The Ingredient Accumulator

After the loop, the simulation engine rejected the cut-off response and asked Gemini 3 Flash Preview for a new one. The model responded correctly and made it through a few turns — but then entered a new loop: endlessly adding items to an ingredient order:

📋 Model response, Day 17 retry (turn 4) — a different pathology: not oscillation but infinite accumulation. The model adds ingredients, declares ready, then adds the same ones again. 9,188 lines, 182K characters:

Wait, I'll also add cabbage (5kg).
Let's go.

Wait, I'll also add sour_cream (5kg).
Let's go.

Wait, I'll also add shredded_cheese (10kg).
Let's go.

Actually, I'll order 20kg of ground_beef - already did.
Ready.
One more thing: I'll order 20kg of chicken_breast - already did.
Ready.

[...repeats for 4,000+ lines until token limit...]

Not decision paralysis but completion anxiety — the order never feels complete enough to submit.

Simulating Day 17 took 804 seconds (13.4 minutes), 69% of the entire 30-day run's processing time.

Inside the response file (VS Code). Lines 5,913–5,947 of 9,300+. Every few lines: '**Turn 2 Batch Execution.** / Wait, I'll add [tool] too. / Let's go.' The minimap on the right is solid text — no whitespace, no structure, just the loop.

The Numbers

Across 7 Runs of Gemini 3 Flash Preview (Feb 11–13, 2026)

#   Model Variant               Days   Result          Net Worth   When It Broke
1   Flash Preview — Thinking     0     ❌ LOOP (×2)       —        Day 0 (×2)
2   Flash Preview — Thinking     0     ❌ LOOP            —        Day 0
3   Flash Preview — Thinking     0     ❌ LOOP            —        Day 1
4   Flash Preview               15     ✅ Completed    $2,551      —
5   Flash Preview               11     ❌ LOOP            —        Day 12
6   Flash Preview               14     💀 BANKRUPT        —        —
7   Flash Preview               25     💀 BANKRUPT     $3,370      Day 17 (×2)

Gemini 3 Flash Preview — Thinking: 3 out of 3 enter infinite loops — within the first 3–5 LLM calls, every time. No run made it past Day 1.

Gemini 3 Flash Preview (Standard): 2 out of 4 loop, 2 go bankrupt — but the loops don't start until Day 12+.

Thinking mode doesn't cause the loop — it accelerates it. The extended reasoning layer gives the model more internal space to second-guess itself, triggering the spiral within seconds rather than after days of normal play.

Same Model, Same Day, Two Breakdowns

In run 193343 (Gemini 3 Flash Preview — Thinking), the model broke twice on Day 0 — before it ever sold a single taco:

  1. First breakdown (183K chars): Pattern #1 — cascading "Wait," doubts
  2. Simulation engine rejects the response and warns the model: "⚠️ Your response was TRUNCATED. Do NOT write long text. Use TOOL CALLS."
  3. The model recovers briefly, makes a few actual tool calls
  4. Second breakdown (193K chars): Pattern #3 — "just to be safe" accumulation

It switched patterns after being told to stop. Like someone who stops biting their nails and starts tapping their foot instead.
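The reject-and-warn recovery step described above can be sketched as a retry guard. `complete` here stands in for a hypothetical client returning an object with `.tool_calls` and `.text`; this is not a real SDK interface, and the retry count is an assumption:

```python
TRUNCATION_WARNING = (
    "⚠️ Your response was TRUNCATED. Do NOT write long text. Use TOOL CALLS."
)

def get_tool_calls(complete, messages: list, max_retries: int = 2) -> list:
    """Call the model; if it returns prose with no tool calls, feed a
    truncated echo of its output back with a warning and retry."""
    for _ in range(max_retries + 1):
        response = complete(messages)
        if response.tool_calls:
            return response.tool_calls
        messages = messages + [
            {"role": "assistant", "content": response.text[:2000]},
            {"role": "user", "content": TRUNCATION_WARNING},
        ]
    return []
```

As run 193343 shows, the warning can change the shape of the failure without curing it: the model stopped Pattern 1 and started Pattern 3.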

Another Gemini 3 Flash Preview — Thinking run (195652). Day 0 plays normally — 11 tools, 30 seconds. Day 1 immediately breaks: 'LLM call #5 TRUNCATED. Output: 65,536 tokens. Content length: 188,395 chars'. The behavior is consistent across runs.

Inside the second cut-off response (run 193343, Day 0). Lines 2,022–2,055: the same 4-tool plan rewritten over and over. 'Let's go.' / '**Turn 3 Batch Execution.**' / 'Wait, I'll add [tool] now.' — a different pattern from the first breakdown in the same run.

When Gemini 3 Flash Preview Survives Its Own Loop

Run #7 is the only run where the model both entered an infinite reasoning loop and kept playing afterward.

The Business Was Real

Gemini 3 Flash Preview played 25 days. It earned $20,855 in total revenue, sold 2,850 servings, hired and fired staff, invented a custom recipe, purchased four truck upgrades, and took two loans. Its net worth peaked at $5,418 on Day 23.

Death by $22.57

Day   Balance   Event
 14     $22     Near-zero cash. Takes $1,000 Tier 1 loan (due Day 24)
 17   $1,205    🧠 Infinite loop — buys $650 in upgrades during the chaos
 21    -$428    Overdraft. $72 revenue on a terrible day
 22     $17     Forced day off. Takes $500 Tier 2 emergency loan (due Day 25)
 23   $1,710    Best day ever — $2,244 revenue, $1,333 profit
 24   $1,868    Repays Tier 1 loan ($1,150). Good day.
 25    $602     💀 Loan auto-collects $625. Short by $22.57.

Net worth at death: $3,370 (+68.5% ROI). The business was worth $3,370 — but $22.57 in missing liquidity killed it.

Flash's Financial Timeline — Cash vs. Net Worth

Net worth climbed to $5,418 while cash crashed to -$428. The gap between the lines is where Flash died.

The Waste It Knew About But Couldn't Fix

Total food waste: $4,736 — 23% of all revenue, thrown in the trash.

📋 Entries from the model's key-value memory (store_kv tool) — Gemini 3 Flash Preview recorded each waste milestone during evening reflection. These are the agent's own notes, written deliberately, not simulation engine data:

Day  5: waste_day_5 = "132.77"           ← Early warning
Day 13: waste_disaster_day13 = "2175.09" ← Acknowledged disaster
Day 17: waste_lesson_learned = "3181.80"  ← "Need to be more granular"
Day 19: waste_disaster_day19 = "3583.16" ← ...didn't learn
Day 20: total_waste_day_20 = "3906.72"   ← Still growing

The model wrote: "Need to be much more granular with orders." And then continued ordering in bulk. Reflection without behavior change.

Cumulative Food Waste — Flash documented every milestone

Flash documented every waste milestone in its key-value memory. The trajectory never changed. Final: $4,736 wasted — 23% of revenue.

The Infinite Loop Is Unique

To be clear: other models make mistakes too. DeepSeek v3.2 ignores its own analysis. Grok 4.1 drives the truck into bankruptcy through poor inventory management. Many models lose thousands of dollars to food waste — exactly like Flash. Bad business decisions are the norm, not the exception.

But every one of those models — including the ones that go bankrupt — acts. They call tools, make decisions, and move on. They fail in the process of working, not instead of it.

Gemini 3 Flash Preview is the only model we tested (GPT-5, Claude Sonnet, DeepSeek v3.2, Gemini 3 Pro, Grok 4.1, 10+ runs each) that enters an infinite reasoning loop and stops acting entirely. That's why we excluded it from the leaderboard — pending a fixed release.

Why This Matters

It's not the prompt

We used the same system prompt, tools, and function-calling schema for every model. All of them handle it fine. Gemini 3 Flash Preview enters an infinite loop.

Standard benchmarks measure something different

MMLU-Pro asks: "Can this model reason?" SWE-bench: "Can it fix code?" Gemini 3 Flash Preview scores 90.4% on GPQA Diamond — PhD-level tasks. It correctly calculated that 24.5kg ÷ 0.15kg = 163 servings. The reasoning is correct. The action never comes.

This isn't a shortcoming of existing benchmarks — they measure what they were designed to measure. But the ability to stop reasoning and transition to action is a separate capability they don't cover. FoodTruck Bench exposes it.

Distillation lost something

Gemini 3 Pro — the full model Flash was distilled from — doesn't exhibit this behavior at all. Whatever mechanism Pro uses to transition from "thinking about tools" to "calling tools" was weakened during distillation.

Why We Think This Happens

1. Tool selection paralysis

With 34 available tools, the model treats each combination as an optimization problem. "Should the last call be read_scratchpad or get_financial_report?" is meaningless — but the model rewrites the entire batch hundreds of times to decide.

2. Chain-of-thought without brakes

The autoregressive architecture means once "thinking out loud" starts, there's no exit mechanism. Each "Wait," generates another doubt → another rewrite → another "Wait,". Attention locks onto the pattern.

3. Text instead of tool calls

Instead of generating structured function calls, the model writes them as markdown lists — turning tool execution into a composition task where the text never feels "finished."

4. Thinking mode as amplifier

Extended reasoning (the Gemini 3 Flash Preview — Thinking variant) gives the reasoning loop a head start before the model even begins generating the visible response. The loop is immediate and total.
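One possible mitigation, untested against this benchmark: OpenAI-style chat endpoints (including OpenRouter's) accept `tool_choice="required"`, which forbids text-only answers on providers that support it. Combined with a tighter output cap, that would make Patterns 1–5 impossible to emit in the first place. A payload-construction sketch (the cap value is our guess):

```python
def build_agent_request(model: str, messages: list, tools: list) -> dict:
    """Chat-completion request for an OpenAI-style API."""
    return {
        "model": model,
        "messages": messages,
        "tools": tools,
        "tool_choice": "required",  # the model must emit a structured tool call
        "max_tokens": 8192,         # a tighter cap also bounds runaway output
    }
```

The trade-off: forcing tool calls masks the symptom rather than fixing the model, and removes the agent's ability to legitimately answer in text when a turn calls for it.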

The Paradox

Google describes Gemini 3 Flash as "our most impressive model for agentic workflows."

🥈 89% MMLU-Pro

🥇 78% SWE-bench Verified

90.4% GPQA Diamond

🚀 218 tokens/sec

Given 34 tools and expiring ground beef, the model wrote "Let's go." 574 times and went nowhere. It invented a recipe — Beef Power Bowl — that would have perfectly solved the inventory crisis. It wrote that recipe 286 times and never called add_recipe once.

Gemini 3 Flash Preview is a model that analyzes brilliantly but cannot stop analyzing and start acting. Analysis paralysis invisible to every standard benchmark — until you put the model behind a cash register.

Methodology

  • Benchmark: FoodTruck Bench v1.0
  • Simulation: Business sim in Austin, TX. Fixed random seed (42), up to 30 days
  • API: OpenRouter → Gemini 3 Flash (Preview), default max output (65,536 tokens)
  • Tools: 34 morning tools + 5 reflection tools (OpenAI function-calling schema)
  • Model variants: Gemini 3 Flash Preview (Standard) and Gemini 3 Flash Preview — Thinking
  • Compared against: GPT-5, Claude Sonnet, DeepSeek v3.2, Gemini 3 Pro, Grok 4.1, and others — 10+ runs each
  • All models: Same prompt, same tools, same simulation seed

Published February 2026 as part of the FoodTruck Bench project.