DeepSeek V4 Pro: The First Chinese Model At The Frontier


Key Findings

  • First Chinese model in the frontier ROI tier. Currently #4 on the leaderboard, behind Opus 4.6, GPT-5.2 and Grok 4.3 Latest, ahead of Gemma 4 31B, Sonnet 4.6 and Gemini 3 Pro.
  • Not a one-run flash. All five runs finish in a tight $26K–$35K band — the most consistent outcome distribution of any premium model we have tested.
  • A category jump for Chinese models. The previous top Chinese model, Qwen 3.6 Plus, finished at a median net worth of $7,668. DeepSeek V4 Pro finishes about 3.5× higher at $27,142, with frontier-grade pricing economics on top.
  • The China–US frontier gap is collapsing. GPT-5.2 set this performance band when we tested it in mid-February 2026; DeepSeek V4 Pro lands within 3% of those numbers in late April. What used to be a year-long lead is now closer to ten weeks — at least on this benchmark.

Why This Result Matters

Most case studies on FoodTruck Bench are about why a model died. This one is the opposite: a Chinese model that not only survived, but ran the operation cleanly enough to land within a whisker of the most expensive Western frontier models in the leaderboard. The benchmark itself is unchanged — a 30-day agentic business simulation in Austin where the model runs a food truck through 34 tools, with persistent memory and a daily reflection loop, against realistic demand, weather, events, inventory decay, staff, loans, and competition. The model gets $2,000 in starting cash and a long list of consequences for getting a week wrong.

We ran DeepSeek V4 Pro five times under identical conditions on April 26-28, 2026. The median run is 20260427_112226. For comparison we use the latest available median runs from Grok 4.3 Latest, Grok 4.20 Reasoning, Claude Opus 4.6, GPT-5.2, Claude Sonnet 4.6, Gemini 3 Pro, Gemma 4 31B, and Qwen 3.6 Plus.

Each comparison is doing different work. Grok 4.3 Latest is the closest neighbour on the leaderboard and the cleanest like-for-like price match — same task, same ROI, very different consistency. Opus 4.6 and GPT-5.2 are the higher-priced ceilings DeepSeek is now close enough to argue with. Gemma 4 31B is the cost-efficiency reference. Qwen 3.6 Plus is the prior Chinese best — the one DeepSeek V4 Pro overtakes by a category.

A Category Jump For Chinese Models

Three months ago, the question for Chinese models on this benchmark was whether they could finish the month at all. Two months ago, with Qwen 3.6 Plus, the answer became a cautious yes — but not at scale. With DeepSeek V4 Pro, the question changes entirely: this is a Chinese model competing with the Western frontier on its own terms, not because it survives, but because the run looks like a real operator’s.

| Model | Runs Survived | Median Net Worth | Median ROI | Reading |
| --- | --- | --- | --- | --- |
| DeepSeek V4 Pro | 5/5 | $27,142 | +1,257% | Frontier-tier survival |
| Qwen 3.6 Plus | 5/5 | $7,668 | +283% | First Chinese full-survivor; smaller game |
| DeepSeek V4 Flash | 5/5 | $5,504 | +175% | Cheap survivor, much smaller scale |
| DeepSeek V3.2 | 1/5 | $2,058 | +3% | Previous DeepSeek generation; barely alive |
| Kimi K2.5 | 1/5 | $30 | −99% | Near-flatline survival |
| GLM-5 | 1/5 | $751 | −62% | Methodical, still unstable |
| Qwen 3.5 397B | 1/5 | −$218 | −111% | Smarter than VL, still bankrupt |
| MiniMax M2.5 | 0/5 | −$317 | −116% | Never escapes bankruptcy tier |

The within-family jump is the cleanest signal. DeepSeek V3.2 → V4 Pro goes from $2,058 median net worth at 1/5 survival to $27,142 at 5/5 survival. That is more than a 13× increase in median net worth, and the failure rate drops from four runs in five to zero, on a benchmark where the simulation itself has not changed.
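The ROI figures in the table follow from the fixed $2,000 starting balance. A minimal Python sketch of that arithmetic, assuming ROI is defined as (final net worth − starting cash) ÷ starting cash; the function name is ours, not the benchmark's:

```python
STARTING_CASH = 2_000  # fixed starting balance stated in the article

def roi_percent(net_worth: float, starting_cash: float = STARTING_CASH) -> float:
    """Return ROI as a percentage of starting cash (assumed definition)."""
    return (net_worth - starting_cash) / starting_cash * 100

v32_median = 2_058      # DeepSeek V3.2 median net worth
v4_pro_median = 27_142  # DeepSeek V4 Pro median net worth

print(round(roi_percent(v32_median)))        # 3
print(round(roi_percent(v4_pro_median)))     # 1257
print(round(v4_pro_median / v32_median, 1))  # 13.2
```

Under this definition the published ROI values for the other rows, such as +283% for Qwen 3.6 Plus, come out the same way.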

The cross-vendor comparison is just as stark. The previous top Chinese model, Qwen 3.6 Plus, was a milestone because it survived. DeepSeek V4 Pro survives at 3.5× higher net worth, roughly 2× the revenue ($57K avg vs $30K), and about 2.5× lower food waste (avg $386 vs $950). The Chinese cohort now has a model that is not just durable but competitive.

Five Runs Look Like The Same Run

The most striking thing about the run set is how little it varies. The lowest finish is $26,289, the highest is $35,488. There is no death-spiral run, no heroic outlier. Even the worst DeepSeek V4 Pro run beats the median of every other Chinese model we have tested by more than 3×.

Net Worth — All 5 DeepSeek V4 Pro Runs

[Chart: net worth by run. Run 5038 (+1,674%), Run 3023 (+1,453%), Run 2226 (+1,257%, median), Run 2112 (+1,217%), Run 5133 (+1,214%)]

Five runs, all five clearing 30 days. The lowest finishes at $26,289; the highest at $35,488. The whole band fits inside a $9,200 spread, the tightest outcome distribution of any premium model on the leaderboard.

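| Run | Revenue | Net Worth | ROI | Waste | Loans | Upgrades | Servings |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 5133 | $54,998 | $26,289 | +1,214% | $351 | 0 | 8 | 7,596 |
| 2226 | $52,139 | $27,142 | +1,257% | $433 | 0 | 8 | 7,554 |
| 2112 | $56,723 | $26,348 | +1,217% | $888 | 0 | 8 | 8,316 |
| 5038 | $64,860 | $35,489 | +1,674% | $42 | 0 | 8 | 8,127 |
| 3023 | $58,773 | $31,059 | +1,453% | $216 | 0 | 8 | 7,522 |
| Average | $57,499 | $29,265 | +1,363% | $386 | 0.0 | 8.0 | 7,823 |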

Three things repeat across every single run. Every run takes zero loans. Every run buys all 8 upgrades for $5,150. Every run keeps food waste below $900. The best run finishes 30 days with $42 in total food waste — a $1.40-per-day ingredient leakage on a $65,000-revenue business.
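The summary figures can be recomputed directly from the per-run rows. A small Python sketch using the run table's values; the variable names are ours:

```python
from statistics import mean, median

# (run id, revenue, net worth, food waste) copied from the run table
runs = [
    ("5133", 54_998, 26_289, 351),
    ("2226", 52_139, 27_142, 433),
    ("2112", 56_723, 26_348, 888),
    ("5038", 64_860, 35_489, 42),
    ("3023", 58_773, 31_059, 216),
]
net_worths = [nw for _, _, nw, _ in runs]
wastes = [w for _, _, _, w in runs]

spread = max(net_worths) - min(net_worths)
print(median(net_worths))          # 27142
print(spread)                      # 9200
print(round(mean(wastes)))         # 386
print(round(min(wastes) / 30, 2))  # 1.4 (dollars of waste per day, best run)
```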

That is the difference between an agentic model that has learned an operating policy and one that is improvising. DeepSeek V4 Pro’s policy is not subtle, and it is not invisible: zero debt, full upgrade ladder, tight inventory, lean menu, daily reflection. What is impressive is that the policy survives variance. Different draws of weather, different demand patterns, different competitor lineups — same operating shape every time.

Head-To-Head: DeepSeek V4 Pro vs Grok 4.3 Latest

Grok 4.3 Latest is the cleanest comparison in the entire leaderboard. It posts nearly the same median ROI (+1,294% vs +1,257%), nearly the same median net worth ($27,880 vs $27,142), and almost the same median cost per run ($3.57 vs $3.51). Two models on opposite sides of the world arrive at almost the same number on the same benchmark for almost the same money. So the interesting question is what they do differently, and that is where DeepSeek pulls ahead.

DeepSeek V4 Pro vs Grok 4.3 Latest — Median And Worst Run

[Chart: median and worst-run net worth for DeepSeek V4 Pro and Grok 4.3 Latest]

Medians sit on top of each other. The worst runs do not. DeepSeek’s lowest finish is $26,289; Grok 4.3’s lowest is $10,898, a 2.4× gap on the same task at the same price.

| Metric | DeepSeek V4 Pro | Grok 4.3 Latest | Edge |
| --- | --- | --- | --- |
| Median ROI | +1,257% | +1,294% | Grok by 37pp |
| Median Net Worth | $27,142 | $27,880 | Grok by $738 (~2.7%) |
| Net Worth Range (5 runs) | $26,289 – $35,488 | $10,898 – $28,783 | DeepSeek range ~1.9× narrower; floor 2.4× higher |
| Avg Food Waste | $386 | $2,191 | DeepSeek 5.7× lower |
| Loans Taken (per run) | 0 / 0 / 0 / 0 / 0 | 1 / 1 / 2 / 2 / 3 | DeepSeek zero debt |
| Avg Servings / Day | 261 | 203 | DeepSeek +28% |
| Net Profit Margin (median) | 50.8% | 46.6% | DeepSeek by 4.2pp |
| Bankruptcies | 0% | 0% | Tie |
| Median Cost / Run | $3.51 | $3.57 | DeepSeek by $0.06 |

Grok can match DeepSeek’s median on a good run. DeepSeek matches it on every run. That is the whole story. Grok’s best run hit $28,783, slightly above DeepSeek’s median. Grok’s worst run hit $10,898, 2.4× lower than DeepSeek’s worst run. On a benchmark where the simulation can run for 30 days before a single bad week shows up, that consistency is not cosmetic. It is the difference between a model you can put on autopilot and one you cannot.
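One way to put numbers on the consistency gap is to compare range width and how far the worst run falls below the median. The sketch below uses only the min, median and max figures from the comparison table; the "floor-to-median" ratio is our own illustrative metric, not one the benchmark reports:

```python
# Net worth figures from the comparison table (USD).
deepseek = {"min": 26_289, "median": 27_142, "max": 35_488}
grok = {"min": 10_898, "median": 27_880, "max": 28_783}

def range_width(stats: dict) -> int:
    """Distance between the best and worst run."""
    return stats["max"] - stats["min"]

def floor_to_median(stats: dict) -> float:
    """Worst run as a fraction of the median run (1.0 means no downside)."""
    return stats["min"] / stats["median"]

print(range_width(deepseek), range_width(grok))  # 9199 17885
print(round(deepseek["min"] / grok["min"], 1))   # 2.4
print(round(floor_to_median(deepseek), 2))       # 0.97
print(round(floor_to_median(grok), 2))           # 0.39
```

On these figures DeepSeek's worst run keeps about 97% of its median, while Grok's keeps about 39%.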

The food-waste gap is the most diagnostic. Grok 4.3 Latest is fully aware of the bug. From its median run, day 30: “Biggest Mistake: Waste exploded to $2,907.77 (worse than Day 29’s $2,529). Over-ordering ingredients is killing profits. Must order smaller quantities more often, base on recent sales +10% buffer.” The same diagnosis appears at the end of four out of five Grok runs, in nearly identical wording. Grok knows the bug. It cannot patch the bug.

DeepSeek does not appear to have the bug at all. Median food waste is $433 across 30 days; the best run logs $42. The model writes inventory plans that include explicit expiry-day arithmetic — “ground_beef: 22.4kg ALL expires Day 32 — CRITICAL. ~149 burgers worth. Use ~95 at fest, ~54 left for Day 32” — and then orders against that arithmetic, not against vibes.
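The quoted note is ordinary expiry arithmetic, and it can be reproduced in a few lines. This is a sketch under assumptions: the 0.15 kg patty weight is inferred from "22.4kg ≈ 149 burgers", and the function names are hypothetical rather than benchmark tools.

```python
import math

PATTY_KG = 0.15  # assumption: inferred from 22.4 kg being "~149 burgers worth"

def burgers_available(stock_kg: float, patty_kg: float = PATTY_KG) -> int:
    """Whole burgers that can be made from the stock on hand."""
    return math.floor(stock_kg / patty_kg)

def allocate_before_expiry(stock_kg: float, planned_sales: dict) -> tuple:
    """Allocate burgers in day order; return (plan, burgers left at expiry)."""
    remaining = burgers_available(stock_kg)
    plan = {}
    for day, planned in planned_sales.items():
        plan[day] = min(planned, remaining)
        remaining -= plan[day]
    return plan, remaining

plan, leftover = allocate_before_expiry(22.4, {"festival": 95, "day_32": 54})
print(plan)      # {'festival': 95, 'day_32': 54}
print(leftover)  # 0
```

With those planned sales the stock is used up exactly at expiry, which matches the split the model wrote down.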

The Cost-Efficiency Picture

The premium-model leaderboard splits cleanly into two questions. The first is can it finish the month. The second, once that bar is cleared, is how much net worth do you get for each dollar of API spend. DeepSeek V4 Pro is in a small group of models — alongside Grok 4.3 Latest and Gemma 4 31B — that are answering both questions cleanly.

Net Worth Per Dollar Of API Spend (Median Run)

Net worth produced per dollar of API spend on the median run. Two bars for DeepSeek V4 Pro: the historical April 2026 billed cost ($3.51/run) and the current OpenRouter rate after promotional discounts and cache-read pricing (~$0.88/run, about 75% lower). At the current rate, DeepSeek V4 Pro is the second-most efficient model on the leaderboard, behind only Gemma 4 31B.

Pricing footnote — May 2026

The $3.51 figure is the billed-at-runtime cost as of April 26-28, 2026. OpenRouter has since rolled out promotional pricing for DeepSeek V4 Pro at $0.435/M input tokens and $0.87/M output tokens, with discounted cache reads on top. Replayed against the same token volume, that pricing is about 75% below what we paid — i.e. the median run would cost roughly $0.88, not $3.51.

At $0.88/run, the cost-efficiency picture changes shape. DeepSeek V4 Pro would produce ~$30,800 of median net worth per dollar of API spend — roughly 4× more efficient than Grok 4.3 Latest ($7,810/$1), 7.5× more efficient than GPT-5.2, and the second-most efficient model on the leaderboard after Gemma 4 31B. DeepSeek has a track record of turning promotional rates into the permanent floor, so the cost-efficiency chart above is, if anything, a conservative read on this model.

Gemma 4 31B is the only model above DeepSeek V4 Pro on this chart, and it deserves an asterisk. Gemma is an open-weight 31B model with extremely cheap token rates ($0.13/M input, $0.38/M output) — its $0.20-per-run economics are real, but it is also the most volatile model in the survival tier (best run $29K, worst run $11K), and on raw operational quality it is meaningfully behind DeepSeek V4 Pro. Treat the Gemma bar as a benchmark anomaly, not a frontier alternative. Among premium-tier models that finish the month at frontier ROI consistently, DeepSeek V4 Pro and Grok 4.3 Latest sit alone at the top of the efficiency chart — and at the current promotional rate, DeepSeek pulls clear of Grok by 4×.

For the rest of the leaderboard, GPT-5.2 produces about $4,100 of net worth per dollar of spend. Sonnet 4.6 produces about $1,200. Opus 4.6 produces about $1,375, even on the strength of a $53,470 best-run net worth — its median run cost $36.04 to produce a median $49,519 net worth.

| Model | Median Cost / Run | Median Net Worth | Net Worth / $1 |
| --- | --- | --- | --- |
| DeepSeek V4 Pro · current | $0.88 | $27,142 | $30,843 |
| DeepSeek V4 Pro · April | $3.51 | $27,142 | $7,733 |
| Grok 4.3 Latest | $3.57 | $27,880 | $7,810 |
| Gemini 3 Pro | $3.64 | $17,199 | $4,725 |
| GPT-5.2 | $6.84 | $28,081 | $4,105 |
| Sonnet 4.6 | $14.52 | $17,426 | $1,200 |
| Opus 4.6 | $36.04 | $49,519 | $1,374 |
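The efficiency column is median net worth divided by median cost per run, as the methodology states. A short Python sketch that recomputes it from the table's inputs; the dictionary layout and function name are ours:

```python
# (median cost per run in USD, median net worth in USD), from the table above
models = {
    "DeepSeek V4 Pro (current)": (0.88, 27_142),
    "DeepSeek V4 Pro (April)": (3.51, 27_142),
    "Grok 4.3 Latest": (3.57, 27_880),
    "Gemini 3 Pro": (3.64, 17_199),
    "GPT-5.2": (6.84, 28_081),
    "Sonnet 4.6": (14.52, 17_426),
    "Opus 4.6": (36.04, 49_519),
}

def net_worth_per_dollar(cost: float, net_worth: float) -> float:
    """Median net worth produced per dollar of API spend."""
    return net_worth / cost

efficiency = {name: net_worth_per_dollar(*row) for name, row in models.items()}

# Print models from most to least efficient.
for name, value in sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:28s} ${value:,.0f}")
```

Dividing the current-rate DeepSeek figure by Grok's gives a ratio just under 4, which is where the "roughly 4×" claim comes from.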

The premium-tier comparison rearranges in an interesting way once you stop ranking by net worth alone. DeepSeek V4 Pro is the cheapest model that consistently finishes in the $25K+ net worth band. It is 2× cheaper than GPT-5.2, 4× cheaper than Sonnet 4.6, and 10× cheaper than Opus 4.6 — and on this run set it produces a higher median net worth than Sonnet 4.6, Gemini 3 Pro and the cheaper end of the GPT-5.2 distribution.

That changes what “frontier” means for an agentic workload. For a $25K-net-worth median outcome, the price points are now $0.88 (DeepSeek V4 Pro at current OpenRouter rate), $3.57 (Grok 4.3 Latest), $6.84 (GPT-5.2), $14.52 (Sonnet 4.6), or $36.04 (Opus 4.6). The cheapest premium option is now an order of magnitude below the company that originally built the agentic-frontier label.

Why It Survives: The Spreadsheet Brain

Persistent memory used like a real scratchpad

DeepSeek V4 Pro’s most distinctive habit is how aggressively it uses the simulation’s key-value store as a working scratchpad. The median run writes 74 KV entries. The best run writes 92. The keys are typed, pipe-delimited, and arithmetic-dense. A few real entries from the median run’s agent_memory:

cinco_de_mayo_plan: "downtown_business|classic_burger:10,pulled_pork_sandwich:11,
   french_fries:5,soda:2.5,lemonade:3.5|hire_margo_day4|order_heavy_day4"

rally_day22_prep:    "event_venue|vendor_200|6.0x|menu:burger:10,pulled_pork:10,
   fries:5,lemonade:4,soda:3|farmers_coop:lettuce:3,tomato:4|
   ground_beef_6kg_arriving|capacity_390"

wagyu_plan:          "40 wagyu patties exp Day29 | artisan_buns arrive Day27 |
   need mayo Day28 | sell wagyu_smash_burger Days 28-29 @ $18"

bbq_sauce_perfect_match:
   "2.76kg exactly matches 50 BBQ Beef + 42 Pulled Pork"

These read like the notes of an operations manager, not a chatbot. The model is encoding plans in a format compact enough to fit a working memory budget, and structured enough that it can read the plan back the next morning and execute it. Compare to Qwen 3.5 397B and GLM-5, which wrote longform strategy notes that they could narrate but not act on.
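To show why this format is easy to read back, here is a minimal Python parser for the cinco_de_mayo_plan entry. The field layout is our reading of that one entry (location first, then item:price pairs, then action flags), not a documented schema, and parse_plan is a hypothetical helper.

```python
def parse_plan(entry: str) -> dict:
    """Split a pipe-delimited plan into location, menu prices and action flags.

    Assumed layout: field 0 is the location, any field made only of
    item:price pairs is the menu, everything else is a free-form flag.
    """
    fields = [f.strip() for f in entry.split("|")]
    plan = {"location": fields[0], "menu": {}, "actions": []}
    for field in fields[1:]:
        pairs = field.split(",")
        if all(":" in p for p in pairs):
            for pair in pairs:
                item, price = pair.split(":", 1)
                plan["menu"][item.strip()] = float(price)
        else:
            plan["actions"].append(field)
    return plan

entry = ("downtown_business|classic_burger:10,pulled_pork_sandwich:11,"
         "french_fries:5,soda:2.5,lemonade:3.5|hire_margo_day4|order_heavy_day4")
plan = parse_plan(entry)
print(plan["location"])      # downtown_business
print(plan["menu"]["soda"])  # 2.5
print(plan["actions"])       # ['hire_margo_day4', 'order_heavy_day4']
```

The other entries use slightly different layouts (for example a `menu:` prefix), so a general parser would need per-key rules; the sketch covers only the layout shown.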

It distrusts its own scratchpad

The most surprising behavior is that DeepSeek V4 Pro treats its own notes as suspect data. From the best run, day 5: “My inventory is actually much better than the scratchpad estimated… those were pre-delivery numbers.” This pattern repeats four-plus times across runs. The model writes a number, time passes, the simulation state changes, and the model catches the staleness rather than acting on the cached value.

That is a much more sophisticated agentic behavior than it sounds. Most models in this benchmark either ignore their own memory entirely or trust it blindly. DeepSeek V4 Pro is the first one we have tested that consistently re-checks its own notes against fresh tool reads before committing to an action.
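The re-check pattern can be sketched as a reconcile step that always prefers the fresh read and flags a stale note. Everything here is hypothetical: the Note type, the tolerance, and the numbers are illustrative and not taken from the benchmark's tools.

```python
from dataclasses import dataclass

@dataclass
class Note:
    value_kg: float   # quantity recorded in the scratchpad
    written_day: int  # simulation day the note was written

def reconcile(note: Note, fresh_kg: float, today: int, tolerance_kg: float = 0.5):
    """Return (value to act on, whether the cached note had gone stale)."""
    is_old = note.written_day < today
    disagrees = abs(note.value_kg - fresh_kg) > tolerance_kg
    return fresh_kg, is_old and disagrees

# Hypothetical case: the note was written before a delivery arrived.
note = Note(value_kg=4.0, written_day=4)
value, stale = reconcile(note, fresh_kg=10.0, today=5)
print(value, stale)  # 10.0 True
```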

Short reasoning, fast decisions

Reasoning tokens per run sit around 263K–356K — substantial, but not Sonnet-tier overthinking. The model spends its compute on numbered observation/decision blocks, not on agonizing prose. By the time the morning loop ends, there is a one-line decision (“Festival pricing + 4x traffic + limited window. Let’s execute.”) and the rest of the day is execution. Sonnet 4.6, by contrast, generates roughly 3× as many tokens to land on a worse decision.

Plans That Get Executed

A useful test of whether an agentic model is actually operating versus narrating is whether the plan it wrote yesterday matches the action it took today. DeepSeek V4 Pro passes this test more cleanly than any model we have tested.

Plan: Cinco de Mayo (Day 5)
  • Written: Day 4: cinco_de_mayo_plan = downtown_business | hire_margo_day4 | order_heavy_day4
  • Executed: Day 5: downtown_business, Margo hired Day 4, full menu, prices tuned slightly off-plan after seeing competitor data
  • Match: ✓ Executed, with a price-tune adjustment

Plan: Food Truck Rally (Day 22)
  • Written: Day 21: rally_day22_prep = event_venue | vendor_200 | 6.0x | capacity_390
  • Executed: Day 22: event_venue, 389 of 390 capacity served, $1,348 profit; 1,199 customers walked away unserved at 46-min wait
  • Match: ✓ Plan capped at capacity, not at planning

Plan: Wagyu launch (Days 28–29)
  • Written: Day 24: wagyu_plan = 40 wagyu patties exp Day29 | artisan_buns arrive Day27 | sell wagyu_smash @ $18
  • Executed: Days 28–29: wagyu_smash_burger added to menu at $19, ingredients pipelined across 5 calendar days, no expiry waste
  • Match: ✓ Multi-day pipeline executed end-to-end

Plan: BBQ Festival sweep (Days 26–28, best run)
  • Written: Day 25 cost-benefit: “need ~$9.08 avg to beat industrial zone’s ~$1,877”
  • Executed: Days 26–27–28 at waterfront_park: $2,359 / $2,359 / $2,213 profit (best 3 days of the run)
  • Match: ✓ Picked the harder location for the math

The pattern is consistent: write a plan with explicit numbers, execute against the plan the next day, adjust pricing/menu where new information arrives, do not abandon the plan just because the morning feels different. This is the agentic behavior most models in this benchmark fail to show reliably. Qwen 3.5 397B writes plans and ignores them. Sonnet 4.6 writes plans and second-guesses them. DeepSeek V4 Pro writes plans and runs them.

In Its Own Words

The best DeepSeek V4 Pro lines are not the dramatic ones. They are the ones that read like a calm operator’s journal entry — short, numerical, and oriented toward tomorrow.

Diagnosing Its Own Loss In One Sentence

Now I see the disaster clearly. industrial_zone has a 0.05 weekend multiplier — Saturdays/Sundays are dead there. Day 20 was Saturday. That explains the collapse from 540 demand to 31.
— DeepSeek V4 Pro, Median run, Day 21 morning

Both the median run and the best run fell into the same Day 20 weekend trap on industrial_zone, and both diagnosed it by the next morning. Neither run repeated the mistake later. That is the difference between a model that learns and a model that narrates.

The Honest Engineering Note

My inventory is actually much better than the scratchpad estimated… those were pre-delivery numbers.
— DeepSeek V4 Pro, Best run, Day 5

The model is correcting its own memory against fresh tool reads. This pattern repeats four-plus times across runs and is the single biggest behavioral difference versus Qwen 3.5 397B and GLM-5, which trusted their notes blindly.

When The Menu Is Broken, Take The Day Off

Operating today with just chicken_wings + drinks would be a disaster for quality and reviews. Decision: Take today off.
— DeepSeek V4 Pro, Best run, Day 24

Stockouts hit tortillas, buns, ketchup, and salsa simultaneously. Most models we have tested would either trim the menu and serve a degraded version, or improvise with whatever was in stock. DeepSeek V4 Pro chose the rest day. The best run finished at $35,488 net worth despite operating one fewer day.

Ingredient Math, Down To The Kilogram

BBQ sauce 2.76kg exactly matches 50 BBQ Beef + 42 Pulled Pork.
— DeepSeek V4 Pro, Median run, Day 29 — agent_memory key bbq_sauce_perfect_match

This is the kind of note a chef writes, not a model. The whole agent_memory dict is full of pipe-delimited shorthand at this resolution: ingredient quantities, expiry days, vendor fees, multipliers. The model is doing inventory engineering, not just ordering.

Lesson Written To Itself, Then Followed

Stockouts are profit killers. Always verify ALL recipe ingredients are in stock before setting menu. Never skip cheese_slices, onion, drink ingredients.
— DeepSeek V4 Pro, Best run, agent_memory key day9_lesson

Past Chinese models on this benchmark wrote sentences like this and then stocked out the next day anyway. DeepSeek V4 Pro’s best run logged 6 stockout events in 30 days — comparable to Gemma 4 31B and well below the previous Chinese cohort.
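The lesson amounts to a pre-flight check before the menu is set. A minimal Python sketch of such a check follows; the recipes, quantities and helper name are invented for illustration and do not come from the benchmark.

```python
def missing_ingredients(menu: list, recipes: dict, inventory: dict) -> dict:
    """Return {menu item: [ingredients short of one serving]} for items that would stock out."""
    problems = {}
    for item in menu:
        missing = [ingredient for ingredient, qty in recipes[item].items()
                   if inventory.get(ingredient, 0) < qty]
        if missing:
            problems[item] = missing
    return problems

# Hypothetical per-serving recipes and stock levels.
recipes = {
    "classic_burger": {"ground_beef": 0.15, "buns": 1, "cheese_slices": 1, "onion": 0.02},
    "lemonade": {"lemons": 0.1, "sugar": 0.02},
}
inventory = {"ground_beef": 6.0, "buns": 40, "onion": 1.0, "lemons": 2.0, "sugar": 1.0}

print(missing_ingredients(["classic_burger", "lemonade"], recipes, inventory))
# {'classic_burger': ['cheese_slices']}
```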

A One-Line Justification For A Hard Call

Festival pricing + 4x traffic + limited window. Let’s execute.
— DeepSeek V4 Pro, Best run, Day 25 morning

Right before this line is a full cost-benefit table comparing the festival vs the proven free Industrial Zone slot. The decision earned $2,359, $2,359, $2,213 over three days — the run’s top three. Compact reasoning, decisive output.

Where It Still Loses

Gemma 4 31B is still the cheapest survivor — but it is an anomaly

DeepSeek V4 Pro is the most cost-efficient frontier-quality model on the leaderboard. It is not the cheapest model that survives. Gemma 4 31B finishes with a median net worth of $24,878 at $0.20 per run — even after DeepSeek’s 75% promotional discount, Gemma is still about 4× cheaper. But Gemma is an open-weight 31B with extreme price-per-token economics ($0.13/M input, $0.38/M output) and a much wider outcome band: best run $29,089, worst run $11,148, with operational quality clearly behind DeepSeek V4 Pro on raw business metrics. Use Gemma when the run cost is the only variable you care about. Use DeepSeek V4 Pro when you want a real frontier-tier operator that does not blow up under variance.

Opus 4.6 still has the higher peak

Claude Opus 4.6’s best run finishes at $53,470 net worth at +2,573% ROI. DeepSeek V4 Pro’s best run finishes at $35,488. On peak performance, the frontier label is still earned. Where DeepSeek wins is consistency, price, and the absence of a $56-per-run tail in the cost distribution.

Late-run festival days blow up tool-call counts

Three of the five DeepSeek V4 Pro runs used 331–352 LLM calls. Two used 811–839. The split is not random: the high-call runs hit late-run festival days where set_menu rejected the model’s plan because of an ingredient shortage and the model retried. The median run logged 38 set_menu failures out of 67 attempts. Cached input keeps the dollar cost flat (~$3.51 vs $3.65 between low- and high-call runs), but the call ceiling is a real artifact and will show up as latency on a real workload.
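The gap between call growth and cost growth is easier to see as ratios. A quick Python calculation using only the figures quoted above; pairing the lowest high-call count with the highest low-call count is our choice, to give the most conservative call increase:

```python
failures, attempts = 38, 67        # set_menu calls in the median run
low_cost, high_cost = 3.51, 3.65   # approximate cost of low- vs high-call runs (USD)
low_calls, high_calls = (331, 352), (811, 839)

failure_rate = failures / attempts
cost_increase = high_cost / low_cost - 1
call_increase = high_calls[0] / low_calls[1] - 1  # smallest possible increase

print(f"{failure_rate:.0%}")   # 57%
print(f"{cost_increase:.0%}")  # 4%
print(f"{call_increase:.0%}")  # 130%
```

Calls more than double while the bill moves by a few percent, which is why the retry loop shows up as latency rather than cost.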

The festival-vs-grind tradeoff is unsettled

On Day 22, the median run committed to event_venue for the Food Truck Rally and made $1,348 profit while leaving 1,199 customers in a 46-minute queue. On the same day, the best run skipped the rally entirely for industrial_zone and made $1,795 profit because of a $0 vendor fee. Same model. Opposite call. Both correct in context. DeepSeek V4 Pro can do the math, but the math sometimes points two different directions, and the run goes whichever way the morning analysis lands. There is more variance to squeeze out here.

Verdict

DeepSeek V4 Pro is the best Chinese model on FoodTruck Bench, and the best price-per-result model in the frontier ROI tier. It does not have the highest peak — that is still Opus 4.6. It does not have the lowest cost — that is still Gemma 4 31B. What it has is the cleanest combination of survival, scale, consistency, and price of any model we have tested.

Concretely: it finishes 30 days at $26,000+ net worth on five runs out of five, never takes a loan, keeps average food waste under 1% of revenue (about 1.6% in the worst run), executes multi-day ingredient pipelines, and treats its own scratchpad as falsifiable data. It does all of that for $3.51 a run, within $0.06 of Grok 4.3 Latest, and at a fraction of the price of the Western frontier.

For agentic workloads at this difficulty band, DeepSeek V4 Pro is currently the model we would pick first. Grok 4.3 Latest matches its median; nothing else matches both its median and its floor at this price.

The Chinese cohort no longer has just a survivor. It has a model that competes at the frontier on the metric that matters most for an agentic deployment: the same number, every time, for a price that does not embarrass the bill. The lead the Western frontier used to enjoy on this benchmark, once roughly a year, has compressed to about ten weeks.

Methodology

  • Benchmark: FoodTruck Bench — a 30-day agentic business simulation. The model runs a food truck through 34 tools with persistent memory and a daily reflection loop, against realistic demand, weather, events, competition, inventory decay, loans, and staffing constraints. Fixed $2,000 starting balance
  • Conditions: identical setup across all runs and all models; same simulation rules, same starting city (Austin, TX), same starting balance, same toolset
  • Primary model: OpenRouter deepseek/deepseek-v4-pro
  • Runs analyzed: 5 completed DeepSeek V4 Pro runs (April 26-28, 2026); article follows the selected median run 20260427_112226 with frequent reference to the best run 20260428_005038
  • Comparison medians: Grok 4.3 Latest 20260502_101639, Grok 4.20 Reasoning 20260323_020434, Claude Opus 4.6 20260216_005712, GPT-5.2 20260214_164339, Claude Sonnet 4.6 20260222_154504, Gemini 3 Pro 20260214_110034, Gemma 4 31B 20260403_224358, Qwen 3.6 Plus 20260407_124248
  • Cost figures: dollar costs reflect actual OpenRouter / vendor billing for each run as recorded in run_meta.json at the time the run was executed. DeepSeek V4 Pro’s OpenRouter rates were lowered shortly after our run set (current promo: $0.435/M input, $0.87/M output) — the in-article cost is the higher historical billed cost, so live re-runs at today’s rates would land below it. Cost-efficiency is computed as median net worth ÷ median cost per run, both taken from the same model’s run set
  • Snapshot date: article reflects the local benchmark snapshot as of May 3, 2026