Research
Edwin Ong & Alex Vikati · March 2026
What Codex Actually Chooses
(vs Claude Code)
We gave two flagship AI coding agents the same prompts across the same repos — 1,470 successful responses, yielding 1,452 analyzable tool picks. How does your AI coding agent shape the stack you build?
12 categories · 5 repos · 3 runs each
Claude Code v2.1.78 running Opus 4.6 · OpenAI Codex CLI 0.114.0 running GPT-5.3
The big finding: the agents agree on the top pick in 7 of 12 categories, and in 6 of those 7 the shared pick is Custom/DIY. The lone exception: both pick Grafana for log aggregation.
Key signals: a Statsig gap (Codex 27% vs Claude 0%), a Bun gap (Claude 63% vs Codex 13%), and divergent platform leanings, with Codex favoring Cloudflare-branded tools and Claude favoring Vercel.
1,470 total responses (735 + 735)
2 agents: Codex CLI 0.114.0 / GPT-5.3 · Claude Code v2.1.78 / Opus 4.6
7/12 agreement on the top pick (6 of 7 on Custom/DIY)
1,452 analyzable picks (Codex 729 / Claude 723)
These 12 categories are intentionally different from our original 20-category study. The original focused on full-stack infrastructure (CI/CD, payments, auth, ORM). This comparison targets categories where tool choice is more contested — areas like search, secrets, rate limiting, and edge compute where both agents have diverse opinions and the winner isn't obvious.
Repos Used
- nextjs-saas: Next.js 14, TypeScript
- python-api: FastAPI, Python 3.11
- react-spa: Vite, React 18, TS
- go-microservice: Go 1.22, Chi
- ruby-rails-app: Rails 7, Ruby 3.3
The repo a prompt runs against shapes the recommendation. A Next.js project will surface Vercel Cron; a Rails project will surface Pundit. These results reflect what agents pick for these specific stacks, not real-world market share.
Head-to-Head: 12 Categories
Same prompts, same repos. The top pick each agent chose per category.
Legend: agree on top pick · different top pick
Headline Findings
The Divergent Stack
5 categories where they disagree
JS runtime, search, SMS/push, scheduled tasks, and edge compute are where the default recommendation changes most clearly by agent. Codex's pick is listed first, Claude's second.
- JS Runtime & Toolchain: Node.js vs Bun
- Search: Custom/DIY vs PostgreSQL FTS
- SMS & Push Notifications: Custom/DIY vs Twilio
- Scheduled Tasks / Cron: cron (OS) vs APScheduler / Vercel Cron
- Edge & Serverless Compute: Cloudflare Workers vs Vercel Edge
The Ownership Question
Statsig: Codex 27% vs Claude 0% · Bun: Claude 63% vs Codex 13%
The acquired-tool gaps are clear in this benchmark: Codex recommends Statsig while Claude does not, and Claude recommends Bun far more often than Codex.
Correlation, not causation: These gaps show alignment between an agent and its parent company's acquired tools — but the causation arrow could point the other way. Bun and Statsig may have been acquisition targets precisely because they were best-in-class products, and the agents are simply reflecting that quality. We show the pattern because it's notable; we don't claim it's intentional.
Platform Preferences
Cloudflare vs Vercel
In selected Cloudflare/Vercel brand-family counts, Codex leans toward Cloudflare while Claude leans toward Vercel.
Codex → Cloudflare Workers: 47 Cloudflare picks across the study (Claude: 9 picks)
Claude → Vercel Edge: 29 Vercel picks across the study (Codex: 17 picks)
The Ownership Question
Statsig and Bun are the clearest company-linked tools in the dataset. The data shows pick-rate gaps and conversion gaps; it does not identify the cause.
Statsig
OpenAI acquisition · Feature Flags
Ownership signal
| Agent | Primary | Mentioned | Responses |
|---|---|---|---|
| Codex | 27% (20) | 41% (31) | 75 |
| Claude Code | 0% (0) | 28% (21) | 75 |
Codex picks Statsig as primary 27% of the time. Claude picks it zero times out of 75 responses, yet mentions it in 28% of them, so the gap is not a simple awareness gap.
Bun
Anthropic acquisition · JS Runtime
Ownership signal
| Agent | Primary | Mentioned | Responses |
|---|---|---|---|
| Codex | 13% (4) | 73% (22) | 30 |
| Claude Code | 63% (19) | 97% (29) | 30 |
Claude recommends Bun at 63% — ~5× Codex's 13%. This is the largest acquired-tool gap in the study.
Both Agents Know These Tools Exist
These acquired-tool gaps are not just about awareness. Both agents mention the other company's tool; the difference is how often that mention becomes the primary recommendation.
| Tool | Agent | Mention % | Primary % | Conversion |
|---|---|---|---|---|
| Statsig | Codex | 41% | 27% | 64.5% |
| Statsig | Claude | 28% | 0% | 0% |
| Bun | Claude | 97% | 63% | 65.5% |
| Bun | Codex | 73% | 13% | 18.2% |
Claude mentions Statsig in 28% of feature flag responses but never recommends it as primary. Codex lists Bun as an option in 73% of JS runtime responses but rarely promotes it to #1. The safest conclusion is descriptive: conversion differs much more than awareness does.
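Since the conversion column is the crux here, it helps to see the arithmetic: conversion divides the primary-pick count by the mention count, using the raw counts from the two tables above rather than the rounded percentages. A minimal sketch in Python (counts copied from the tables; the `counts` dict layout is our own):

```python
# Conversion = primary picks / mentions, per tool and agent, from raw counts.
counts = {
    # tool: {agent: (primary_count, mention_count)} -- from the tables above
    "Statsig": {"Codex": (20, 31), "Claude": (0, 21)},
    "Bun":     {"Codex": (4, 22),  "Claude": (19, 29)},
}

for tool, by_agent in counts.items():
    for agent, (primary, mentioned) in by_agent.items():
        conversion = primary / mentioned if mentioned else 0.0
        print(f"{tool:8} {agent:7} {primary}/{mentioned} = {conversion:.1%}")

# Statsig  Codex   20/31 = 64.5%
# Statsig  Claude  0/21 = 0.0%
# Bun      Codex   4/22 = 18.2%
# Bun      Claude  19/29 = 65.5%
```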
Platform Preferences: Cloudflare vs Vercel
Beyond acquired tools, each agent leans toward a different cloud platform when recommending infrastructure. These are selected brand-family counts, not a full platform market share — but the directional preference is consistent across categories.
Codex → Cloudflare (47 picks across categories)
- Edge/Serverless: Cloudflare Workers
- Image & Media: Cloudflare Images
Claude → Vercel (29 picks across categories)
- Edge/Serverless: Vercel Edge
- Scheduled Tasks: Vercel Cron
Codex picks Cloudflare-branded tools 47 times across the study; Claude picks them 9 times. Claude picks Vercel-branded tools 29 times; Codex picks them 17 times. These are selected brand-family sums — not a complete platform accounting — but the directional lean is consistent across the categories where both brands appear.
Selected Codex-Leaning Checks
Acquired tool plus selected cloud-service rows
In this selected set, all four rows lean toward Codex. Statsig is the clearest company-linked example; the cloud rows are descriptive patterns rather than ownership claims.
Selected Claude-Leaning Checks
Acquired tool, web-ecosystem rows, and open-source controls
Two of the seven rows clear the 10-percentage-point threshold for Claude alignment: Bun (+50pp) and Vercel Edge (+17pp). The two open-source controls (PostgreSQL FTS, Meilisearch) are excluded from alignment labeling because they have no corporate tie. The remaining rows are neutral.
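The labeling rule is mechanical, so it is easy to state precisely. A minimal sketch, assuming our own helper name `alignment` (the threshold is the one stated above; the function itself is illustrative):

```python
# Alignment labeling rule from the text: a row leans toward an agent only
# when the primary-pick-rate gap is at least 10 percentage points.
# Open-source controls with no corporate tie are excluded before labeling.
THRESHOLD_PP = 10

def alignment(codex_pct: float, claude_pct: float) -> str:
    gap = claude_pct - codex_pct  # positive = Claude-leaning, in pp
    if abs(gap) < THRESHOLD_PP:
        return "neutral"
    return "claude-leaning" if gap > 0 else "codex-leaning"

print(alignment(13, 63))  # Bun: +50pp -> "claude-leaning"
print(alignment(30, 35))  # +5pp -> "neutral" (hypothetical rates)
```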
OpenAI announced plans to acquire Astral (makers of Ruff and uv) on March 19, 2026. We ran a dedicated Python tooling benchmark to measure pick-rate gaps for those tools — read the full Astral analysis.
All 12 Categories
Expand any category to see the full side-by-side breakdown with every tool both agents considered.
Up-and-Comers Worth Watching
Beyond category winners, several startup tools appear meaningfully in recommendations. Some show up in both agents; others are championed by only one. Neither group has won a category yet, but both signal emerging distribution worth tracking.
Cross-Agent Picks
- Doppler (Secret Management): strongest startup signal; near-identical rates from both agents
- Upstash (Rate Limiting): quiet but consistent serverless Redis alternative
- Meilisearch (Search): modern search engine; Claude's preferred startup pick
- Axiom (Log Aggregation): modern logging challenger both agents notice
Agent-Split Picks
- Typesense (Codex): Codex's search startup pick, mirroring Claude's Meilisearch
- OneSignal (Codex): Codex's notification startup default
- Fly.io (Claude): Claude's app platform preference for edge compute
- Storyblok (Codex): Codex's CMS pick when it doesn't build from scratch
- Unleash (Claude): Claude's open-source feature flag pick
- Infisical (Codex): Codex's emerging open-source secrets pick
Always in the Conversation
These established tools earn consistent recommendations from both agents but never land the #1 spot in their category.
- HashiCorp Vault (Secret Management): 3 points behind the winner; both agents know it, neither leads with it
- Redis (Rate Limiting): near-identical rates from both agents as a runner-up
- Contentful (Headless CMS): legacy CMS leader, consistently second to Sanity
- Pundit (RBAC): Ruby-native authorization; strong in Rails, absent elsewhere
- Firebase Cloud Messaging (SMS & Push): both agents mention FCM but lead with Twilio or OneSignal
- Algolia (Search): Codex-only runner-up; Claude never picks it as primary
Search split: Meilisearch vs Typesense is another agent-split pick — Claude favors Meilisearch (19%), Codex favors Typesense (19%). Doppler is the strongest cross-agent startup signal at ~20% from both agents.
Build vs Buy
Custom/DIY rate by category, sorted by absolute delta. Overall rates are similar (Claude 33% vs Codex 28%), but category-level variance is high.
Codex overall DIY: 28% · Claude overall DIY: 33%
| Category | Codex Custom/DIY | Claude Custom/DIY | Delta |
|---|---|---|---|
| RBAC / Authorization | 55% | 81% | -26pp |
| Log Aggregation | 0% | 17% | -17pp |
| SMS & Push Notifications | 27% | 16% | +11pp |
| Edge & Serverless Compute | 24% | 13% | +11pp |
| Headless CMS | 24% | 33% | -9pp |
| Image & Media Processing | 27% | 35% | -8pp |
| Secret Management | 31% | 36% | -5pp |
| Search | 31% | 35% | -4pp |
| Scheduled Tasks / Cron | 12% | 15% | -3pp |
| Feature Flags & Experimentation | 40% | 41% | -1pp |
| Rate Limiting | 32% | 33% | -1pp |
Positive delta means Codex builds custom more often. Negative means Claude does. Categories with 0% on both sides are excluded.
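The delta column follows one rule: Codex's Custom/DIY rate minus Claude's, with rows ordered by absolute delta. A short sketch that reproduces the table's ordering from its own numbers (the `rates` dict is just the table re-keyed):

```python
# Delta = Codex Custom/DIY % minus Claude Custom/DIY %, in percentage points.
rates = {  # category: (codex_pct, claude_pct)
    "RBAC / Authorization": (55, 81),
    "Log Aggregation": (0, 17),
    "SMS & Push Notifications": (27, 16),
    "Edge & Serverless Compute": (24, 13),
    "Headless CMS": (24, 33),
    "Image & Media Processing": (27, 35),
    "Secret Management": (31, 36),
    "Search": (31, 35),
    "Scheduled Tasks / Cron": (12, 15),
    "Feature Flags & Experimentation": (40, 41),
    "Rate Limiting": (32, 33),
}

# Sort by absolute delta, largest first, matching the table.
for cat, (codex, claude) in sorted(rates.items(),
                                   key=lambda kv: -abs(kv[1][0] - kv[1][1])):
    print(f"{cat:34} {codex - claude:+d}pp")
```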
Methodology
How we ran the comparison: same prompts, same repos, independent agents, structured extraction.
Agents
Claude Code: Opus 4.6, v2.1.78
OpenAI Codex: GPT-5.3, codex-cli 0.114.0
Study Design
- 12 categories, 5 prompts each
- 5 repos (4 stacks + Rails)
- 3 independent runs per combo
- Structured tool extraction
Scale
- 1,470 total responses
- 735 per agent
- Git-reset between prompts
- Worktree isolation per run
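The worktree isolation and git-reset steps in the list above are the load-bearing part of the protocol. A minimal sketch of how such a harness could look, assuming hypothetical wiring: the two CLI invocations (`codex exec`, `claude -p`) are the agents' real non-interactive modes, but the function name, cleanup order, and everything else here is illustrative, not the study's published harness.

```python
# Sketch of a run harness: one detached git worktree per run, reset and
# removed afterward so no prompt's edits leak into the next (hypothetical
# wiring; the study's actual harness is not published).
import subprocess
import tempfile
from pathlib import Path

AGENTS = {
    "codex": ["codex", "exec"],   # Codex CLI 0.114.0 / GPT-5.3
    "claude": ["claude", "-p"],   # Claude Code v2.1.78 / Opus 4.6
}

def run_prompt(repo: Path, agent: str, prompt: str, run_id: int) -> str:
    """Run one prompt in an isolated worktree and return the transcript."""
    with tempfile.TemporaryDirectory() as tmp:
        worktree = Path(tmp) / f"{agent}-run{run_id}"
        subprocess.run(
            ["git", "-C", str(repo), "worktree", "add", "--detach", str(worktree)],
            check=True,
        )
        try:
            result = subprocess.run(
                AGENTS[agent] + [prompt],
                cwd=worktree, capture_output=True, text=True, check=True,
            )
            return result.stdout
        finally:
            # Git-reset between prompts, then drop the worktree entirely.
            subprocess.run(["git", "-C", str(worktree), "reset", "--hard"], check=True)
            subprocess.run(["git", "-C", str(worktree), "clean", "-fd"], check=True)
            subprocess.run(
                ["git", "-C", str(repo), "worktree", "remove", "--force", str(worktree)],
                check=True,
            )
```

Per the Study Design list, each prompt-repo combination gets three such independent runs per agent.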
Repos Used
- nextjs-saas: Next.js 14, TypeScript (full-stack SaaS)
- python-api: FastAPI, Python 3.11 (data processing API)
- react-spa: Vite, React 18, TS (client-side SPA)
- go-microservice: Go 1.22, Chi (payment microservice)
- ruby-rails-app: Rails 7, Ruby 3.3 (team collaboration)
For devtool companies
We run these benchmarks for individual companies too
Private dashboards showing how AI agents recommend your tool vs. competitors, across real codebases. See exactly where you win and where you lose.
Get your benchmark · Get notified when new benchmarks drop.
Explore the original study
This comparison builds on our original 2,430-response Claude Code study across 20 categories and 3 models. Dive into the full dataset.