Amplifying — AI Benchmark Research


Featured Study

Edwin Ong & Alex Vikati · Feb 2026 · claude-code v2.1.39

What Claude Code Actually Chooses

We pointed Claude Code at real repos 2,430 times and watched what it chose. No tool names in any prompt. Open-ended questions only.

3 models · 4 project types · 20 tool categories · 85.3% extraction rate

Update: Sonnet 4.6 was released on Feb 17, 2026. We'll run the benchmark against it and update results soon.

The big finding: Claude Code builds, not buys. Custom/DIY is the most common single label extracted, appearing in 12 of 20 categories (though it spans categories while individual tools are category-specific). When asked “add feature flags,” it builds a config system with env vars and percentage-based rollout instead of recommending LaunchDarkly. When asked “add auth” in Python, it writes JWT + bcrypt from scratch. When it does pick a tool, it picks decisively: GitHub Actions 94%, Stripe 91%, shadcn/ui 90%.
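For flavor, the env-var-plus-percentage-rollout flag system described above looks roughly like this sketch. The FLAG_ naming convention and the hash-bucketing scheme are illustrative assumptions, not code extracted from the benchmark:

```python
import hashlib
import os

def flag_enabled(flag: str, user_id: str) -> bool:
    """DIY feature flag driven by environment variables.

    FLAG_<NAME>=on|off is a hard toggle; FLAG_<NAME>=<0-100> is a
    percentage rollout, bucketed by a stable hash of the user id so
    the same user always sees the same variant.
    """
    raw = os.environ.get(f"FLAG_{flag.upper()}", "off")
    if raw in ("on", "off"):
        return raw == "on"
    # Percentage rollout: hash flag+user into one of 100 buckets.
    bucket = int(hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < int(raw)
```

No dashboard, no SDK, no third-party service: exactly the build-over-buy trade the study observed.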

2,430 responses (3 models · 4 repos · 3 runs each)

3 models (Sonnet 4.5, Opus 4.5, Opus 4.6)

20 categories (CI/CD to Real-time)

85.3% extraction rate (2,073 parseable picks)

90% model agreement (18 of 20 within-ecosystem)

Headline Findings

Build vs Buy

In 12 of 20 categories, Claude Code builds custom solutions rather than recommending tools. 252 total Custom/DIY picks, more than any individual tool. E.g., feature flags via config files + env vars, Python auth via JWT + passlib, caching via in-memory TTL wrappers.
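As an illustration of the "in-memory TTL wrappers" pattern mentioned above, here is a minimal sketch. The decorator name and eviction behavior are assumptions for illustration, not code from the study:

```python
import functools
import time

def ttl_cache(seconds: float):
    """Memoize a function on positional args, expiring entries after `seconds`."""
    def decorator(fn):
        store: dict = {}  # args -> (expires_at, value)

        @functools.wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]  # fresh cache hit
            value = fn(*args)
            store[args] = (now + seconds, value)
            return value
        return wrapper
    return decorator
```

A dozen lines instead of a Redis dependency, which is the trade-off the build-over-buy pattern keeps making.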

Authentication (Python): 100%

Authentication (overall): 48%

The Default Stack

When Claude Code picks a tool, it shapes what a large and growing number of apps get built with. These are the tools it recommends by default:

Mostly JS-ecosystem. See report for per-ecosystem breakdowns.

6. Zustand (Strong Default · State Management): 64.8% (57/88 picks)

7. Sentry (Strong Default · Observability): 63.1% (101/160 picks)

Against the Grain

Tools with large market share that Claude Code barely touches, and sharp generational shifts between models.

State Management

0 primary, but 23 mentions. Zustand picked 57x instead

API Layer

Absent entirely. Framework-native routing preferred

Testing

Only 4% primary, but 31 alt picks. Known but not chosen

Package Manager

1 primary, but 51 alt picks. Still well-known

The Recency Gradient

Newer models tend to pick newer tools. Within-ecosystem percentages shown. Each card tracks the two main tools in a race; remaining picks go to Custom/DIY or other tools.

Prisma (ORM) · JS

Replaced by: Drizzle (21% → 100%)

Within JS ORM picks only

Celery (jobs) · Python

Replaced by: FastAPI BackgroundTasks (0% → 44%), rest Custom/DIY or non-extraction

Within Python job picks only (61% extraction rate). Custom/DIY = asyncio tasks, no external queue
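The "asyncio tasks, no external queue" alternative to Celery typically looks like this sketch. The worker/job names are illustrative assumptions, not code from the benchmark:

```python
import asyncio

async def worker(queue: asyncio.Queue):
    """Drain jobs from an in-process queue: no broker, no Celery, no Redis."""
    while True:
        job = await queue.get()
        try:
            await job()
        finally:
            queue.task_done()

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    results = []

    async def send_email():  # stand-in for a real background job
        results.append("sent")

    # Two concurrent workers pulling from the same queue.
    workers = [asyncio.create_task(worker(queue)) for _ in range(2)]
    for _ in range(3):
        queue.put_nowait(send_email)
    await queue.join()  # block until every enqueued job has finished
    for w in workers:
        w.cancel()
    return results
```

The trade-off is the usual one: jobs die with the process and don't survive restarts, which an external queue like Celery would handle.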

Redis (caching) · Python

Replaced by: Custom/DIY (0% → 50%), rest other tools

Within Python caching picks only

The Deployment Split

Deployment is fully stack-determined: Vercel for JS, Railway for Python. Traditional cloud providers got zero primary picks.

JS

Frontend (Next.js + React SPA)

86 of 86 frontend deployment picks. No runner-up.

PY

Backend (Python / FastAPI)

What you'd expect: AWS, GCP, Azure

What you get: Railway at 82%

Zero primary picks across all 112 deployment responses:

Never the primary choice, but some are frequently recommended as alternatives.

Frequently recommended as alternatives

Netlify (67 alt) · Cloudflare Pages (30 alt) · GitHub Pages (26 alt) · DigitalOcean (7 alt)

Mentioned but never recommended (0 alt picks)

AWS Amplify (24 mentions) · Firebase Hosting (7 mentions) · AWS App Runner (5 mentions)

Example: "Where should I deploy this?" (Next.js SaaS, Opus 4.5)

Vercel (Recommended) — Built by the creators of Next.js. Zero-config deployment, automatic preview deployments, edge functions. vercel deploy

Netlify — Great alternative with similar features. Good free tier.

AWS Amplify — Good if you're already in the AWS ecosystem.

Vercel gets install commands and reasoning. AWS Amplify gets a one-liner.

Truly invisible (rarely even mentioned)

AWS (EC2/ECS) · Google Cloud · Azure · Heroku

Where Models Disagree

All three models agree in 18 of 20 categories within each ecosystem. These 5 categories have genuine within-ecosystem shifts or cross-language disagreement.

| Category | Scope | Sonnet 4.5 | Opus 4.5 | Opus 4.6 |
|---|---|---|---|---|
| ORM (JS) | JS · Next.js project. The strongest recency shift in the dataset. | Prisma 79% | Drizzle 60% | Drizzle 100% |
| Jobs (JS) | JS · Next.js project. BullMQ → Inngest shift in newest model. | BullMQ 50% | BullMQ 56% | Inngest 50% |
| Jobs (Python) | Python · Python API project (61% extraction rate). Celery collapses in newer models. | Celery 100% | FastAPI BgTasks 38% | FastAPI BgTasks 44% |
| Caching | Cross-language (Redis and Custom/DIY appear in both JS and Python) | Redis 71% | Redis 31% | Custom/DIY 32% |
| Real-time | Cross-language (SSE, Socket.IO, and Custom/DIY appear across stacks) | SSE 23% | Custom/DIY 19% | Custom/DIY 20% |

Dig into the data

Category deep-dives, phrasing stability analysis, cross-repo consistency data, and market implications.