Current AI model rankings and latest top scores across SWE-Bench Verified, SWE-Bench Pro, Terminal-Bench 2.0 & Aider Polyglot — updated May 2026.
Source: swebench.com
Scores are self-reported by model providers unless noted. Scaffold/harness differences affect results.
4 Benchmarks · 109 Model Entries · Updated May 2026
SWE-Bench Verified Leaderboard: GPT-5.5 Takes #1
GPT-5.5 from OpenAI takes the new #1 spot on SWE-Bench Verified at 88.7% (OpenAI-reported, released April 23, 2026). Claude Opus 4.7 slips to #2 at 87.6% (April 16, 2026, 1M context), and GPT-5.3-Codex holds #3 at 85.0%. Below 81% the field is tight: Opus 4.5 (80.9%) and Opus 4.6 (80.8%) lead a cluster in which DeepSeek V4 Pro Max (80.6%, new open-weight 1.6T MoE) ties Gemini 3.1 Pro, and Kimi K2.6 (80.2%, new open-weight) ties MiniMax M2.5. Other April entrants: Mistral Medium 3.5 (77.6%), Muse Spark (77.4%), Qwen3.6-27B (77.2%).
Terminal-Bench 2.0: Codex CLI + GPT-5.5 Takes the Lead
Codex CLI + GPT-5.5 hits 82.0% on Terminal-Bench 2.0 (April 23, 2026), the new outright #1. ForgeCode + GPT-5.4 holds 81.8%. TongAgents + Gemini 3.1 Pro reaches 80.2%. ForgeCode + Claude Opus 4.6 was revised down to 79.8% on the latest tbench run. SageAgent + GPT-5.3-Codex and ForgeCode + Gemini 3.1 Pro both sit at 78.4%. Factory.ai's Droid + GPT-5.3-Codex follows at 77.3%. Anthropic self-reports Claude Opus 4.7 at 69.4%, still pending tbench.ai submission.
SWE-Bench Pro: Claude Opus 4.7 Leads at 64.3%
On SWE-Bench Pro, Claude Opus 4.7 leads at 64.3% (Anthropic-reported, April 2026 release). GPT-5.4 (xHigh) reaches 59.1% on Scale's SEAL mini-swe-agent scaffold. GPT-5.3-Codex (agent system) scores 56.8%, GPT-5.2-Codex 56.4%, and Muse Spark from Meta 55.0%. Claude Opus 4.6 scores 51.9% on the SEAL mini-swe-agent harness. Scale's fully standardized SEAL board puts Claude Opus 4.5 in the lead at 45.9%.
Open-Source Models on SWE-Bench 2026
DeepSeek V4 Pro Max leads open-weight models on SWE-Bench Verified at 80.6% (1.6T MoE, April 2026), tied with the closed-source Gemini 3.1 Pro and inside the top 10 overall. Kimi K2.6 (1T MoE) and MiniMax M2.5 follow at 80.2%. MiMo-V2-Pro from Xiaomi reaches 78.0% with 1T parameters, GLM-5 from Zhipu AI scores 77.8% with 744B parameters trained on Huawei chips, and Mistral Medium 3.5 posts 77.6%. Qwen3.6-27B reaches 77.2% (dense, Apache 2.0), Kimi K2.5 from Moonshot AI 76.8%, and GLM-4.7 73.8% (corrected upward from earlier reports). DeepSeek V3.2 hits 73.0%, and Qwen3-Coder-Next achieves 70.6% with only 3B active parameters.
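For readers who prefer the list as a table, here is a minimal Python sketch that sorts the open-weight entries quoted in this section. Scores and parameter counts are taken verbatim from the text; note that the MoE figures are total parameter counts, and only Qwen3-Coder-Next's 3B figure refers to active parameters.

```python
# Open-weight SWE-Bench Verified scores quoted in this section, sorted.
# Parameter counts are as stated in the text: MoE figures are totals,
# and only Qwen3-Coder-Next's 3B figure refers to *active* parameters.

open_weight = [
    ("DeepSeek V4 Pro Max", 80.6, "1.6T total (MoE)"),
    ("Kimi K2.6",           80.2, "1T total (MoE)"),
    ("MiniMax M2.5",        80.2, "n/a"),
    ("MiMo-V2-Pro",         78.0, "1T total"),
    ("GLM-5",               77.8, "744B total"),
    ("Mistral Medium 3.5",  77.6, "n/a"),
    ("Qwen3.6-27B",         77.2, "27B dense"),
    ("Kimi K2.5",           76.8, "n/a"),
    ("GLM-4.7",             73.8, "n/a"),
    ("DeepSeek V3.2",       73.0, "n/a"),
    ("Qwen3-Coder-Next",    70.6, "3B active"),
]

# Print in descending score order.
for name, score, params in sorted(open_weight, key=lambda row: -row[1]):
    print(f"{name:<20} {score:5.1f}%  {params}")
```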
Best AI Coding Model May 2026
The top of the board splits in May 2026: GPT-5.5 leads SWE-Bench Verified at 88.7%, while Claude Opus 4.7 (87.6% on Verified, #2) keeps #1 on SWE-Bench Pro at 64.3%. GPT-5.3-Codex follows at 85.0% on SWE-Bench Verified. Claude Sonnet 4.6 punches above its weight at 79.6%, only 1.2 points behind Opus 4.6 and 5x cheaper.
For terminal and DevOps workflows, Codex CLI + GPT-5.5 tops Terminal-Bench 2.0 at 82.0%, with ForgeCode + GPT-5.4 close behind at 81.8% and TongAgents + Gemini 3.1 Pro at 80.2%. On multi-language editing (Aider Polyglot), Claude Opus 4.5 leads at 89.4% (Anthropic-reported), with GPT-5 (high) at 88.0%.
Budget-conscious? DeepSeek V3.2-Exp delivers 74.2% on Aider Polyglot at $1.30/run — 22x cheaper than GPT-5. Qwen3-Coder-Next scores 70.6% on SWE-Bench Verified using only 3B active parameters, the most efficient model in the top 25.
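The cost multiples quoted above reduce to simple arithmetic. Below is a minimal sketch assuming the figures in this section: the $1.30 DeepSeek run cost and 22x ratio (which back out an implied GPT-5 Aider run cost of about $28.60), Sonnet 4.6's listed $3/$15 per million tokens, and, as an assumption consistent with the 5x claim, Opus pricing of $15/$75 per million tokens.

```python
# Illustrative cost arithmetic for the multiples quoted above.
# Assumptions (not from the leaderboard itself): the GPT-5 run cost is
# back-derived from the stated 22x ratio, and the Opus per-token prices
# ($15 in / $75 out per million tokens) are assumed from the "5x cheaper"
# claim for Sonnet 4.6 ($3/$15).

deepseek_run = 1.30               # USD per Aider Polyglot run (from the article)
gpt5_run = deepseek_run * 22      # implied: ~$28.60 per run

def blended_price(input_price: float, output_price: float,
                  input_share: float = 0.75) -> float:
    """Blended $/M tokens for a workload that is `input_share` input tokens."""
    return input_price * input_share + output_price * (1.0 - input_share)

sonnet = blended_price(3, 15)     # Sonnet 4.6 list price (from the article)
opus = blended_price(15, 75)      # assumed Opus list price

print(f"Implied GPT-5 cost per Aider run: ${gpt5_run:.2f}")
print(f"Blended $/M tokens: Sonnet {sonnet:.2f}, Opus {opus:.2f} "
      f"({opus / sonnet:.0f}x)")
```

The 75/25 input/output split is arbitrary; because the assumed Opus prices are exactly 5x Sonnet's on both sides, the ratio comes out to 5x for any split.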
Frequently Asked Questions
What are the top AI models on SWE-Bench Verified in 2026?
GPT-5.5 from OpenAI takes the new #1 spot at 88.7% (OpenAI-reported, released April 23, 2026). Claude Opus 4.7 from Anthropic drops to #2 at 87.6% (April 16, 2026, 1M context). GPT-5.3-Codex follows at 85.0%. Next: Claude Opus 4.5 at 80.9%, Claude Opus 4.6 at 80.8%, DeepSeek V4 Pro Max at 80.6% (new open-weight 1.6T MoE), Gemini 3.1 Pro at 80.6%, Kimi K2.6 at 80.2% (new open-weight), MiniMax M2.5 at 80.2%, and GPT-5.2 at 80.0%.
How good is GPT-5.5 at coding?
GPT-5.5 from OpenAI scores 88.7% on SWE-Bench Verified (OpenAI-reported, released April 23, 2026), making it the new public leader, surpassing Claude Opus 4.7 by 1.1 points. On Terminal-Bench 2.0, Codex CLI + GPT-5.5 takes #1 at 82.0%. SWE-Bench Pro and Aider Polyglot scores are not yet published.
How good is Claude Opus 4.7 at coding?
Claude Opus 4.7 scores 87.6% on SWE-Bench Verified (now #2 behind GPT-5.5 at 88.7%), 64.3% on SWE-Bench Pro (still #1, Anthropic-reported), and Anthropic reports 69.4% on Terminal-Bench 2.0 (not yet on the public tbench.ai board). Released April 16, 2026 with 1M token context.
How good is Claude Opus 4.6 at coding?
Claude Opus 4.6 scores 80.8% on SWE-Bench Verified and 51.9% on SWE-Bench Pro (Scale SEAL mini-swe-agent). On Terminal-Bench 2.0, Opus 4.6 reaches 74.7% with the Terminus-KIRA scaffold, and ForgeCode + Opus 4.6 reaches 79.8% after being revised down from its earlier 81.8% run. Released January 2026.
How good is Gemini 3.1 Pro at coding?
Gemini 3.1 Pro from Google DeepMind scores 80.6% on SWE-Bench Verified as of February 2026. On Terminal-Bench 2.0, TongAgents + Gemini 3.1 Pro reaches 80.2% and ForgeCode + Gemini 3.1 Pro reaches 78.4%. On SWE-Bench Pro (Scale SEAL mini-swe-agent), it scores 46.1%.
How good is Grok 4 at coding?
xAI self-reports 72-75% for Grok 4 on SWE-Bench Verified. Independent testing by vals.ai with the SWE-agent scaffold shows 58.6%, a significant gap that highlights how much scaffold choice affects results. On Aider Polyglot, Grok 4 scores 79.6%. xAI has since released Grok 4.20, now its current flagship.
Is Claude Sonnet 4.6 worth it for coding?
Claude Sonnet 4.6 scores 79.6% on SWE-Bench Verified, only 1.2 points behind Opus 4.6 and 2.4 points ahead of Sonnet 4.5. At $3/$15 per million tokens, five times cheaper than Opus, it offers strong cost-efficiency for coding tasks.
Which model leads SWE-Bench Pro?
Claude Opus 4.7 leads SWE-Bench Pro at 64.3% (Anthropic-reported, April 2026 release). GPT-5.4 (xHigh) scores 59.1% on Scale SEAL mini-swe-agent. Agent-system scores: GPT-5.3-Codex CLI at 56.8%, GPT-5.2-Codex at 56.4%, GPT-5.2 at 55.6%. Muse Spark from Meta reaches 55.0%. On Scale SEAL standardized scaffolding, Claude Opus 4.5 leads at 45.9%.
What is the best AI coding model right now?
GPT-5.5 leads SWE-Bench Verified at 88.7% (OpenAI-reported) as of May 2026. Claude Opus 4.7 follows at 87.6% and still leads SWE-Bench Pro at 64.3% (Anthropic-reported). GPT-5.3-Codex reaches 85.0% on SWE-Bench Verified. On Terminal-Bench 2.0, Codex CLI + GPT-5.5 takes #1 at 82.0% and ForgeCode + GPT-5.4 follows at 81.8%. On Aider Polyglot, Claude Opus 4.5 leads at 89.4% (Anthropic-reported). DeepSeek V3.2-Exp offers the best cost-efficiency at $1.30 per run.
How much have AI coding benchmark scores improved?
The top score jumped from around 65% in early 2025 to 88.7% in May 2026 with GPT-5.5 (OpenAI, April 23, 2026). Claude Opus 4.7 is #2 at 87.6% and GPT-5.3-Codex #3 at 85.0%. Gemini 3.1 Pro and DeepSeek V4 Pro Max tie at 80.6%. New April 2026 entrants: GPT-5.5 (88.7%), Claude Opus 4.7 (87.6%), DeepSeek V4 Pro Max (80.6%, open-weight), Kimi K2.6 (80.2%, open-weight), Qwen3.6 Plus (78.8%), Mistral Medium 3.5 (77.6%), Muse Spark (77.4%), Qwen3.6-27B (77.2%). Agent frameworks outperform raw model scores by 5-15 points.
How often is this leaderboard updated?
This leaderboard is updated monthly with the latest benchmark scores from SWE-Bench Verified, Terminal-Bench 2.0, Aider Polyglot, and SWE-Bench Pro. Scores are self-reported by model providers unless noted, and scaffold/harness differences affect results.
Which AI model should I use for which coding task?
For pure code generation, GPT-5.5 leads SWE-Bench Verified at 88.7%; Claude Opus 4.7 leads SWE-Bench Pro at 64.3%. For terminal and DevOps workflows, Codex CLI + GPT-5.5 tops Terminal-Bench 2.0 at 82.0%. For multi-language editing, Claude Opus 4.5 leads Aider Polyglot at 89.4%. For cost-efficiency, DeepSeek V3.2-Exp delivers 74.2% on Aider Polyglot at just $1.30 per run.
What is the best open-source AI coding model?
DeepSeek V4 Pro Max leads open-weight on SWE-Bench Verified at 80.6% (1.6T MoE, April 2026), tied with the closed-source Gemini 3.1 Pro. Kimi K2.6 follows at 80.2% (April 2026, 1T MoE), tied with MiniMax M2.5. MiMo-V2-Pro from Xiaomi reaches 78.0% with 1T parameters. GLM-5 from Zhipu AI: 77.8% (744B). Mistral Medium 3.5: 77.6% (April 2026). Qwen3.6-27B: 77.2% (April 2026, dense, Apache 2.0). Kimi K2.5: 76.8%. GLM-4.7: 73.8%. DeepSeek V3.2: 73.0%. Qwen3-Coder-Next: 70.6% with only 3B active parameters.
Which agent and model combination leads Terminal-Bench 2.0?
Codex CLI + GPT-5.5 is the new outright #1 at 82.0% (April 23, 2026, OpenAI). ForgeCode + GPT-5.4 holds 81.8%. TongAgents + Gemini 3.1 Pro reaches 80.2%. ForgeCode + Claude Opus 4.6 was revised down to 79.8% on the latest tbench run. SageAgent + GPT-5.3-Codex and ForgeCode + Gemini 3.1 Pro both hit 78.4%. Droid + GPT-5.3-Codex from Factory.ai scores 77.3%. Anthropic reports Claude Opus 4.7 at 69.4% (not yet on the public tbench.ai board).