Typical Core LOC per Task: 10³ to 10⁴
Figure: Behavior Composition by Model. Each model bar is normalized to 100%, and color encodes behavior category.
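
The normalization described for the behavior chart can be sketched as follows. This is a minimal illustration, not the benchmark's own code: the function name `normalize_counts` and the category names and counts in `example_counts` are hypothetical, chosen only to show how raw action counts become shares that sum to 100%.

```python
def normalize_counts(counts: dict[str, int]) -> dict[str, float]:
    """Convert raw action counts per behavior category into percentage shares."""
    total = sum(counts.values())
    if total == 0:
        # No recorded actions: report 0% everywhere instead of dividing by zero.
        return {category: 0.0 for category in counts}
    return {category: 100.0 * n / total for category, n in counts.items()}


# Hypothetical raw action counts for one model (not taken from the benchmark).
example_counts = {"explore": 120, "edit": 60, "test": 20}
shares = normalize_counts(example_counts)

for category, share in shares.items():
    print(f"{category}: {share:.1f}% ({example_counts[category]} actions)")
```

Because every bar is rescaled to 100%, the chart compares how models divide their actions across categories, not how many actions they take in total.
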
| Model | Agent | Organization | Tasks Passed | Test Case Pass Rate | Total Cost | Total Time |
|---|---|---|---|---|---|---|
| GPT-5.3 Codex | Codex CLI | OpenAI | 19/22 | 95.6% | $213.07 | 24.8h |
| GPT-5.2 Codex | Codex CLI | OpenAI | 17/22 | 96.4% | $435.72 | 108.6h |
| Claude Opus 4.6 | Claude Code | Anthropic | 15/22 | 90.8% | $2055.81 | 76.4h |
| Claude Opus 4.5 | Claude Code | Anthropic | 10/22 | 81.7% | $507.94 | 26.8h |
| Gemini 3 Flash | Gemini CLI | Google | 2/6 | 49.8% | $31.61 | 1.5h |
| GLM-4.7 | Claude Code | Zhipu AI | 2/6 | 64.2% | $4.86 | 4.2h |
| Kimi K2.5 | Kimi Code CLI | Moonshot AI | 2/6 | 92.0% | N/A | 5.9h |
| DeepSeek V3.2 | Claude Code | DeepSeek | 1/6 | 16.7% | $4.12 | 20.2h |
| Claude Sonnet 4.5 | Claude Code | Anthropic | 0/6 | 76.1% | $40.67 | 1.9h |
| Gemini 3 Pro | Gemini CLI | Google | 0/6 | 16.5% | N/A | 1.8h |
| Qwen3 Max | Claude Code | Alibaba | 0/6 | 13.9% | $368.37 | 15.5h |
Easy Tier
| Model | Agent | Tasks Passed | Test Case Pass Rate | Avg Time | Avg LOC | Cost |
|---|---|---|---|---|---|---|
| Claude Opus 4.5 | Claude Code | 6/6 | 100.0% | 0.39h | 1092 | $56.69 |
| Claude Opus 4.6 | Claude Code | 6/6 | 100.0% | 0.45h | 1781 | $48.61 |
| Claude Sonnet 4.5 | Claude Code | 0/6 | 76.1% | 0.32h | 930 | $40.67 |
| DeepSeek V3.2 | Claude Code | 1/6 | 16.7% | 3.4h | 1070 | $4.12 |
| Gemini 3 Flash | Gemini CLI | 2/6 | 49.8% | 0.25h | 558 | $31.61 |
| Gemini 3 Pro | Gemini CLI | 0/6 | 16.5% | 0.30h | 710 | N/A |
| GLM-4.7 | Claude Code | 2/6 | 64.2% | 0.70h | 904 | $4.86 |
| GPT-5.2 Codex | Codex CLI | 6/6 | 100.0% | 0.81h | 1081 | $33.51 |
| GPT-5.3 Codex | Codex CLI | 6/6 | 100.0% | 0.28h | 1305 | $15.00 |
| Kimi K2.5 | Kimi Code CLI | 2/6 | 92.0% | 0.99h | 1163 | N/A |
| Qwen3 Max | Claude Code | 0/6 | 13.9% | 2.6h | 850 | $368.37 |
Medium Tier
| Model | Agent | Tasks Passed | Test Case Pass Rate | Avg Time | Avg LOC | Cost |
|---|---|---|---|---|---|---|
| Claude Opus 4.5 | Claude Code | 3/8 | 82.6% | 1.3h | 3304 | $208.43 |
| Claude Opus 4.6 | Claude Code | 5/8 | 93.6% | 3.5h | 4867 | $1183.94 |
| GPT-5.2 Codex | Codex CLI | 7/8 | 98.9% | 5.1h | 4702 | $287.17 |
| GPT-5.3 Codex | Codex CLI | 8/8 | 100.0% | 1.2h | 2575 | $114.14 |
Hard Tier
| Model | Agent | Tasks Passed | Test Case Pass Rate | Avg Time | Avg LOC | Cost |
|---|---|---|---|---|---|---|
| Claude Opus 4.5 | Claude Code | 1/8 | 67.0% | 1.7h | 6603 | $242.82 |
| Claude Opus 4.6 | Claude Code | 4/8 | 81.2% | 5.7h | 10103 | $823.26 |
| GPT-5.2 Codex | Codex CLI | 4/8 | 91.2% | 7.8h | 9034 | $115.04 |
| GPT-5.3 Codex | Codex CLI | 5/8 | 87.9% | 1.7h | 6255 | $83.94 |
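
For the four models that appear in all three tiers, the overall row looks like the sum of the per-tier rows. The sketch below checks that reading, assuming the tier "Cost" column is the total cost for that tier. All figures are copied from the tables; the names `TIER_RESULTS`, `REPORTED_OVERALL`, `aggregate`, and `matches_reported` are illustrative only.

```python
# Per-tier results from the tier tables, in Easy, Medium, Hard order:
# (tasks passed, tasks attempted, cost in USD).
TIER_RESULTS = {
    "GPT-5.3 Codex": [(6, 6, 15.00), (8, 8, 114.14), (5, 8, 83.94)],
    "GPT-5.2 Codex": [(6, 6, 33.51), (7, 8, 287.17), (4, 8, 115.04)],
    "Claude Opus 4.6": [(6, 6, 48.61), (5, 8, 1183.94), (4, 8, 823.26)],
    "Claude Opus 4.5": [(6, 6, 56.69), (3, 8, 208.43), (1, 8, 242.82)],
}

# Overall figures from the first table: (tasks passed, tasks attempted, total cost in USD).
REPORTED_OVERALL = {
    "GPT-5.3 Codex": (19, 22, 213.07),
    "GPT-5.2 Codex": (17, 22, 435.72),
    "Claude Opus 4.6": (15, 22, 2055.81),
    "Claude Opus 4.5": (10, 22, 507.94),
}


def aggregate(rows):
    """Sum passed tasks, attempted tasks, and cost across tiers."""
    passed = sum(row[0] for row in rows)
    attempted = sum(row[1] for row in rows)
    cost = round(sum(row[2] for row in rows), 2)
    return passed, attempted, cost


def matches_reported(model, tolerance=0.02):
    """Compare the summed tier rows with the reported overall row, allowing for rounding."""
    passed, attempted, cost = aggregate(TIER_RESULTS[model])
    rep_passed, rep_attempted, rep_cost = REPORTED_OVERALL[model]
    return (
        passed == rep_passed
        and attempted == rep_attempted
        and abs(cost - rep_cost) <= tolerance
    )


for model in TIER_RESULTS:
    print(model, aggregate(TIER_RESULTS[model]), matches_reported(model))
```

The task counts match exactly. The summed cost for GPT-5.3 Codex is one cent above the reported total, which is probably rounding in the per-tier figures, so the comparison allows a small tolerance.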