AI BENCHY Compare
Compared models
Last updated: 2026-03-21
| Metric | Grok 4.20 Beta (medium, released 2026-03-12) | Grok 4.20 Multi-Agent Beta (medium, released 2026-03-12) | Grok 4.1 Fast (medium, released 2025-11-19) | Gemini 3 Flash Preview (medium, released 2025-12-17) |
|---|---|---|---|---|
| Score | 7.9 | 6.2 | 6.9 | 10.0 |
| Rank | #23 | #47 | #38 | #1 |
| Consistency | 9.0 | 7.2 | 7.5 | 10.0 |
| Tests Correct | ||||
| Attempt pass rate | 72.6% | 54.9% | 66.7% | 100.0% |
| Flaky tests | 2 | 6 | 5 | 0 |
| Total Runs | 51 | 51 | 51 | 51 |
| Cost per result | 5.525 | 82.962 | 0.568 | 0.972 |
| Total Cost | $0.608 | $4.978 | $0.052 | $0.166 |
| Input Price | $2.000 / 1M | $2.000 / 1M | $0.200 / 1M | $0.500 / 1M |
| Output Price | $6.000 / 1M | $6.000 / 1M | $0.500 / 1M | $3.000 / 1M |
| Output Tokens | 1,487 | 298,948 | 1,189 | 1,640 |
| Reasoning Tokens | 87,922 | 296,529 | 84,595 | 48,270 |
| Response Time (avg) | 8.54s | 8.64s | 23.91s | 11.39s |
| Response Time (max) | 24.21s | 35.28s | 121.79s | 50.16s |
| Response Time (total) | 145.26s | 129.64s | 239.09s | 113.86s |
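
As a rough reading aid, the sketch below derives two figures from the table above. It assumes that "Attempt pass rate" is passed attempts divided by "Total Runs", which the page does not state explicitly. `score_per_dollar` is a hypothetical derived metric used only for illustration. The page does not report it.

```python
# Values copied from the table above; the derived figures are illustrative only.
models = {
    "Grok 4.20 Beta": {"score": 7.9, "pass_rate": 0.726, "runs": 51, "total_cost": 0.608},
    "Grok 4.20 Multi-Agent Beta": {"score": 6.2, "pass_rate": 0.549, "runs": 51, "total_cost": 4.978},
    "Grok 4.1 Fast": {"score": 6.9, "pass_rate": 0.667, "runs": 51, "total_cost": 0.052},
    "Gemini 3 Flash Preview": {"score": 10.0, "pass_rate": 1.0, "runs": 51, "total_cost": 0.166},
}


def derive(row):
    """Return (estimated passing attempts, score points per dollar) for one row."""
    # Assumption: pass rate = passed attempts / total runs.
    passed = round(row["pass_rate"] * row["runs"])
    # Hypothetical metric: benchmark score divided by total cost in dollars.
    score_per_dollar = row["score"] / row["total_cost"]
    return passed, score_per_dollar


for name, row in models.items():
    passed, spd = derive(row)
    print(f"{name}: ~{passed}/{row['runs']} attempts passed, {spd:.1f} score points per dollar")
```

Under that assumption the rounded pass rates map back to whole attempt counts out of 51 runs, which suggests the percentage is computed per attempt rather than per test.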
Charts (not reproduced here): Score vs Total Cost; Response Time (avg); Score vs Response Time (avg); Total Output Tokens; Score vs Total Output Tokens; Category Breakdown.