Android Bench

Several benchmarks have emerged to measure how well LLMs perform at AI-assisted software engineering. Android developers face specific challenges that existing benchmarks don't cover, so we created one with a single guiding goal: high-quality Android development.

Column definitions:
Score (%): average percentage of 100 test cases successfully resolved across 10 runs for each model.
CI range (%): expected performance range, reflecting the results' statistical reliability (p-value < 0.05).
Avg latency (h): average time taken to solve 100 tasks across 10 runs.
Avg cost ($): average cost per full benchmark run.

Model	Score (%)	CI range (%)	Avg latency (h)	Avg cost ($)
Claude Fable 5	84.5	79.9 — 88.8	8.0	$133.2
GPT 5.5	80.2	73.5 — 86.6	11.4	$138.3
Claude Sonnet 5	76.2	69.0 — 82.1	12.3	$99.9
GPT 5.4	74.1	66.0 — 80.9	8.4	$83.4
Gemini 3.1 Pro Preview	73.7	66.1 — 80.4	10.6	$87.4
Claude Opus 4.8	72.4	65.8 — 79.3	6.7	$88.0
GLM 5.2	72.2	65.3 — 78.7	38.9	$117.0
Gemini 3.5 Flash	71.1	63.6 — 78.2	28.3	$165.6
Kimi K2.7 Code	70.4	63.2 — 77.0	31.8	$48.1
Claude Opus 4.7	68.7	60.9 — 76.4	7.0	$96.5
Kimi K2.6	67.6	60.2 — 74.3	57.2	$49.4
Claude Sonnet 4.6	67.0	58.3 — 75.4	16.9	$127.6
MiniMax M3	63.6	56.3 — 70.3	26.0	$41.7
GLM 5.1	63.2	56.0 — 71.3	17.6	$53.5
Gemini 3 Flash Preview	62.5	54.2 — 70.0	13.1	$30.1
MiMo-V2.5-Pro	60.8	53.1 — 68.3	13.6	$9.2
Deepseek V4 Pro	59.5	51.7 — 66.9	9.0	$3.7
Qwen 3.7 Plus	57.7	49.5 — 65.5	18.5	$18.6
Deepseek V4 Flash	54.7	46.6 — 62.8	8.9	$1.5
Qwen 3.7 Max	54.2	46.3 — 61.8	14.2	$58.3
Qwen 3.6 27B	45.1	38.2 — 53.0	25.8	$97.3
MiniMax M2.7	41.6	34.4 — 49.0	18.2	$14.9
Qwen 3.6 35B A3B	37.0	29.5 — 44.4	16.3	$17.8
Gemma 4 31B IT	36.3	29.3 — 43.2	38.9	$10.6
Gemma 4 26B A4B IT	25.1	18.6 — 31.8	21.4	$3.3

Latest results as of July 8th.
View archived leaderboards and check back periodically for updates.
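
To make the Score and CI range columns concrete, here is a minimal sketch of how such numbers could be computed from raw pass/fail results. The leaderboard does not say how its range is derived, so the percentile bootstrap over tasks used here is an assumption, as are the function name `score_and_ci`, its parameters, and the synthetic input data.

```python
import random
from statistics import mean


def score_and_ci(results, n_boot=2000, alpha=0.05, seed=0):
    """Return (score, low, high) in percent.

    `results` is a list of runs; each run is a list of 0/1 outcomes,
    one per task. This is a hypothetical data layout, not the
    benchmark's actual format.
    """
    n_tasks = len(results[0])
    # Pass rate of each task, averaged over all runs.
    task_rates = [mean(run[t] for run in results) for t in range(n_tasks)]
    score = 100 * mean(task_rates)

    # Percentile bootstrap (an assumed method): resample tasks with
    # replacement and recompute the score for each resample.
    rng = random.Random(seed)
    boots = sorted(
        100 * mean(rng.choices(task_rates, k=n_tasks)) for _ in range(n_boot)
    )
    low = boots[int((alpha / 2) * n_boot)]
    high = boots[int((1 - alpha / 2) * n_boot) - 1]
    return score, low, high


# Synthetic example: 10 runs of 100 tasks, where the first 70 tasks
# always pass and the remaining 30 always fail.
runs = [[1] * 70 + [0] * 30 for _ in range(10)]
score, low, high = score_and_ci(runs)
print(f"score={score:.1f}  range={low:.1f} to {high:.1f}")
```

Resampling tasks rather than runs reflects the idea that the 100 tasks are a sample from a larger space of Android development work; a different resampling unit or an analytic interval would give somewhat different ranges.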