Android Bench | Android Developers

Android Bench

Several benchmarks have emerged to measure how well LLMs handle AI-assisted software engineering. Android developers face specific challenges that existing benchmarks don't cover, so we created one focused on a north star of high-quality Android development.

Android LLM Leaderboard

Columns, in order:

Model
Score (%): Average percentage of 100 test cases successfully resolved across 10 runs for each model
CI range (%): Expected performance range, reflecting the results' statistical reliability (p-value < 0.05)
Avg latency (h): Average time taken to solve 100 tasks across 10 runs
Avg total tokens (M): Average token consumption for a full benchmark run (100 tasks) across 10 runs
Avg cost ($): Average cost per full benchmark run
Date
| Model | Score (%) | CI range (%) | Avg latency (h) | Avg total tokens (M) | Avg cost ($) | Date |
|---|---|---|---|---|---|---|
| GPT 5.5 | 74.0 | 66.8–80.5 | 15.5 | 64.5 | $133.9 | 2026-04-27 |
| GPT 5.4 | 72.4 | 65.4–79.3 | 21.2 | 64.2 | $91.7 | 2026-03-16 |
| Gemini 3.1 Pro Preview | 72.4 | 65.1–78.8 | 11.5 | 75.4 | $49.0 | 2026-02-27 |
| Claude Opus 4.7 | 68.7 | 60.5–75.9 | 11.6 | 90.0 | $124.3 | 2026-04-27 |
| GPT 5.3 Codex | 67.7 | 59.9–75.6 | 11.2 | 71.4 | $42.6 | 2026-03-18 |
| Claude Opus 4.6 | 66.6 | 59.1–74.1 | 9.9 | 69.5 | $84.4 | 2026-02-26 |
| GPT 5.2 Codex | 62.5 | 54.4–70.0 | 24.3 | 124.4 | $121.9 | 2026-02-27 |
| Claude Opus 4.5 | 61.9 | 53.9–70.2 | 12.5 | 79.8 | $102.5 | 2026-02-26 |
| Gemini 3 Pro Preview | 60.4 | 52.3–67.7 | 9.8 | 117.0 | $63.7 | 2026-02-27 |
| GLM 5.1 | 59.7 | 52.4–67.4 | 33.4 | 80.2 | $46.7 | 2026-05-08 |
| Claude Sonnet 4.6 | 58.4 | 50.3–66.4 | 8.2 | 47.9 | $40.4 | 2026-03-01 |
| Kimi K2.6 | 58.6 | 51.3–66.5 | 29.9 | 94.3 | $42.5 | 2026-05-10 |
| DeepSeek V4 Pro | 55.4 | 47.5–63.6 | 35.8 | 132.7 | $13.7 | 2026-05-08 |
| Claude Sonnet 4.5 | 54.2 | 45.9–62.2 | 13.1 | 92.9 | $60.3 | 2026-02-26 |
| DeepSeek V4 Flash | 52.7 | 45.3–60.7 | 28.1 | 164.7 | $8.4 | 2026-05-11 |
| MiMo 2.5 Pro | 52.0 | 43.8–60.0 | 33.1 | 97.5 | $74.5 | 2026-05-09 |
| Qwen 3.6 Max Preview | 51.4 | 43.5–59.3 | 20.5 | 103.0 | $222.4 | 2026-05-07 |
| Gemini 3 Flash Preview | 42.0 | 36.6–47.3 | 16.5 | 148.0 | $34.2 | 2026-02-26 |
| MiniMax M2.7 | 37.2 | 30.3–44.9 | 20.3 | 128.3 | $10.1 | 2026-05-01 |
| Qwen 3.6 27B | 37.4 | 30.5–44.5 | 20.7 | 112.3 | $64.6 | 2026-05-05 |
| Gemma 4 31B IT | 33.2 | 26.2–40.8 | 14.2 | 29.5 | $2.5 | 2026-05-01 |
| Qwen 3.6 35B A3B | 31.7 | 24.4–39.0 | 12.5 | 113.4 | $10.7 | 2026-05-05 |
| Gemini 2.5 Pro | 29.1 | 22.3–36.1 | 8.4 | 37.9 | $35.8 | 2026-03-02 |
| Gemma 4 26B A4B IT | 25.1 | 18.8–31.8 | 21.4 | 77.2 | $3.3 | 2026-05-01 |
| GPT OSS 120B | 18.9 | 13.1–25.1 | 25.9 | 122.7 | $7.6 | 2026-05-09 |
| Gemini 2.5 Flash | 15.9 | 10.7–21.1 | 4.9 | 108.8 | $11.2 | 2026-02-26 |
| Qwen 3.5 9B | 15.5 | 10.1–20.9 | 16.6 | 181.4 | $15.6 | 2026-05-07 |
| GPT OSS 20B | 2.4 | 1.2–3.9 | 3.8 | 12.0 | $0.2 | 2026-05-11 |
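
One way to read the leaderboard is to combine its columns. The Python sketch below is illustrative only and is not part of Android Bench: the `Row` class and both helper functions are hypothetical names, and the figures are copied from a few rows of the leaderboard above. It checks whether two models' CI ranges overlap, which is a rough heuristic rather than a formal significance test, and estimates dollars per resolved task on the assumption that a score of N% on 100 tasks means N tasks resolved per run on average.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One leaderboard row (hypothetical helper type, not an Android Bench API)."""
    model: str
    score: float     # Score (%)
    ci_low: float    # lower bound of CI range (%)
    ci_high: float   # upper bound of CI range (%)
    cost_usd: float  # Avg cost ($) per full benchmark run (100 tasks)


# Values copied from a subset of the leaderboard rows.
ROWS = [
    Row("GPT 5.5", 74.0, 66.8, 80.5, 133.9),
    Row("Gemini 3.1 Pro Preview", 72.4, 65.1, 78.8, 49.0),
    Row("Claude Sonnet 4.6", 58.4, 50.3, 66.4, 40.4),
    Row("Gemini 2.5 Pro", 29.1, 22.3, 36.1, 35.8),
]


def ranges_overlap(a: Row, b: Row) -> bool:
    """Return True when the two reported CI ranges share at least one point."""
    return a.ci_low <= b.ci_high and b.ci_low <= a.ci_high


def cost_per_resolved_task(row: Row) -> float:
    """Approximate dollars per resolved task.

    Assumes Score (%) over 100 tasks equals the average number of
    tasks resolved in one full benchmark run.
    """
    return row.cost_usd / row.score


if __name__ == "__main__":
    top = ROWS[0]
    for other in ROWS[1:]:
        verdict = "overlaps" if ranges_overlap(top, other) else "does not overlap"
        print(f"{top.model} CI range {verdict} with {other.model}")
    for row in ROWS:
        print(f"{row.model}: ${cost_per_resolved_task(row):.2f} per resolved task")
```

Overlapping ranges suggest that the gap between two scores may fall within the reported uncertainty, so rank order alone can overstate the difference between closely scored models.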

Learn more about Android Bench