Android Bench
Several benchmarks have emerged to measure the capabilities of LLMs in AI-assisted software engineering. Android developers face specific challenges that existing benchmarks don't cover, so we created one with a single north star: high-quality Android development.
Android LLM Leaderboard
Column definitions:

- Score (%): Average percentage of 100 test cases successfully resolved across 10 runs for each model.
- CI range (%): Expected performance range, reflecting the results' statistical reliability (p-value < 0.05).
- Avg latency (h): Average time taken to solve 100 tasks across 10 runs.
- Avg total tokens (M): Average token consumption for a full benchmark run (100 tasks) across 10 runs.
- Avg cost ($): Average cost per full benchmark run.

| Model | Score (%) | CI range (%) | Avg latency (h) | Avg total tokens (M) | Avg cost ($) | Date |
|---|---|---|---|---|---|---|
| | 74.0 | 66.8–80.5 | 15.5 | 64.5 | $133.9 | 2026-04-27 |
| | 72.4 | 65.4–79.3 | 21.2 | 64.2 | $91.7 | 2026-03-16 |
| | 72.4 | 65.1–78.8 | 11.5 | 75.4 | $49.0 | 2026-02-27 |
| | 68.7 | 60.5–75.9 | 11.6 | 90.0 | $124.3 | 2026-04-27 |
| | 67.7 | 59.9–75.6 | 11.2 | 71.4 | $42.6 | 2026-03-18 |
| | 66.6 | 59.1–74.1 | 9.9 | 69.5 | $84.4 | 2026-02-26 |
| | 62.5 | 54.4–70.0 | 24.3 | 124.4 | $121.9 | 2026-02-27 |
| | 61.9 | 53.9–70.2 | 12.5 | 79.8 | $102.5 | 2026-02-26 |
| | 60.4 | 52.3–67.7 | 9.8 | 117.0 | $63.7 | 2026-02-27 |
| | 59.7 | 52.4–67.4 | 33.4 | 80.2 | $46.7 | 2026-05-08 |
| | 58.4 | 50.3–66.4 | 8.2 | 47.9 | $40.4 | 2026-03-01 |
| | 58.6 | 51.3–66.5 | 29.9 | 94.3 | $42.5 | 2026-05-10 |
| | 55.4 | 47.5–63.6 | 35.8 | 132.7 | $13.7 | 2026-05-08 |
| | 54.2 | 45.9–62.2 | 13.1 | 92.9 | $60.3 | 2026-02-26 |
| | 52.7 | 45.3–60.7 | 28.1 | 164.7 | $8.4 | 2026-05-11 |
| | 52.0 | 43.8–60.0 | 33.1 | 97.5 | $74.5 | 2026-05-09 |
| | 51.4 | 43.5–59.3 | 20.5 | 103.0 | $222.4 | 2026-05-07 |
| | 42.0 | 36.6–47.3 | 16.5 | 148.0 | $34.2 | 2026-02-26 |
| | 37.2 | 30.3–44.9 | 20.3 | 128.3 | $10.1 | 2026-05-01 |
| | 37.4 | 30.5–44.5 | 20.7 | 112.3 | $64.6 | 2026-05-05 |
| | 33.2 | 26.2–40.8 | 14.2 | 29.5 | $2.5 | 2026-05-01 |
| | 31.7 | 24.4–39.0 | 12.5 | 113.4 | $10.7 | 2026-05-05 |
| | 29.1 | 22.3–36.1 | 8.4 | 37.9 | $35.8 | 2026-03-02 |
| | 25.1 | 18.8–31.8 | 21.4 | 77.2 | $3.3 | 2026-05-01 |
| | 18.9 | 13.1–25.1 | 25.9 | 122.7 | $7.6 | 2026-05-09 |
| | 15.9 | 10.7–21.1 | 4.9 | 108.8 | $11.2 | 2026-02-26 |
| | 15.5 | 10.1–20.9 | 16.6 | 181.4 | $15.6 | 2026-05-07 |
| | 2.4 | 1.2–3.9 | 3.8 | 12.0 | $0.2 | 2026-05-11 |
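
The leaderboard columns can be combined to compare models on more than raw score. The sketch below is an illustration only, not part of the benchmark tooling: `Row`, `cost_per_resolved_task`, and `ci_overlap` are hypothetical names, and the values are copied from the first, third, and last rows above. It assumes that, because each run has 100 tasks, the score percentage equals the average number of tasks resolved per run. Overlapping ranges are treated as a rough hint that two scores may not be distinguishable, not as a formal significance test.

```python
from typing import NamedTuple

TASKS_PER_RUN = 100  # each benchmark run covers 100 tasks, per the column definitions


class Row(NamedTuple):
    """One leaderboard row (hypothetical container, for illustration only)."""
    score: float     # Score (%)
    ci_low: float    # lower bound of the reported range (%)
    ci_high: float   # upper bound of the reported range (%)
    cost_usd: float  # Avg cost ($) per full benchmark run


def cost_per_resolved_task(row: Row) -> float:
    """Average dollars spent per successfully resolved task."""
    resolved = row.score / 100 * TASKS_PER_RUN
    return row.cost_usd / resolved


def ci_overlap(a: Row, b: Row) -> bool:
    """True when the two reported ranges share at least one point."""
    return a.ci_low <= b.ci_high and b.ci_low <= a.ci_high


# First, third, and last rows of the leaderboard.
top = Row(74.0, 66.8, 80.5, 133.9)
third = Row(72.4, 65.1, 78.8, 49.0)
last = Row(2.4, 1.2, 3.9, 0.2)

print(round(cost_per_resolved_task(top), 2))    # 1.81
print(round(cost_per_resolved_task(third), 2))  # 0.68
print(ci_overlap(top, third))                   # True
print(ci_overlap(top, last))                    # False
```

Read this way, the first and third rows have overlapping ranges while differing by roughly a factor of 2.7 in cost per resolved task, which is the kind of trade-off the score column alone does not show.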