COMPOSITE CAPABILITY 100 represents the most capable model at this moment in time.
See Composite Capability sheet for complete calculation.
Calculation method updated 2025-03
97
100
96
95
95
95
94
94
94
93
90
90
88
87
87
87
86
85
85
84
83
83
82
89
81
81
81
80
79
79
77
77
77
76
75
75
74
73
72
72
71
69
68
69
68
68
67
67
67
67
67
66
66
66
66
66
65
65
64
64
63
63
61
59
58
58
57
56
56
56
55
54
54
53
52
52
52
52
51
51
51
51
50
50
50
49
49
49
47
45
44
44
43
43
43
42
40
40
40
6
Composite Capability Consistency 100 represents consistent performance across benchmarks relative to other models. Lower score indicates higher variance.
56
47
17
67
28
92
44
57
81
98
86
83
22
n/a
40
69
93
69
28
86
75
74
70
n/a
27
50
44
71
56
n/a
70
55
63
71
67
68
43
61
69
74
75
68
n/a
71
55
55
n/a
72
73
77
69
72
n/a
60
60
66
45
1
70
34
n/a
76
64
83
57
72
81
95
22
62
71
85
83
57
59
75
n/a
n/a
99
56
94
n/a
80
89
54
n/a
n/a
17
56
89
46
93
n/a
n/a
87
74
94
88
7
Artificial Analysis Intelligence Index V2 Weighted average of benchmarks for General Reasoning & Knowledge, Math, and Code Gen.
70
69
70
68
71
66
68
68
63
64
61
63
67
60
66
58
62
57
60
55
53
56
56
50
63
48
52
53
53
58
49
48
53
52
44
49
54
51
46
44
47
50
43
41
46
45
44
45
43
45
41
41
40
52
38
43
43
41
41
37
38
34
40
37
35
41
37
36
35
28
35
35
33
35
33
29
29
27
28
8
Artificial Analysis Index V1 Average result across MMLU, GPQA, Math, and HumanEval. V1 Index deprecated 2025-02
90
89
89
86
85
84
79
80
85
83
80
79
78
76
79
77
75
74
74
74
72
76
74
72
70
75
73
72
68
68
68
72
66
72
70
64
67
61
62
57
60
58
9
LM Arena ELO (style control enabled) Crowdsourced comparative evaluation of "vibes". Can be misleading: "answers people like" is not the same as "right answers", i.e. a confidently wrong answer often beats a nuanced and considered one.
1481
1470
1437
1446
1434
1346
1414
1382
1380
1417
1371
1376
1387
1390
1429
1422
1329
1395
1382
1391
1363
1346
1381
1358
1355
1311
1328
1355
1316
1332
1337
1345
1309
1293
1339
1286
1328
1284
1341
1320
1331
1324
1316
1342
1302
1254
1249
1235
1323
1237
1232
1240
1239
1240
1209
1234
1242
1241
1230
1232
1235
1214
1186
1229
1181
1181
1205
1182
1207
1204
1195
1207
10
LiveBench Global Average Questions refresh monthly, so values in this spreadsheet may be stale.
Categories: Reasoning, Coding, Math, Data Analysis, Language, and Instruction Following.
78.72
78.59
69.39
76.69
74.42
80.71
71.99
76.45
79.25
79.53
79.09
75.34
68.33
75.88
71.52
74.40
75.67
65.93
74.50
72.49
69.65
64.75
70.01
69.93
66.86
62.99
60.03
65.13
65.56
59.05
66.92
71.03
56.59
57.76
56.95
60.45
57.94
54.38
65.32
57.79
54.50
61.47
54.33
60.27
62.29
56.64
55.33
53.24
45.55
51.44
49.99
50.40
46.88
52.36
54.30
50.16
49.18
41.25
41.61
45.98
49.12
39.72
43.55
41.26
43.96
48.59
43.45
44.89
42.55
36.35
38.19
33.39
11
Primary Dimensions:
1. Capability: output quality
2. Throughput: sec to first 1k tokens (latency + tokens/sec)
3. Cost: $ per token
Unique to this sheet:
- Composite Capability: normalized average of key Capability indexes and ELOs, providing a single score for "which model is most capable".
- Composite Capability Consistency: based on the standard deviation of normalized capability scores. High value indicates more consistent eval performance.
- Efficiency Index: weighted, normalized score that reflects overall performance across Capability, Throughput, and Cost, answering "which model has the optimal balance of primary dimensions".
2025 emerging trends:
- Reasoning models breaking through Capability plateau, and reset price war with differentiated tier. Caveats:
- Reasoning models generate significantly more tokens, greatly increasing both total cost and time to full response beyond what $/token and tokens/sec alone indicate.
- Continued improvement in frontier non-reasoning LLMs has closed the gap with first-wave reasoning model capability.
- Open models narrowed capability gap with Closed models, exerting more downward pressure on cost.
- 2025 "smart" models are competitive in speed and cost with 2024's "fast & cheap" models.
- Benchmarks lose validity and utility over time (saturation, contamination, test-specific optimization, etc). Purpose-specific and controlled evaluation required.
2024 at a glance:
- Cost decreased significantly (15x price drop for "GPT-4" level capability in 1yr).
- Capability incrementally improved.
- Small models aka SLMs (low parameter count) increasingly capable, appropriate for tightly-focused use cases and potentially on-device.
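The Composite Capability and Composite Capability Consistency definitions above can be illustrated with a minimal sketch. The exact calculation lives in the Composite Capability sheet and is not reproduced here, so everything below is an assumption: the divide-by-best normalization, the equal weighting, the mapping from standard deviation to a 0..100 score, and the made-up models and scores. Returning None for models with fewer than two scores is this sketch's stand-in for the sheet's n/a entries, not a documented rule.

```python
from statistics import mean, pstdev

# Hypothetical scores for made-up models; None marks a missing benchmark.
SCORES = {
    "model_a": {"index": 70.0, "elo": 1480.0, "livebench": 78.0},
    "model_b": {"index": 60.0, "elo": 1400.0, "livebench": None},
    "model_c": {"index": 40.0, "elo": 1250.0, "livebench": 50.0},
    "model_d": {"index": None, "elo": 1200.0, "livebench": None},
}


def normalize(scores):
    """Scale each benchmark by its best score so values land in (0, 1].

    Assumption: divide-by-best scaling; the sheet's real method may differ.
    """
    best = {}
    for row in scores.values():
        for name, value in row.items():
            if value is not None:
                best[name] = max(best.get(name, value), value)
    return {
        model: {n: v / best[n] for n, v in row.items() if v is not None}
        for model, row in scores.items()
    }


def composite_capability(scores):
    """Average the normalized scores, then rescale so the top model is 100."""
    averages = {
        m: mean(row.values()) for m, row in normalize(scores).items() if row
    }
    top = max(averages.values())
    return {m: round(100 * avg / top) for m, avg in averages.items()}


def consistency(scores):
    """Map per-model standard deviation to 0..100 (higher = more consistent).

    Models with fewer than two scores get None.
    """
    spread = {
        m: pstdev(row.values()) if len(row) >= 2 else None
        for m, row in normalize(scores).items()
    }
    worst = max(s for s in spread.values() if s is not None)

    def to_score(s):
        if s is None:
            return None
        if worst == 0:
            return 100
        return round(100 * (1 - s / worst))

    return {m: to_score(s) for m, s in spread.items()}


if __name__ == "__main__":
    print(composite_capability(SCORES))
    print(consistency(SCORES))
```

In this sketch a model that tops every benchmark scores 100 on both measures, while a model whose normalized scores vary the most scores 0 on consistency.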
12
EFFICIENCY INDEX (weighted, normalized) 100 represents the optimal balance of primary dimensions:
Throughput (median tokens/sec) Highly variable based on sample date.
83.5
93.0
124.4
13.3
47.3
93.3
45.0
47.3
48.8
55.4
130.8
341.5
118.3
48.8
76.1
29.0
16.1
62.4
36.2
292.6
65.7
170.0
193.0
93.5
166.1
7.1
52.1
115.4
87.9
71.0
118.4
62.4
79.0
213.3
70.4
56.0
131.1
34.7
33.2
57.7
56.4
90.2
170.0
68.9
292.6
94.1
107.4
80.2
47.6
35.5
33.9
56.6
64.5
184.4
184.1
47.6
27.3
26.8
82.5
31.0
83.3
64.5
59.5
101.2
52.2
59.2
38.7
36.4
53.6
8.9
29.5
184.2
73.9
60.9
75.9
128.8
181.4
68.4
28.3
102.4
66.6
128.0
16.8
64.4
76.2
44.0
16.9
62.2
68.1
153.3
129.9
14.7
79.5
63.2
15
Latency seconds (First Chunk)
8.87
2.08
15.3
89.72
9.62
2.26
6.63
9.62
1.98
1.91
9.09
3.5
9.64
1.98
5.06
15.85
1.05
1.17
19.65
0.13
1.56
8.42
0.37
5.57
0.64
0.94
0.63
0.44
1.1
1.35
12.32
1.17
0.63
4.33
13.2
0.77
0.99
0.74
1.08
1.37
0.94
0.4
8.42
0.74
0.13
2.57
0.54
1.02
3.06
0.86
3.58
0.56
1.16
0.47
0.36
0.49
1.65
0.88
0.84
0.73
2.35
1.09
0.27
0.29
0.74
0.25
0.66
0.16
0.51
1.81
2.24
0.43
0.37
0.45
1.19
0.38
1.24
0.51
0.48
0.26
0.13
0.24
1.04
0.75
0.32
0.52
16
Seconds per first 1K tokens output Is it fast?
20.85
12.83
23.34
164.91
30.78
12.97
28.86
30.78
22.46
19.96
16.74
6.43
18.09
22.46
18.19
50.38
63.32
17.20
47.24
3.55
16.77
14.30
5.55
16.26
6.66
142.18
19.82
9.11
12.48
15.43
20.77
17.20
13.29
9.02
27.41
18.62
8.62
29.57
31.16
18.70
18.67
11.49
14.30
15.25
3.55
13.20
9.85
13.49
24.09
29.04
33.08
18.24
16.65
5.89
5.79
21.51
38.25
38.14
12.97
32.98
14.35
16.60
17.09
10.17
19.90
17.15
26.51
27.61
19.18
113.92
36.16
5.86
13.91
16.88
14.36
8.14
5.51
15.86
35.34
10.28
15.50
8.07
59.52
15.53
13.26
22.97
59.17
17.13
15.45
6.84
8.22
68.03
12.58
15.82
17
Seconds per first 4k tokens output
56.77
45.07
47.45
390.47
94.26
45.11
95.54
94.26
83.91
74.11
39.67
15.21
43.45
83.91
57.59
153.97
250.12
65.29
130.03
13.80
62.42
31.95
21.10
48.34
24.72
565.91
77.39
35.10
46.62
57.69
46.10
65.29
51.27
23.08
70.05
72.17
31.50
116.05
121.42
70.71
71.86
44.77
31.95
58.80
13.80
45.07
37.78
50.91
87.18
113.60
121.57
71.29
63.14
22.16
22.09
84.58
148.06
149.91
49.35
129.72
50.37
63.13
67.53
39.82
77.37
67.85
104.07
109.96
75.19
450.24
137.93
22.15
54.53
66.15
53.87
31.44
22.05
59.71
141.34
39.57
60.58
31.51
238.10
62.11
52.65
91.15
236.69
65.39
59.53
26.41
31.31
272.11
50.31
63.29
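The two "seconds per first N tokens" columns above are consistent with first-chunk latency plus N divided by median tokens/sec. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the sample inputs are the first values of the Throughput and Latency columns.

```python
def seconds_to_first_tokens(latency_s, tokens_per_s, n_tokens=1000):
    """Estimate seconds until the first n_tokens of output have been emitted."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return latency_s + n_tokens / tokens_per_s


if __name__ == "__main__":
    # First row of the sheet: 8.87 s latency, 83.5 tokens/sec.
    print(round(seconds_to_first_tokens(8.87, 83.5), 2))        # 20.85
    print(round(seconds_to_first_tokens(8.87, 83.5, 4000), 2))  # 56.77
```

Because throughput is highly variable by sample date, these values are best read as rough comparisons rather than guarantees.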
18
COST
19
Cost Variants (Cached 1M input, queries…)
$7.500
$37.500
$0.550
$7.500
$1.500
$3.75 write / $0.30 read
$1.250
$0.025
$0.313
$1.250
$3.75 write / $0.30 read
$18.75 write / $1.50 read
$0.075
$0.019
$0.30 write / $0.03 read
20
Cost Uncached Input (1M tokens)
$1.100
$1.250
$1.250
$1.250
$20.000
$2.000
$1.250
$1.250
$2.000
$15.000
$3.000
$1.250
$0.300
$0.150
$1.100
$15.000
$1.100
$15.000
$75.000
$3.000
$0.550
$0.100
$3.000
$0.300
$5.000
$1.100
$0.150
$4.000
$2.000
$0.300
$15.000
$0.900
$3.000
$0.400
$0.100
$0.400
$3.000
$3.000
$0.900
$3.000
$0.200
$0.220
$0.300
$2.500
$0.100
$2.000
$0.100
$1.250
$3.000
$1.600
$2.500
$3.000
$0.075
$2.500
$0.120
$0.380
$0.300
$0.140
$10.000
$1.000
$3.500
$2.000
$0.600
$2.000
$5.000
$0.050
$0.170
$2.000
$1.070
$15.000
$0.100
$0.800
$0.150
$3.500
$0.100
$0.075
$0.800
$30.000
$0.900
$1.200
$0.100
$0.140
$0.630
$0.060
$2.000
$0.140
$0.800
$2.000
$0.075
$0.900
$3.000
$0.200
$3.000
21
Cost Output (1M tokens)
$4.400
$10.000
$10.000
$10.000
$80.000
$8.000
$10.000
$10.000
$8.000
$75.000
$15.000
$10.000
$0.500
$3.500
$4.400
$75.000
$4.400
$60.000
$150.000
$15.000
$2.190
$0.400
$15.000
$0.100
$15.000
$4.400
$0.600
$4.000
$8.000
$2.500
$60.000
$0.900
$15.000
$1.600
$0.500
$2.000
$12.000
$15.000
$0.900
$15.000
$0.800
$0.880
$0.100
$10.000
$0.400
$2.000
$0.400
$2.500
$15.000
$6.400
$10.000
$15.000
$0.300
$10.000
$0.180
$0.400
$0.500
$0.580
$30.000
$1.000
$3.500
$10.000
$0.600
$6.000
$15.000
$0.100
$0.680
$6.000
$1.140
$75.000
$0.400
$3.200
$0.600
$10.500
$0.300
$0.300
$4.000
$60.000
$0.900
$1.200
$0.300
$0.280
$0.650
$0.240
$8.000
$0.280
$0.800
$8.000
$0.300
$0.900
$15.000
$0.600
$3.000
22
Cost 1M tokens (3:1 input:output) Is it cheap?
$1.925
$3.438
$3.438
$3.438
$35.000
$3.500
$3.438
$3.438
$3.500
$30.000
$6.000
$3.438
$0.350
$0.988
$1.925
$30.000
$1.925
$26.250
$93.750
$6.000
$0.960
$0.175
$6.000
$0.250
$7.500
$1.925
$0.263
$4.000
$3.500
$0.850
$26.250
$0.900
$6.000
$0.700
$0.200
$0.800
$5.250
$6.000
$0.900
$6.000
$0.350
$0.385
$0.250
$4.375
$0.175
$2.000
$0.175
$1.563
$6.000
$2.800
$4.375
$6.000
$0.131
$4.375
$0.135
$0.385
$0.350
$0.250
$15.000
$1.000
$3.500
$4.000
$0.600
$3.000
$7.500
$0.063
$0.298
$3.000
$1.088
$30.000
$0.175
$1.400
$0.263
$5.250
$0.150
$0.131
$1.600
$37.500
$0.900
$1.200
$0.150
$0.175
$0.635
$0.105
$3.500
$0.175
$0.800
$3.500
$0.131
$0.900
$6.000
$0.300
$3.000
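The blended column above matches a weighted average of the uncached input and output prices at three parts input to one part output. A minimal sketch, with a hypothetical function name and the first row's prices as sample inputs:

```python
def blended_cost(input_per_m, output_per_m, input_parts=3, output_parts=1):
    """Blend input and output prices (USD per 1M tokens) at a fixed ratio."""
    total_parts = input_parts + output_parts
    weighted = input_parts * input_per_m + output_parts * output_per_m
    return weighted / total_parts


if __name__ == "__main__":
    # First row of the sheet: $1.100 input, $4.400 output per 1M tokens.
    print(f"${blended_cost(1.10, 4.40):.3f}")  # $1.925
```

The 3:1 ratio assumes prompts are larger than responses. Reasoning models that generate many output tokens will cost more per request than this blend suggests.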
23
24
25
COST VS (CAPABILITY, THROUGHPUT)
26
Cost 1M (3:1 IO) tokens per Composite Capability point Is capability in line with cost?
$0.020
$0.034
$0.036
$0.036
$0.369
$0.037
$0.036
$0.037
$0.037
$0.324
$0.066
$0.038
$0.004
$0.011
$0.022
$0.346
$0.022
$0.308
$1.103
$0.071
$0.012
$0.002
$0.073
$0.003
$0.093
$0.024
$0.003
$0.050
$0.044
$0.011
$0.341
$0.012
$0.080
$0.009
$0.003
$0.011
$0.073
$0.084
$0.013
$0.088
$0.005
$0.006
$0.004
$0.065
$0.003
$0.030
$0.003
$0.023
$0.090
$0.042
$0.067
$0.091
$0.002
$0.067
$0.002
$0.006
$0.006
$0.004
$0.244
$0.017
$0.060
$0.069
$0.011
$0.053
$0.133
$0.001
$0.005
$0.056
$0.020
$0.563
$0.003
$0.027
$0.005
$0.102
$0.003
$0.003
$0.032
$0.751
$0.018
$0.024
$0.003
$0.004
$0.014
$0.002
$0.080
$0.004
$0.018
$0.082
$0.003
$0.021
$0.149
$0.008
$0.075
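The column above is consistent with dividing the blended 3:1 cost by the Composite Capability score. A minimal sketch follows. The function name is hypothetical, and returning None for a missing score is an assumption about how missing entries are treated.

```python
def cost_per_point(blended_cost_usd, score):
    """Return USD (per 1M blended tokens) per score point, or None if no score."""
    if score is None or score <= 0:
        return None
    return blended_cost_usd / score


if __name__ == "__main__":
    # First two rows of the sheet: ($1.925, 97) and ($3.4375, 100).
    print(f"${cost_per_point(1.925, 97):.3f}")    # $0.020
    print(f"${cost_per_point(3.4375, 100):.3f}")  # $0.034
    print(cost_per_point(3.0, None))              # None
```

Lower values indicate more capability per dollar. Dividing the same blended cost by a different score would give the other per-point figures under the same assumption.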
27
Cost 1M (3:1 IO) tokens per AA Intelligence Index v2 point
$0.028
$0.050
$0.049
$0.051
$0.493
$0.053
$0.051
$0.051
$0.056
$0.469
$0.098
$0.055
$0.005
$0.016
$0.029
$0.517
$0.423
$0.105
$0.016
$0.003
$0.113
$0.004
$0.150
$0.031
$0.005
$0.077
$0.066
$0.016
$0.016
$0.125
$0.013
$0.005
$0.016
$0.097
$0.118
$0.020
$0.136
$0.007
$0.008
$0.006
$0.107
$0.004
$0.044
$0.004
$0.035
$0.140
$0.062
$0.107
$0.003
$0.109
$0.003
$0.009
$0.006
$0.023
$0.085
$0.015
$0.081
$0.197
$0.002
$0.007
$0.081
$0.857
$0.004
$0.038
$0.007
$0.004
$0.005
$0.046
$0.026
$0.036
$0.004
$0.003
$0.121
$0.121
$0.011
$0.107
28
Cost 1M (3:1 IO) tokens per AA Index v1 point
$0.292
$0.011
$0.022
$0.305
$0.063
$0.011
$0.075
$0.024
$0.002
$0.020
$0.035
$0.056
$0.079
$0.002
$0.005
$0.200
$0.047
$0.008
$0.041
$0.104
$0.004
$0.041
$0.015
$0.429
$0.019
$0.004
$0.002
$0.024
$0.013
$0.018
$0.002
$0.003
$0.009
$0.002
$0.055
$0.003
$0.013
$0.015
$0.105
$0.005
$0.052
29
Cost 1M (3:1 IO) tokens per Chatbot Arena ELO point
$0.002
$0.002
$0.002
$0.002
$0.002
$0.001
$0.021
$0.001
$0.019
$0.066
$0.004
$0.001
$0.000
$0.004
$0.005
$0.001
$0.000
$0.003
$0.003
$0.019
$0.001
$0.004
$0.001
$0.000
$0.001
$0.004
$0.005
$0.001
$0.004
$0.000
$0.000
$0.003
$0.000
$0.001
$0.002
$0.003
$0.005
$0.000
$0.003
$0.000
$0.000
$0.012
$0.003
$0.003
$0.000
$0.000
$0.000
$0.002
$0.001
$0.024
$0.000
$0.001
$0.000
$0.004
$0.000
$0.001
$0.030
$0.001
$0.000
$0.001
$0.000
$0.003
$0.000
$0.001
$0.000
$0.001
$0.005
30
Cost 1M (3:1 IO) tokens per LiveBench point
$0.024
$0.044
$0.050
$0.045
$0.470
$0.043
$0.048
$0.045
$0.044
$0.377
$0.076
$0.046
$0.005
$0.025
$0.419
$0.026
$0.347
$1.422
$0.081
$0.013
$0.086
$0.116
$0.027
$0.004
$0.060
$0.056
$0.015
$0.092
$0.012
$0.003
$0.014
$0.091
$0.105
$0.015
$0.104
$0.007
$0.004
$0.076
$0.037
$0.003
$0.029
$0.100
$0.045
$0.079
$0.002
$0.003
$0.007
$0.007
$0.298
$0.021
$0.067
$0.074
$0.012
$0.153
$0.002
$0.007
$0.024
$0.611
$0.004
$0.032
$0.006
$0.003
$0.003
$0.037
$0.020
$0.004
$0.003
$0.021
$0.009
31
Cost 1M (3:1 IO) tokens per Throughput (first 1k tokens output) Is throughput in line with cost?
$0.092
$0.268
$0.147
$0.212
$0.114
$0.265
$0.119
$0.114
$1.336
$0.301
$0.021
$0.154
$0.106
$1.336
$0.106
$0.521
$1.481
$0.349
$0.020
$0.049
$0.358
$0.017
$1.351
$0.118
$0.039
$0.028
$0.177
$0.093
$2.104
$0.058
$0.349
$0.053
$0.007
$0.043
$0.609
$0.203
$0.029
$0.321
$0.019
$0.034
$0.017
$0.287
$0.049
$0.152
$0.018
$0.116
$0.249
$0.096
$0.240
$0.360
$0.022
$0.755
$0.006
$0.010
$0.009
$0.019
$0.455
$0.070
$0.211
$0.234
$0.059
$0.151
$0.437
$0.002
$0.011
$0.156
$0.010
$0.830
$0.030
$0.101
$0.016
$0.366
$0.018
$0.024
$0.101
$1.061
$0.088
$0.077
$0.019
$0.003
$0.041
$0.008
$0.152
$0.003
$0.047
$0.227
$0.019
$0.110
$0.088
$0.024
$0.190
32
33
34
BENCHMARKS
35
Scores are opportunistically sourced from HuggingFace leaderboards, model cards, docs & press releases, other 3rd parties.
Many reproducibility, precision, and accuracy issues due to selection bias & varied methodologies (n-shot, turns, CoT, benchmark variant…), susceptible to prompt sensitivity, construct validity, and contamination.
Benchmarks used by HuggingFace Leaderboard 2 (2024 June 26) are highlighted in GREEN.
Benchmarks used by HuggingFace Leaderboard 1 are highlighted in ORANGE. These benchmarks are increasingly outdated for various reasons, e.g. benchmark inaccuracies, or because they are effectively "solved" by frontier models (due to training to beat the benchmark, leaked criteria, or an outdated level of difficulty).
36
General
37
MMLU PRO HuggingFace Leaderboard 2 Scores per category (Biology, Business, Engineering, …)
84.0
84.0
84.0
79.0
77.0
76.0
81.0
80.0
76.2
74.0
76.0
78.0
67.0
80.0
77.6
75.8
76.0
72.6
76.1
71.6
71.0
86.0
67.0
63.7
73.3
68.9
70.0
85.0
45.3
70.4
68.0
68.5
69.0
63.1
69.0
66.8
67.3
65.0
64.8
66.4
67.0
66.3
63.6
52.6
53.5
53.6
59.1
56.2
53.0
58.1
38
MMLU HuggingFace Leaderboard 1 Most widely reported, but many known issues.
86.9
92.0
91.0
85.9
90.8
85.2
87.0
89.0
88.0
86.0
87.0
88.7
88.7
85.0
78.6
86.4
88.6
86.0
74.5
84.8
84.0
86.8
85.9
82.0
81.9
80.6
75.0
81.0
86.4
86.0
86.0
82.0
78.5
83.8
80.5
81.2
75.2
78.9
82.0
83.2
83.8
39
Humanity's Last Exam Technical knowledge and reasoning via structured academic problems
18.8
12.1
8.35
10
8.57
11.1
5
4
6.67
5
7.2
4.05
4.85
5
4.89
5.15
2.62
4.43
5
5
3.9
3.97
5
0.84
5.53
40
HELM
90.8
93.8
88.5
90.8
85.8
80.3
72.2
88.5
70.1
79.3
91.5
83.8
82.7
70.8
74.2
73.3
53.0
41
Kagi LLM Benchmarking Omitted - scores highly variable over time