LLM capability, cost, & throughput (www.harlanlewis.com)

5 min read Original article ↗

1

MODELo4-mini (high)gpt-5 (high)Gemini 2.5 Pro Preview 06-05Gemini 2.5 Pro Experimentalo3 proo3 (high)Gemini 2.5 Pro Preview 05-06gpt-5 (medium)o3 (medium)Claude 4 Opus ThinkingClaude 4 Sonnet Thinkinggpt-5 (low)Grok 3 Mini Reasoning (high)Gemini 2.5 Flash (Thinking) Previewo3-mini (high)Claude 4 Opuso4-mini (medium)o1GPT-4.5 PreviewClaude 3.7 Sonnet ThinkingDeepSeek R1Gemini 2.5 Flash Lite (Reasoning) Preview 06-17Claude 4 SonnetGLM-4.6Qwen 3 30B A3B (Reasoning)GPT-4o 2025-03-27o3-mini (medium)Gemini 2.5 Flash PreviewDeepSeek V3 0324GPT-4.1Gemini 2.5 Flasho1-previewQwen 2.5 MaxGemini 2.0 Pro ExperimentalClaude 3.7 SonnetGPT-4.1 miniGemini 2.0 Flash Thinking ExperimentalQwen 3 32BMistral Medium 3o1-miniGrok 3DeepSeek V3Claude 3.5 Sonnet New 2024-10Reka Flash 3 PreviewLlama 4 MaverickQwen 3 30B A3BGPT-4o 2024-11-20Gemini 2.5 Flash Lite Preview 06-17DeepSeek R1 Distill Llama 70BGemini 2.0 FlashGemini 1.5 Pro 002Sonar ProQwen 2.5 MaxPhi-4 Reasoning PlusGPT-4o 2024-08-06Claude 3.5 Sonnet 2024-06Gemini 2.0 Flash-LiteCommand ADeepSeek R1 Distill Qwen 32BQwen 2.5 (72B)Gemma 3 27BLlama 4 ScoutGPT-4 TurboSonarLlama 3.1 405BGrok-2 1212Llama 3.3 70BPixtral LargeGrok BetaGemma 3 12BPhi-4Mistral Large 2 2024-07-24DeepSeek-V2.5Claude 3 OpusGPT-4.1 nanoNova ProGPT-4o miniHunyuan-Large 2025-02Gemini 1.5 Pro 001Mistral Small 3 2025-03Gemini 1.5 Flash 002Claude 3.5 HaikuGPT-4Llama 3.1 70BLlama 3.2 90BMistral Small 3 2025-01Hunyuan-Standard 2025-02DeepSeek-V2Qwen 2 (72B)Nova LiteJamba 1.5 LargeDeepSeek-Coder-V2Gemma 2 27BJamba 1.6 LargeGemini 1.5 Flash 001Llama 3 70BReka CoreMistral Small 2024-09Yi-Large

2

Version (date, alias, etc)4/16/20258/7/20256/5/20253/25/20256/10/20254/16/20255/6/20258/7/20254/16/20255/22/20255/22/20258/7/20254/9/20254/17/20252/13/20255/22/20254/16/202512/17/20242/27/20252/24/20251/20/20256/17/20255/22/20259/30/20254/28/20252025-03-0271/31/20254/17/20253/24/20254/14/20256/17/2025

preview-2024-09-12

3/5/20252/5/20252/24/20254/14/202512/19/20244/28/20255/7/2025

preview-2024-09-12

3/17/202512/6/202410/22/20243/10/20254/5/20254/28/202511/20/20246/17/20251/1/20252/5/20251.5-pro-0023/7/20251/29/20255/1/20258/6/20246/20/20242/5/20253/13/20251/29/20252025-03–124/5/20250125-preview1/25/20257/23/202412/12/202412/6/2021411/18/20248/13/20242025-03–127/24/20244/14/202512/3/20242/10/20253/17/20251.5-flash-00210/22/20241/1/20252/10/202512/3/20243/13/20257/22/20249/1/2024

3

Provider (for Throughput and Cost metrics)OpenAIOpenAIGoogle VertexGoogleOpenAIOpenAIGoogle VertexOpenAIOpenAIGoogle VertexGoogle VertexOpenAIxAIGoogleOpenAIGoogle VertexOpenAIOpenAIOpenAIAnthropicDeepSeekGoogle AI StudioGoogle Vertexz.AIDeepInfraOpenAIOpenAIGoogleHyperbolicOpenAIGoogle AI StudioOpenAIFireworksGoogleAnthropicOpenAIGoogleParasailMistralOpenAIxAIFireworks AIAnthropicRekaFireworksDeepInfraOpenAIGoogle AI StudioTogetherGoogleGooglePerplexityAlibabaDeepInfraOpenAIAnthropicGoogleCohereDeepInfraAlibabaParasailParasailOpenAIPerplexityTogetherxAIFriendliMistralxAIDeepInfraDeepInfraMistralAnthropicOpenAIAWSOpenAIGoogleMistralGoogleAnthropicOpenAIFireworks AIFireworks AIMistralTogether.aiAWSAmazon BedrockTogether.aiAmazon BedrockGoogleFireworks AIRekaMistralFireworks AI

4

Year Released202520252025202520252025202520252025202520252025202520252025202520252024202520252025202520252025202520252025202520252025202520242025202520252025202420252025202420252024202420242025202520242025202520252024202520252025202420242025202520252024202520252023202520242024202420242024202520242024202420242025202420242025202420252024202420232024202420252025202420242024202420242024202520242024202420242024

5

COMPOSITE CAPABILITY
100 represents the most capable model at this moment in time.

See Composite Capability sheet for complete calculation.

Calulation method updated 2025-03

9710096959595949494939090888787878685858483838289818181807979777777767575747372727169686968686767676767666666666665656464636361595858575656565554545352525252515151515050504949494745444443434342404040

6

Composite Capability Consistency
100 represents consistent performance across benchmarks relative to other models. Lower score indicates higher variance.56471767289244578198868322n/a406993692886757470n/a2750447156n/a705563716768436169747568n/a715555n/a7273776972n/a6060664517034n/a766483577281952262718583575975n/an/a995694n/a808954n/an/a1756894693n/an/a87749488

7

Artificial Analysis Intelligence Index V2
Weighted average of benchmarks for General Reasoning & Knowledge, Math, and Code Gen.70697068716668686364616367606658625760555356565063485253535849485352444954514644475043414645444543454141405238434341413738344037354137363528353533353329292728

8

Artificial Analysis Index V1
Average result across MMLU, GPQA, Math, and HumanEval.
V1 Index deprecated 2025-02908989868584798085838079787679777574747472767472707573726868687266727064676162576058

9

LM Arena ELO (style control enabled)
Crowdsourced comparative evaluation of "vibes".
Can be misleading - "answers people like" is not the same as "right answers", ie
confidently wrong often beats nuanced & considerate.148114701437144614341346141413821380141713711376138713901429142213291395138213911363134613811358135513111328135513161332133713451309129313391286132812841341132013311324131613421302125412491235132312371232124012391240120912341242124112301232123512141186122911811181120511821207120411951207

10

LiveBench Global Average
Questions refresh monthly, potential for stale values in spreadsheet.

Categories: Reasoning, Coding, Math, Data Analysis, Language, and Instruction Following.

78.7278.5969.3976.6974.4280.7171.9976.4579.2579.5379.0975.3468.3375.8871.5274.4075.6765.9374.5072.4969.6564.7570.0169.9366.8662.9960.0365.1365.5659.0566.9271.0356.5957.7656.9560.4557.9454.3865.3257.7954.5061.4754.3360.2762.2956.6455.3353.2445.5551.4449.9950.4046.8852.3654.3050.1649.1841.2541.6145.9849.1239.7243.5541.2643.9648.5943.4544.8942.5536.3538.1933.39

11

Primary Dimensions:

1. Capability: output quality
2.
Throughput: sec to first 1k tokens (latency + tokens/sec)
3.
Cost: $ per tokenUnique to this sheet:

-

Composite Capability: normalized average of key Capability indexes and ELOs, providing a single score for "which model is most capable".
-
Composite Capability Consistency: standard deviation of normalized capability scores. High value indicates more consistent eval performance.
-
Efficiency Index: weighted, normalized score that reflects overall performance across Capability, Throughput, and Cost, answering "which model has the optimal balance of primary dimensions".2025 emerging trends:

-

Reasoning models breaking through Capability plateau, and reset price war with differentiated tier. Caveats:
- Reasoning models generate significantly more tokens, greatly increasing both total cost and time to full response beyond what $/token and tokens/sec alone indicate.
- Continued improvement in frontend non-reasoning LLMs has closed the gap with first-wave reasoning model capability.
-
Open models narrowed capability gap with Closed models, exerting more downward pressure on cost.
- 2025 "smart" models are competitive in speed and cost with 2024's "fast & cheap" models.
- Benchmarks lose validity and utility over time (saturation, contamination, test-specific optimization, etc). Purpose-specific and controlled evaluation required.
2024 at a glance:

- Cost decreased significantly (15x price drop for "GPT-4" level capability in 1yr).
- Capability incrementally improved.
-
Small models aka SLMs (low parameter count) increasingly capable, appropriate for tightly-focused use cases and potentially on-device.

12

EFFICIENCY INDEX (weighted, normalized)
100 represents the optimal balance of primary dimensions:

- Capability: 2x weight
- Throughput: sqrt
- Cost: sqrt

90n/a918864869086858185n/a8890867885736483799082n/a858583876581847682n/a7882n/a78797974757578797876827780777272n/a7474807875717176667371717469697069685659747070n/a6872736856706871n/a616768625865646967526562

13

THROUGHPUT

14

Throughput (median tokens/sec)
Highly variable based on sample date.83.593.0124.413.347.393.345.047.348.855.4130.8341.5118.348.876.129.016.162.436.2292.665.7170.0193.093.5166.17.152.1115.487.971.0118.462.479.0213.370.456.0131.134.733.257.756.490.2170.068.9292.694.1107.480.247.635.533.956.664.5184.4184.147.627.326.882.531.083.364.559.5101.252.259.238.736.453.68.929.5184.273.960.975.9128.8181.468.428.3102.466.6128.016.864.476.244.016.962.268.1153.3129.914.779.563.2

15

Latency seconds (First Chunk)8.872.0815.389.729.622.266.639.621.981.919.093.59.641.985.0615.851.051.1719.650.131.568.420.375.570.640.940.630.441.11.3512.321.170.634.3313.20.770.990.741.081.370.940.48.420.740.132.570.541.023.060.863.580.561.160.470.360.491.650.880.840.732.351.090.270.290.740.250.660.160.511.812.240.430.370.451.190.381.240.510.480.260.130.241.040.750.320.52

16

Seconds per first 1K tokens output
Is it fast?20.8512.8323.34164.9130.7812.9728.8630.7822.4619.9616.746.4318.0922.4618.1950.3863.3217.2047.243.5516.7714.305.5516.266.66142.1819.829.1112.4815.4320.7717.2013.299.0227.4118.628.6229.5731.1618.7018.6711.4914.3015.253.5513.209.8513.4924.0929.0433.0818.2416.655.895.7921.5138.2538.1412.9732.9814.3516.6017.0910.1719.9017.1526.5127.6119.18113.9236.165.8613.9116.8814.368.145.5115.8635.3410.2815.508.0759.5215.5313.2622.9759.1717.1315.456.848.2268.0312.5815.82

17

Seconds per first 4k tokens output56.7745.0747.45390.4794.2645.1195.5494.2683.9174.1139.6715.2143.4583.9157.59153.97250.1265.29130.0313.8062.4231.9521.1048.3424.72565.9177.3935.1046.6257.6946.1065.2951.2723.0870.0572.1731.50116.05121.4270.7171.8644.7731.9558.8013.8045.0737.7850.9187.18113.60121.5771.2963.1422.1622.0984.58148.06149.9149.35129.7250.3763.1367.5339.8277.3767.85104.07109.9675.19450.24137.9322.1554.5366.1553.8731.4422.0559.71141.3439.5760.5831.51238.1062.1152.6591.15236.6965.3959.5326.4131.31272.1150.3163.29

18

COST

19

Cost Variants (Cached 1M input, queries…)$7.500$37.500$0.550$7.500$1.500$3.75write / $0.3read$1.250$0.025$0.313$1.250$3.75write / $0.3read

$18.75write/$1.50read

$0.075$0.019$0.30write/$0.03read

20

Cost Uncached Input (1M tokens)$1.100$1.250$1.250$1.250$20.000$2.000$1.250$1.250$2.000$15.000$3.000$1.250$0.300$0.150$1.100$15.000$1.100$15.000$75.000$3.000$0.550$0.100$3.000$0.300$5.000$1.100$0.150$4.000$2.000$0.300$15.000$0.900$3.000$0.400$0.100$0.400$3.000$3.000$0.900$3.000$0.200$0.220$0.300$2.500$0.100$2.000$0.100$1.250$3.000$1.600$2.500$3.000$0.075$2.500$0.120$0.380$0.300$0.140$10.000$1.000$3.500$2.000$0.600$2.000$5.000$0.050$0.170$2.000$1.070$15.000$0.100$0.800$0.150$3.500$0.100$0.075$0.800$30.000$0.900$1.200$0.100$0.140$0.630$0.060$2.000$0.140$0.800$2.000$0.075$0.900$3.000$0.200$3.000

21

Cost Output (1M tokens)$4.400$10.000$10.000$10.000$80.000$8.000$10.000$10.000$8.000$75.000$15.000$10.000$0.500$3.500$4.400$75.000$4.400$60.000$150.000$15.000$2.190$0.400$15.000$0.100$15.000$4.400$0.600$4.000$8.000$2.500$60.000$0.900$15.000$1.600$0.500$2.000$12.000$15.000$0.900$15.000$0.800$0.880$0.100$10.000$0.400$2.000$0.400$2.500$15.000$6.400$10.000$15.000$0.300$10.000$0.180$0.400$0.500$0.580$30.000$1.000$3.500$10.000$0.600$6.000$15.000$0.100$0.680$6.000$1.140$75.000$0.400$3.200$0.600$10.500$0.300$0.300$4.000$60.000$0.900$1.200$0.300$0.280$0.650$0.240$8.000$0.280$0.800$8.000$0.300$0.900$15.000$0.600$3.000

22

Cost 1M tokens (3:1 input:output)
Is it cheap?$1.925$3.438$3.438$3.438$35.000$3.500$3.438$3.438$3.500$30.000$6.000$3.438$0.350$0.988$1.925$30.000$1.925$26.250$93.750$6.000$0.960$0.175$6.000$0.250$7.500$1.925$0.263$4.000$3.500$0.850$26.250$0.900$6.000$0.700$0.200$0.800$5.250$6.000$0.900$6.000$0.350$0.385$0.250$4.375$0.175$2.000$0.175$1.563$6.000$2.800$4.375$6.000$0.131$4.375$0.135$0.385$0.350$0.250$15.000$1.000$3.500$4.000$0.600$3.000$7.500$0.063$0.298$3.000$1.088$30.000$0.175$1.400$0.263$5.250$0.150$0.131$1.600$37.500$0.900$1.200$0.150$0.175$0.635$0.105$3.500$0.175$0.800$3.500$0.131$0.900$6.000$0.300$3.000

23

24

25

COST VS (CAPABILITY, THROUGHPUT)

26

Cost 1M (3:1 IO) tokens per Composite Capability point
Is capability in line with cost?$0.020$0.034$0.036$0.036$0.369$0.037$0.036$0.037$0.037$0.324$0.066$0.038$0.004$0.011$0.022$0.346$0.022$0.308$1.103$0.071$0.012$0.002$0.073$0.003$0.093$0.024$0.003$0.050$0.044$0.011$0.341$0.012$0.080$0.009$0.003$0.011$0.073$0.084$0.013$0.088$0.005$0.006$0.004$0.065$0.003$0.030$0.003$0.023$0.090$0.042$0.067$0.091$0.002$0.067$0.002$0.006$0.006$0.004$0.244$0.017$0.060$0.069$0.011$0.053$0.133$0.001$0.005$0.056$0.020$0.563$0.003$0.027$0.005$0.102$0.003$0.003$0.032$0.751$0.018$0.024$0.003$0.004$0.014$0.002$0.080$0.004$0.018$0.082$0.003$0.021$0.149$0.008$0.075

27

Cost 1M (3:1 IO) tokens per AA Intelligence Index v2 point$0.028$0.050$0.049$0.051$0.493$0.053$0.051$0.051$0.056$0.469$0.098$0.055$0.005$0.016$0.029$0.517$0.423$0.105$0.016$0.003$0.113$0.004$0.150$0.031$0.005$0.077$0.066$0.016$0.016$0.125$0.013$0.005$0.016$0.097$0.118$0.020$0.136$0.007$0.008$0.006$0.107$0.004$0.044$0.004$0.035$0.140$0.062$0.107$0.003$0.109$0.003$0.009$0.006$0.023$0.085$0.015$0.081$0.197$0.002$0.007$0.081$0.857$0.004$0.038$0.007$0.004$0.005$0.046$0.026$0.036$0.004$0.003$0.121$0.121$0.011$0.107

28

Cost 1M (3:1 IO) tokens per AA Index v1 point$0.292$0.011$0.022$0.305$0.063$0.011$0.075$0.024$0.002$0.020$0.035$0.056$0.079$0.002$0.005$0.200$0.047$0.008$0.041$0.104$0.004$0.041$0.015$0.429$0.019$0.004$0.002$0.024$0.013$0.018$0.002$0.003$0.009$0.002$0.055$0.003$0.013$0.015$0.105$0.005$0.052

29

Cost 1M (3:1 IO) tokens per Chatbot Arena ELO point$0.002$0.002$0.002$0.002$0.002$0.001$0.021$0.001$0.019$0.066$0.004$0.001$0.000$0.004$0.005$0.001$0.000$0.003$0.003$0.019$0.001$0.004$0.001$0.000$0.001$0.004$0.005$0.001$0.004$0.000$0.000$0.003$0.000$0.001$0.002$0.003$0.005$0.000$0.003$0.000$0.000$0.012$0.003$0.003$0.000$0.000$0.000$0.002$0.001$0.024$0.000$0.001$0.000$0.004$0.000$0.001$0.030$0.001$0.000$0.001$0.000$0.003$0.000$0.001$0.000$0.001$0.005

30

Cost 1M (3:1 IO) tokens per LiveBench point$0.024$0.044$0.050$0.045$0.470$0.043$0.048$0.045$0.044$0.377$0.076$0.046$0.005$0.025$0.419$0.026$0.347$1.422$0.081$0.013$0.086$0.116$0.027$0.004$0.060$0.056$0.015$0.092$0.012$0.003$0.014$0.091$0.105$0.015$0.104$0.007$0.004$0.076$0.037$0.003$0.029$0.100$0.045$0.079$0.002$0.003$0.007$0.007$0.298$0.021$0.067$0.074$0.012$0.153$0.002$0.007$0.024$0.611$0.004$0.032$0.006$0.003$0.003$0.037$0.020$0.004$0.003$0.021$0.009

31

Cost 1M (3:1 IO) tokens per Throughput (first 1k tokens output)
Is throughput in line with cost?$0.092$0.268$0.147$0.212$0.114$0.265$0.119$0.114$1.336$0.301$0.021$0.154$0.106$1.336$0.106$0.521$1.481$0.349$0.020$0.049$0.358$0.017$1.351$0.118$0.039$0.028$0.177$0.093$2.104$0.058$0.349$0.053$0.007$0.043$0.609$0.203$0.029$0.321$0.019$0.034$0.017$0.287$0.049$0.152$0.018$0.116$0.249$0.096$0.240$0.360$0.022$0.755$0.006$0.010$0.009$0.019$0.455$0.070$0.211$0.234$0.059$0.151$0.437$0.002$0.011$0.156$0.010$0.830$0.030$0.101$0.016$0.366$0.018$0.024$0.101$1.061$0.088$0.077$0.019$0.003$0.041$0.008$0.152$0.003$0.047$0.227$0.019$0.110$0.088$0.024$0.190

32

33

34

BENCHMARKS

35

Scores are opportunistically sourced from HuggingFace leaderboards, model cards, docs & press releases, other 3rd parties.

Many reproducibility, precision, and accuracy issues due to selection bias & varied methodologies (n-shot, turns, CoT, benchmark variant…), susceptible to prompt sensitivity, construct validity, and contamination.

Models used by HuggingFace Leaderboard 2 (2024 June 26) are highlighted in GREEN.

Models used by HuggingFace Leaderboard 1 are highlighted in ORANGE. These benchmarks are increasingly outdated for various reasons, eg benchmark inaccuracies or they are effectively "solved" by frontier models (due to training to beat the benchmark, leaked criteria, or outdated level of difficulty).

36

General

37

MMLU PRO
HuggingFace Leaderboard 2
Scores per category (Biology, Business, Engineering, …)84.084.084.079.077.076.081.080.076.274.076.078.067.080.077.675.876.072.676.171.671.086.067.063.773.368.970.085.045.370.468.068.569.063.169.066.867.365.064.866.467.066.363.652.653.553.659.156.253.058.1

38

MMLU
HuggingFace Leaderboard 1
Most widely reported, but many known issues.
86.992.091.085.990.885.287.089.088.086.087.088.788.785.078.686.488.686.074.584.884.086.885.982.081.980.675.081.086.486.086.082.078.583.880.581.275.278.982.083.283.8

39

Humanity's Last Exam
Technical knowledge and reasoning via structured academic problems18.812.18.35108.5711.1546.6757.24.054.8554.895.152.624.43553.93.9750.845.53

40

HELM90.893.888.590.885.880.372.288.570.179.391.583.882.770.874.273.353.0

41

Kagi LLM Benchmarking
Omitted - scores highly variable over time

42

SimpleBench38.751.653.122.840.134.546.430.941.731.344.930.718.118.941.427.718.927.117.827.525.119.922.719.922.523.510.7

43

TriviaQA85.578.2

44

Arena Hard79.279.378.084.075.473.260.489.772.037.987.362.346.957.549.646.663.7

45

SimpleQA52.947.063.044.329.938.021.715.010.4

46

Instruction Following

47

IFEval
HuggingFace Leaderboard 283.990.885.688.088.692.163.092.185.084.387.582.979.989.780.7

48

Math

49

MATH
Multiple difficulty levels, inconsistently reported.
HuggingFace Leaderboard 2 uses Level 5 (most difficult)97.995.095.097.097.394.094.896.091.882.290.090.085.078.389.092.090.986.583.075.971.186.882.085.088.073.473.877.073.043.380.460.176.670.267.771.077.969.264.568.068.070.635.173.353.554.950.462.4

50

GSM8K
HuggingFace Leaderboard 196.196.482.696.871.093.095.094.890.894.295.185.394.587.085.492.6

51

HiddenMath87.383.379.656.765.263.663.552.055.328.047.220.3

52

AIME 202586.7

53

AIME 202436.749.052.079.523.392.016.051.09.39.610.025.09.05.3

54

MathVista73.268.163.869.456.763.968.965.857.358.4

55

Reasoning

56

GPQA
HuggingFace Leaderboard 284.079.777.071.477.070.075.062.078.059.064.768.060.056.065.053.061.060.159.153.053.159.460.153.049.043.049.151.150.551.043.025.456.147.050.446.940.246.044.451.041.641.446.746.745.316.342.036.934.441.439.543.5

57

BIG-BENCH-HARD
HuggingFace Leaderboard 293.177.772.686.886.984.083.157.582.485.5

58

MuSR
HuggingFace Leaderboard 217.319.7

59

ARC Challenge
HuggingFace Leaderboard 196.796.770.696.968.994.896.494.868.892.493.071.4

60

HellaSwag
HuggingFace Leaderboard 185.684.295.495.387.385.7

61

TruthfulQA (MC2)
HuggingFace Leaderboard 154.761.8

62

WinoGrande
HuggingFace Leaderboard 178.874.384.582.9

63

ARC Easy

64

DROP79.887.177.285.472.283.179.780.9

65

Social IQA

66

MedQA

67

Code

68

Aider polyglot72.072.979.649.360.461.744.964.956.945.353.847.155.152.420.938.260.432.418.232.948.451.615.618.222.121.827.112.017.88.93.628.0

69

HumanEval97.098.098.097.092.092.498.092.092.492.093.795.097.090.056.593.090.292.090.082.088.089.087.689.088.487.045.782.692.084.988.087.271.988.488.186.680.584.864.684.073.281.776.882.3

70

HumanEval Plus82.887.0

71

LiveCodeBench v570.436.034.528.9

72

MBPP Base65.660.480.0

73

MBPP EvalPlus90.588.687.769.083.686.0

74

Natural2Code92.985.482.679.877.2

75

Tool use

76

BFCL v359.357.556.562.257.461.956.852.160.555.157.953.753.755.951.5

77

BFCL v277.281.177.377.5

78

BFCL v1 (Berkeley Function Calling)66.480.580.288.586.388.384.885.5

79

Nexus56.145.758.750.356.7

80

Agentic

81

SWE-bench Verified49.033.433.4

82

Conversational

83

MT Bench93.274.086.383.383.578.692.6

84

Multimodal (Image)

85

MMMU81.774.478.272.771.875.470.472.765.968.369.168.056.164.050.359.459.462.262.862.356.156.3

86

Multilingual

87

MGSM90.591.674.388.691.691.164.380.690.787.087.586.586.986.975.582.6

88

Long Context

89

ZeroSCROLLS/QuALITY90.590.595.295.290.5

90

InfiniteBench/En.MC82.583.472.178.2

91

NIH/Multi-needle100.090.898.1100.097.5

92

MRCR (1M)74.770.558

93

MISC

94

Reasoning Tokens

95

Licenseproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryopenproprietaryproprietaryopenproprietaryproprietaryproprietaryopenproprietaryproprietaryproprietaryopenproprietaryproprietaryproprietaryproprietaryopenproprietaryproprietaryproprietaryopenproprietaryopenopenopenproprietaryproprietaryopenproprietaryproprietaryproprietaryproprietaryopenproprietaryproprietaryproprietaryproprietaryopenopenopenopenproprietaryproprietaryopenproprietaryopenopenproprietaryopenopenopenopenproprietaryproprietaryproprietaryproprietaryproprietaryproprietaryopenproprietaryproprietaryproprietaryopenopenopenproprietaryopenopenproprietaryopenopenopenopenproprietaryopenproprietaryopenproprietary

96

OrganizationOpenAIGoogleGoogleOpenAIOpenAIGoogleOpenAIAnthropicAnthropicxAiGoogleOpenAIAnthropicOpenAIOpenAIOpenAIAnthropicDeepSeekGoogleAnthropicAlibabaOpenAIOpenAIGoogleDeepSeekOpenAIGoogleOpenAIAlibabaGoogleAnthropicOpenAIGoogleAlibabaMistralOpenAIxAIDeepSeekAnthropicRekaMetaAlibabaOpenAIGoogleDeepSeekGoogleGooglePerplexityAlibabaMicrosoftOpenAIAnthropicGoogleCohereDeepSeekAlibabaGoogleMetaOpenAIPerplexityMetaxAIMetaMistralxAIGoogleMicrosoftMistralDeepSeekAnthropicOpenAIAmazonOpenAITencentGoogleMistralGoogleAnthropicOpenAIMetaMetaMistralTencentDeepSeekAlibabaAmazonAI21DeepSeekGoogleAI21GoogleMetaRekaMistral01.AI

97

Search

98

99

100