Predicting LLM Parameters Using Benchmarks


In a recent episode of the Latent Space podcast (Artificial Analysis: The Independent LLM Analysis House — with George Cameron and Micah Hill-Smith), the Artificial Analysis team pointed out a strong correlation between model performance on their AA-Omniscience Accuracy benchmark and the model’s parameter count:

An interesting thing about this accuracy metric is that it tracks more closely than anything else that we measure the total parameter count of models… If you draw the line on AA-Omniscience accuracy vs total parameters, … you can see that likely the leading frontier models right now are quite a lot bigger than the 1 trillion parameters that the open-weights models cap out at… There’s an interesting extra data point that Elon Musk revealed recently about xAI: 3 trillion parameters for Grok 3 and 4, and 6 trillion for Grok 5 (but that’s not out yet).

Well, that caught my attention, so I had to see for myself.

AA-Omniscience Accuracy appears strongly correlated to (log) model parameter count


Experiment

Data Collection

My objective is to see whether the Omniscience Accuracy benchmark (or any other benchmark) is predictive of parameter counts. If so, I’ll use benchmark scores to estimate parameter counts for models like GPT-5.2, Gemini 3 Pro, and Claude Sonnet and Opus 4.5. To that end, I collected benchmark data from Artificial Analysis. I also collected information about model sizes from HuggingFace, release announcements, or technical papers on arxiv.org, and pricing information from Artificial Analysis and Simon Willison’s LLM Prices project.

(Figure: AA-Omniscience Accuracy)

Modeling

I came up with a number of questions as I was investigating:

  1. Can we predict model size based on benchmarks (and if so, which benchmark)?
  2. It strikes me that token prices may hint at model sizes (cost to host the model), so is pricing information predictive of model size?
  3. Most recent frontier models use a mixture-of-experts architecture; does including sparsity information make the model size prediction more accurate?

I fit 15 different linear regressions across 5 benchmarks: Omniscience Accuracy [1], MMLU Pro [2], Artificial Analysis’ Intelligence Index [3], Tau² [4], and GDPVal [5]. Each benchmark was used both as a standalone predictor and in conjunction with pricing or sparsity. Academic papers tend to define sparsity based on the ratio of inactive experts to total experts [6] [7]. However, model labs tend to share total vs. active parameter counts, and may not disclose the number of experts or the expert architecture. I therefore modeled (roughly) the inverse of sparsity as the ratio of active to total parameters per token.
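
To make the setup concrete, here is a minimal sketch of how these 15 regressions could be fit with scikit-learn. The CSV file and column names (total_params_b, active_params_b, blended_price_per_m_tokens, and the benchmark columns) are placeholders rather than the actual dataset schema; the regression targets log10 of total parameters because parameter counts span several orders of magnitude, and the chart in the Metrics section reports mean values, which suggests cross-validation that this sketch omits for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Placeholder dataset: open-weight models with disclosed sizes.
df = pd.read_csv("open_weight_models.csv")
df["log_params"] = np.log10(df["total_params_b"])                   # regress in log space
df["active_ratio"] = df["active_params_b"] / df["total_params_b"]   # ~inverse of MoE sparsity

benchmarks = ["omniscience_accuracy", "mmlu_pro", "intelligence_index", "tau2", "gdpval"]
feature_sets = {
    "alone": [],
    "+price": ["blended_price_per_m_tokens"],
    "+sparsity": ["active_ratio"],
}

rows = []
for bench in benchmarks:
    for label, extra in feature_sets.items():      # 5 benchmarks x 3 variants = 15 fits
        cols = [bench] + extra
        X, y = df[cols].values, df["log_params"].values
        reg = LinearRegression().fit(X, y)

        pred_b = 10 ** reg.predict(X)              # back-transform to billions of parameters
        rows.append({
            "predictors": f"{bench} {label}",
            "r2": r2_score(y, reg.predict(X)),                                    # fit in log space
            "mae_b": mean_absolute_error(df["total_params_b"], pred_b),           # error in billions
            "rmse_b": np.sqrt(mean_squared_error(df["total_params_b"], pred_b)),  # error in billions
        })

print(pd.DataFrame(rows).sort_values("r2", ascending=False))
```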

Note

  • Omniscience is Artificial Analysis’ own benchmark which rewards precise knowledge and penalizes hallucinated responses. Omniscience Accuracy is the “correctness” component of Omniscience, measuring the proportion of correctly answered questions out of all questions, regardless of whether the model chooses to answer. Omniscience also tracks Hallucination Rate (how often the model answers incorrectly when it should have refused or admitted to not knowing the answer) and Attempt Rate.
  • MMLU Pro is an enhanced version of the Massive Multitask Language Understanding (MMLU) benchmark designed to evaluate language understanding, integrating more challenging, reasoning-focused questions and a greater spread of possible response options.
  • Artificial Analysis Intelligence Index is a composite of 10 different benchmarks, used to compare LLM capabilities across a broad range of use cases. Artificial Analysis runs its own independent tests so model performance can be compared apples-to-apples.
  • Tau² measures an LLM’s capability to drive agentic decisions over a variety of real-world scenarios. Tau² simulates user and AI Agent interactions over domain-specific tasks and evaluates success on these tasks.
  • GDPVal is an OpenAI-developed benchmark designed to track how well AI models perform on economically valuable, real-world tasks. Tasks were developed and evaluated in partnership with human experts.

Metrics

(Figure: three-panel summary of how well each benchmark/feature combination predicts total model parameters, with mean R² on top and MAE/RMSE below. Omniscience Accuracy and MMLU Pro variants have the highest R² (roughly 0.75 to 0.84), Intelligence Index variants sit near zero, and some combinations go negative. The error panels follow the same pattern: very large errors for Intelligence Index + price and the lowest errors for the MMLU Pro and Omniscience Accuracy variants.)

As mentioned in that podcast episode, Omniscience Accuracy is indeed the most predictive (R² = 0.84), followed by MMLU Pro (R² = 0.75) and trailed by the Intelligence Index (R² = 0.07). (As a reminder, R² measures how much of the total-parameter variance the predictor(s) account for - a “goodness of fit” metric.) The errors (mean absolute error (MAE) and root mean squared error (RMSE)) are generally around 200B total parameters - this is not a precise estimator!
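
For concreteness, these three metrics can be computed by hand. The numbers below are made up purely to illustrate the formulas; they have nothing to do with the actual regression results.

```python
import numpy as np

# Made-up values (billions of total parameters), for illustration only.
y_true = np.array([100.0, 200.0, 400.0, 700.0, 1000.0])   # "actual" sizes
y_pred = np.array([150.0, 300.0, 380.0, 550.0, 850.0])    # "regression" estimates

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares

r2 = 1 - ss_res / ss_tot                          # share of variance explained
mae = np.mean(np.abs(y_true - y_pred))            # average miss, in billions
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))   # penalizes large misses more

print(f"R^2={r2:.2f}  MAE={mae:.0f}B  RMSE={rmse:.0f}B")
```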

Adding pricing information made the regression fit worse in every case; adding pricing to the Intelligence Index in particular caused the prediction error to explode. The active token ratio had no apparent effect on prediction quality.

Tau² and GDPVal have negative R² values, indicating that these benchmarks are not predictive at all. I found it interesting that the benchmarks that test knowledge had the best fit, while the benchmarks that test task performance (Tau², GDPVal) had no predictive power. This hints at parameter count being innately tied to a model’s knowledge capacity, while task performance is something that can be improved in post-training [8] [9].


Predictions

Given these metrics, we can use Omniscience Accuracy (or MMLU Pro or Intelligence Index) to estimate the size of proprietary models (GPT-5.x, Gemini, Claude Sonnet/Opus).
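
A minimal sketch of that extrapolation step, continuing the earlier example: fit on the open-weight models with disclosed sizes, then plug in the proprietary models’ benchmark scores and back-transform out of log space. File and column names are again placeholders, not the actual data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Open-weight models with disclosed total parameter counts (placeholder file).
known = pd.read_csv("open_weight_models.csv")
reg = LinearRegression().fit(
    known[["omniscience_accuracy"]],
    np.log10(known["total_params_b"]),
)

# Proprietary models: benchmark scores only, no published sizes (placeholder file).
closed = pd.read_csv("proprietary_models.csv")
closed["est_total_params_b"] = 10 ** reg.predict(closed[["omniscience_accuracy"]])

print(closed[["model", "est_total_params_b"]].sort_values("est_total_params_b"))
```

Swapping the omniscience_accuracy column for mmlu_pro or the Intelligence Index column yields the alternative estimates discussed below.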


Interestingly, while Omniscience Accuracy has the best fit metrics (R², MAE, RMSE), it produces the least realistic predictions for the set of proprietary models. The Omniscience Accuracy regression suggests that Gemini 3 Pro (preview) has 1,254T (yes, T!) total parameters, with GPT-5.2 at 43T and Claude Opus 4.5 at 22T. I find this completely unrealistic; I think it would be quite infeasible to effectively serve models that size. Further, although pricing was not predictive in the regression, I find it hard to believe that Gemini 3 Pro could be priced competitively were it that large.

Judged against my expectations, the Intelligence Index regression seems to provide the most realistic predictions - Gemini 3 Pro (preview) at 3.4T, Claude 4.5 Sonnet at 1.4T, and Claude 4.5 Opus at 4.1T. It predicts different parameter counts for GPT-5.1 and GPT-5.2, although I assume they share the same architecture, placing the GPT-5.x series between 2.9T and 5.3T total parameters. Using this model, GPT-5 mini is estimated at 1T total parameters, GPT-5 nano at 100B, and Claude 4.5 Haiku at 520B. All of these are roughly in line with my personal “vibe checks” on model capability, especially as code assistants.


Finally, I did run some experiments with the active token ratio as an additional predictive feature. As the metrics suggest, it has no bearing on the predicted total parameters.

Conclusion

In the end, does estimating total parameter count matter? Not really; it’s merely one factor among many that may contribute to model performance. This is quite evident in how little predictive power task-capability benchmarks like Tau² and GDPVal have for model size. That said, I find it useful to understand the relationships between model sizes, architectures, and their performance metrics. And I had fun trying to pierce the proprietary labs’ veil of secrecy!

As Swyx said:

What does it really matter? As long as they can serve it at a sustainable cost, that’s about it.


Note

Disclosure of AI Assistance

I used ChatGPT web search and Deep Research to help with the data gathering for this experiment, especially when identifying model specs and associated sources. I manually validated every spec and source.
I used coding agents (GitHub Copilot, OpenAI Codex) to accelerate scraping benchmark and pricing data, and to speed up defining the plots. I also used them for code review. I reviewed and revised all AI-generated code.
None of the blog post itself was drafted or generated by AI tools.


References

Cite this article

If you would like to reference this article, please consider citing it as:

Graber, A. H. (2026, Jan 19). Predicting LLM Parameters Using Benchmarks. AI/MLbling-About. https://aimlbling-about.ninerealmlabs.com/blog/benchmarks-predict-model-size/

Or with BibTeX:

@online{graber2026_benchmarkspredictmodelsize,
  author = {Graber, A. H.},
  title = {Predicting LLM Parameters Using Benchmarks},
  year = {2026},
  date = {2026-01-19},
  url = {https://aimlbling-about.ninerealmlabs.com/blog/benchmarks-predict-model-size/},
  urldate = {2026-01-30},
  note = {Blog post}
}