The Disconnect Between AI Benchmarks and Math Research

What Mathematicians Actually Ask: Question Types

Using an LLM to analyze and classify the questions submitted to Sugaku, we found that mathematicians most often ask for help finding relevant references or about applications of mathematics to other fields; a sketch of the classification step appears after the table.

Question Type                  %
Find Relevant Resources       24
Application to X              14
Off-Topic / Not Math          12
Ask about Specific Person     10
Explain Concept/Definition     9
Calculate/Compute              8
Ask about Specific Paper       7
Proof Assistance               6
Problem Solving                6
Research Suggestion/Direction  2
Teaching Advice                2
Website support                1
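
The classification pipeline itself is not published; below is a minimal sketch of how such a step might look, assuming the OpenAI chat API and a placeholder model. The prompt wording and the helper names (`classify`, `category_shares`) are illustrative, not Sugaku's actual code.

```python
# Minimal sketch of LLM-based question classification, using the category
# list from the table above. Model choice and prompt are assumptions.
from collections import Counter

from openai import OpenAI

CATEGORIES = [
    "Find Relevant Resources", "Application to X", "Off-Topic / Not Math",
    "Ask about Specific Person", "Explain Concept/Definition",
    "Calculate/Compute", "Ask about Specific Paper", "Proof Assistance",
    "Problem Solving", "Research Suggestion/Direction", "Teaching Advice",
    "Website support",
]

client = OpenAI()

def classify(question: str) -> str:
    """Ask the LLM to assign exactly one category to a submitted question."""
    prompt = (
        "Classify the following question into exactly one of these categories:\n"
        + "\n".join(f"- {c}" for c in CATEGORIES)
        + f"\n\nQuestion: {question}\n\nReply with the category name only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; the post does not name the classifier model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def category_shares(questions: list[str]) -> dict[str, int]:
    """Return the percentage of questions falling into each category."""
    counts = Counter(classify(q) for q in questions)
    total = sum(counts.values()) or 1
    return {c: round(100 * counts[c] / total) for c in CATEGORIES}
```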

The Evaluation Paradox: Human Judgment vs. AI Judgment

To evaluate system performance, we've gathered feedback through two methods: direct user votes and LLM-based evaluations. The results reveal a striking disconnect.

Human Evaluation

When mathematicians vote on answer quality (+1 or -1), Sugaku-MA1 leads the rankings, followed closely by DeepSeek R1 and o1; a sketch of the assumed vote aggregation appears after the table.

Note: This is NOT a fair comparison since some of these questions were used to help train and calibrate Sugaku. Results updated as more data comes in.

Model               Votes
sugaku-ma1           0.48
deepseek-r1          0.44
o1                   0.43
o3-mini              0.38
gemini-2-pro         0.36
o1-mini              0.35
claude-37-sonnet     0.26
gpt-4o               0.22
gpt-4_5              0.19
sky-t1               0.14
gemini-2-flash       0.08
claude-3-5-haiku    -0.19
claude-3-5-sonnet   -0.19
claude-3-opus       -0.26
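
The post does not say how the per-model score is computed; the short sketch below assumes the "Votes" column is simply the mean of the +1/-1 votes each model received. The pandas code and the toy vote log are illustrative only.

```python
# Sketch of the assumed vote aggregation: mean of +1/-1 votes per model.
import pandas as pd

# Hypothetical raw vote log, one row per (model, vote) pair.
votes = pd.DataFrame({
    "model": ["sugaku-ma1", "sugaku-ma1", "o1", "claude-3-opus"],
    "vote":  [1, 1, 1, -1],
})

# Average vote per model, best first, mirroring the table above.
leaderboard = votes.groupby("model")["vote"].mean().sort_values(ascending=False)
print(leaderboard.round(2))
```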

LLM Evaluation

When we used AI judges to evaluate the same answers, each judge consistently preferred answers from models in its own family, regardless of actual mathematical correctness. Most tellingly, on questions where every system answered incorrectly, the LLM judges still expressed high confidence in the responses from related models, while the human experts correctly identified all of the answers as flawed.

Aside from this self-preference, o1 ranks consistently high across judges and Sugaku ranks consistently low.
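
The judging prompts are not published either; the sketch below shows one way such an LLM-as-judge call could be set up, again assuming the OpenAI chat API. The `judge` function, rubric wording, and 1-10 scale are assumptions, and as the tables show, the rubric items differ slightly between judges.

```python
# Minimal sketch of an LLM-as-judge call: score an answer on a fixed rubric.
import json

from openai import OpenAI

client = OpenAI()

# Rubric items vary slightly across the judges reported below.
RUBRIC = ["accuracy", "relevance", "completeness", "overall"]

def judge(question: str, answer: str, judge_model: str = "gpt-4o") -> dict:
    """Ask a judge model for 1-10 scores on each rubric item, returned as JSON."""
    prompt = (
        "You are grading an answer to a mathematics research question.\n\n"
        f"Question:\n{question}\n\nAnswer:\n{answer}\n\n"
        "Score the answer from 1 to 10 on: " + ", ".join(RUBRIC) + ". "
        'Respond with a JSON object such as {"accuracy": 8, "relevance": 9, ...}.'
    )
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```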

Gemini as a Judge

Gemini prefers Gemini 2 Pro by a wide margin:

Model               Accuracy  Relevance  Completeness  Overall
gemini-2-pro        9.1       9.4        9.1           9.1
gpt-4_5             8.9       9.4        8.3           8.6
o1                  8.9       9.3        8.2           8.5
deepseek-r1         8.7       9.2        8.4           8.5
gemini-2-flash      8.6       9.1        8.3           8.4
o1-mini             8.5       9.0        8.1           8.3
o3-mini             8.3       8.8        7.3           7.8
gpt-4o              8.1       8.7        7.0           7.6
claude-3-5-sonnet   8.0       8.6        7.0           7.4
sky-t1              7.8       8.4        6.9           7.2
claude-37-sonnet    7.9       8.5        6.5           7.2
claude-3-opus       7.6       8.2        6.6           7.0

GPT as a Judge

GPT-4o prefers GPT-4.5, followed by GPT-4o and o1:

Model               Accuracy  Relevance  Understandability  Succinctness  Overall
gpt-4_5             8.3       8.4        8.2                7.7           8.2
gpt-4o              8.0       8.0        8.4                8.1           8.1
o1                  8.2       8.3        8.2                7.7           8.1
gemini-2-pro        8.5       8.6        8.0                7.1           8.0
o1-mini             8.0       8.0        8.1                7.6           7.9
claude-37-sonnet    7.6       7.8        8.2                8.0           7.9
claude-3-5-sonnet   7.6       7.7        8.2                7.9           7.8
o3-mini             7.7       7.8        8.1                7.7           7.8
gemini-2-flash      7.9       8.1        7.9                7.2           7.8
claude-3-opus       7.5       7.6        8.0                7.7           7.7
claude-3-5-haiku    7.1       7.3        8.0                8.0           7.6
sky-t1              7.2       7.4        7.9                7.6           7.5
deepseek-r1         7.8       7.9        7.3                6.2           7.3
sugaku-ma1          6.6       6.6        7.0                6.6           6.7

Claude as a Judge

Claude prefers o1 and o3-mini, followed by Claude 3.7 Sonnet and Claude 3.5 Sonnet:

Model               Accuracy  Relevance  Understandability  Succinctness  Overall
o1                  9.0       9.2        8.9                8.0           8.8
o3-mini             8.3       8.6        8.8                8.3           8.5
claude-37-sonnet    7.8       8.3        8.7                9.1           8.5
claude-3-5-sonnet   7.7       8.3        8.7                9.1           8.4
gpt-4_5             8.3       8.9        8.4                7.9           8.4
gpt-4o              7.7       8.2        8.7                8.7           8.3
gemini-2-pro        8.8       9.1        8.8                6.3           8.2
deepseek-r1         8.3       8.8        8.5                7.4           8.2
o1-mini             8.3       8.7        8.7                6.8           8.1
gemini-2-flash      8.1       8.6        8.6                6.8           8.0
claude-3-5-haiku    6.9       7.7        8.3                9.2           8.0
claude-3-opus       7.1       7.7        8.4                8.7           8.0
sky-t1              7.0       7.6        8.1                7.7           7.6
sugaku-ma1          5.7       5.9        6.9                7.2           6.4

Key Takeaways

The disconnect between what AI benchmarks measure and what mathematicians actually need is significant. Standard benchmarks focus heavily on problem-solving and computation, but real mathematical research requires:

  • Literature navigation: Finding relevant papers across a vast and growing body of work
  • Interdisciplinary connections: Understanding how concepts apply across different mathematical domains
  • Contextual understanding: Knowing not just what is true, but why it matters and how it fits into the broader landscape
  • Research guidance: Helping identify promising directions and potential pitfalls

Perhaps most importantly, our evaluation reveals that using AI to judge AI responses introduces systematic biases that human evaluators don't share. This has significant implications for how we develop and evaluate AI systems for mathematical applications.