What Mathematicians Actually Ask: Question Types
We used an LLM to analyze and classify the questions submitted to Sugaku, and found that mathematicians most often ask for help finding relevant references, followed by questions about applying mathematical concepts to areas outside mathematics.
| Question Type | % of Questions |
|---|---|
| Find Relevant Resources | 24 |
| Application to X | 14 |
| Off-Topic / Not Math | 12 |
| Ask about Specific Person | 10 |
| Explain Concept/Definition | 9 |
| Calculate/Compute | 8 |
| Ask about Specific Paper | 7 |
| Proof Assistance | 6 |
| Problem Solving | 6 |
| Research Suggestion/Direction | 2 |
| Teaching Advice | 2 |
| Website Support | 1 |
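For concreteness, here is a minimal sketch of how such a classification can be run, using the category list from the table above. The prompt wording, the model choice, and the `classify_question` helper are illustrative assumptions, not the exact pipeline behind these numbers.

```python
# Minimal sketch of LLM-based question classification. The categories come from
# the table above; the model name and prompt wording are assumptions.
from openai import OpenAI

CATEGORIES = [
    "Find Relevant Resources", "Application to X", "Off-Topic / Not Math",
    "Ask about Specific Person", "Explain Concept/Definition", "Calculate/Compute",
    "Ask about Specific Paper", "Proof Assistance", "Problem Solving",
    "Research Suggestion/Direction", "Teaching Advice", "Website Support",
]

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def classify_question(question: str) -> str:
    """Ask the model to assign exactly one category to a submitted question."""
    prompt = (
        "Classify the following question into exactly one of these categories:\n"
        + "\n".join(f"- {c}" for c in CATEGORIES)
        + f"\n\nQuestion: {question}\n\nReply with the category name only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any capable chat model would do
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```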
The Evaluation Paradox: Human Judgment vs. AI Judgment
To evaluate system performance, we've gathered feedback through two methods: direct user votes and LLM-based evaluations. The results reveal a striking disconnect.
Human Evaluation
When mathematicians vote on answer quality (+1 or -1), the average vote per model puts Sugaku-MA1 in the lead, followed closely by DeepSeek R1 and O1.
Note: this is NOT a fair comparison, since some of these questions were used to help train and calibrate Sugaku. Results will be updated as more data comes in.
| Model | Average Vote |
|---|---|
| sugaku-ma1 | 0.48 |
| deepseek-r1 | 0.44 |
| o1 | 0.43 |
| o3-mini | 0.38 |
| gemini-2-pro | 0.36 |
| o1-mini | 0.35 |
| claude-37-sonnet | 0.26 |
| gpt-4o | 0.22 |
| gpt-4_5 | 0.19 |
| sky-t1 | 0.14 |
| gemini-2-flash | 0.08 |
| claude-3-5-haiku | -0.19 |
| claude-3-5-sonnet | -0.19 |
| claude-3-opus | -0.26 |
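For reference, the score shown for each model above is, we assume, simply the mean of its +1/-1 ballots. A minimal sketch of that computation, assuming each ballot arrives as a (model, vote) pair:

```python
# Compute the average vote per model from raw (model, vote) ballots,
# where vote is +1 or -1. The table above reports this mean per model.
from collections import defaultdict


def average_votes(ballots: list[tuple[str, int]]) -> dict[str, float]:
    totals: dict[str, int] = defaultdict(int)
    counts: dict[str, int] = defaultdict(int)
    for model, vote in ballots:
        totals[model] += vote
        counts[model] += 1
    return {model: totals[model] / counts[model] for model in totals}


# Example: three upvotes and one downvote give a mean of 0.5.
print(average_votes([("o1", 1), ("o1", 1), ("o1", 1), ("o1", -1)]))  # {'o1': 0.5}
```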
LLM Evaluation
When we used AI judges to evaluate the same answers, we discovered that each system consistently preferred answers from its own family of models, regardless of actual mathematical correctness. Most tellingly, in cases where every system gave an incorrect answer, the LLM judges still expressed high confidence in responses from models similar to themselves, while human experts correctly identified all of the answers as flawed.
Self-preference aside, O1 ranks consistently high across the judges, while Sugaku ranks consistently low.
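Mechanically, each LLM judgment is a rubric-scoring call. The sketch below shows the general shape, with rubric dimensions mirroring the GPT and Claude tables that follow; the prompt wording, the 1-10 scale, the JSON output format, and the plain average standing in for the Overall column are all assumptions rather than the exact setup we used.

```python
# Sketch of an LLM-as-judge call. Assumes a chat-completions-style judge model
# that reliably returns JSON; the rubric, scale, and aggregation are illustrative.
import json

from openai import OpenAI

client = OpenAI()

RUBRIC = ["accuracy", "relevance", "understandability", "succinctness"]


def judge_answer(question: str, answer: str, judge_model: str) -> dict[str, float]:
    """Ask a judge model to score one answer on each rubric dimension."""
    prompt = (
        "You are grading an answer to a mathematics question.\n"
        f"Question:\n{question}\n\nAnswer:\n{answer}\n\n"
        "Rate the answer from 1 to 10 on each of: " + ", ".join(RUBRIC)
        + '. Reply with JSON only, e.g. {"accuracy": 8, "relevance": 9, ...}.'
    )
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    scores = json.loads(response.choices[0].message.content)
    # A plain average stands in for whatever weighting produces the Overall column.
    scores["overall"] = sum(scores[k] for k in RUBRIC) / len(RUBRIC)
    return scores
```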
Gemini as a Judge
Gemini rates Gemini 2 Pro highest by a clear margin:
| Model | Accuracy | Relevance | Completeness | Overall |
|---|---|---|---|---|
| gemini-2-pro | 9.1 | 9.4 | 9.1 | 9.1 |
| gpt-4_5 | 8.9 | 9.4 | 8.3 | 8.6 |
| o1 | 8.9 | 9.3 | 8.2 | 8.5 |
| deepseek-r1 | 8.7 | 9.2 | 8.4 | 8.5 |
| gemini-2-flash | 8.6 | 9.1 | 8.3 | 8.4 |
| o1-mini | 8.5 | 9.0 | 8.1 | 8.3 |
| o3-mini | 8.3 | 8.8 | 7.3 | 7.8 |
| gpt-4o | 8.1 | 8.7 | 7.0 | 7.6 |
| claude-3-5-sonnet | 8.0 | 8.6 | 7.0 | 7.4 |
| sky-t1 | 7.8 | 8.4 | 6.9 | 7.2 |
| claude-37-sonnet | 7.9 | 8.5 | 6.5 | 7.2 |
| claude-3-opus | 7.6 | 8.2 | 6.6 | 7.0 |
GPT as a Judge
GPT-4o rates GPT-4.5 highest, followed by GPT-4o itself and O1:
| Model | Accuracy | Relevance | Understandability | Succinctness | Overall |
|---|---|---|---|---|---|
| gpt-4_5 | 8.3 | 8.4 | 8.2 | 7.7 | 8.2 |
| gpt-4o | 8.0 | 8.0 | 8.4 | 8.1 | 8.1 |
| o1 | 8.2 | 8.3 | 8.2 | 7.7 | 8.1 |
| gemini-2-pro | 8.5 | 8.6 | 8.0 | 7.1 | 8.0 |
| o1-mini | 8.0 | 8.0 | 8.1 | 7.6 | 7.9 |
| claude-37-sonnet | 7.6 | 7.8 | 8.2 | 8.0 | 7.9 |
| claude-3-5-sonnet | 7.6 | 7.7 | 8.2 | 7.9 | 7.8 |
| o3-mini | 7.7 | 7.8 | 8.1 | 7.7 | 7.8 |
| gemini-2-flash | 7.9 | 8.1 | 7.9 | 7.2 | 7.8 |
| claude-3-opus | 7.5 | 7.6 | 8.0 | 7.7 | 7.7 |
| claude-3-5-haiku | 7.1 | 7.3 | 8.0 | 8.0 | 7.6 |
| sky-t1 | 7.2 | 7.4 | 7.9 | 7.6 | 7.5 |
| deepseek-r1 | 7.8 | 7.9 | 7.3 | 6.2 | 7.3 |
| sugaku-ma1 | 6.6 | 6.6 | 7.0 | 6.6 | 6.7 |
Claude as a Judge
Claude rates O1 and O3-mini highest, followed by Claude 3.7 Sonnet and Claude 3.5 Sonnet:
| Model | Accuracy | Relevance | Understandability | Succinctness | Overall |
|---|---|---|---|---|---|
| o1 | 9.0 | 9.2 | 8.9 | 8.0 | 8.8 |
| o3-mini | 8.3 | 8.6 | 8.8 | 8.3 | 8.5 |
| claude-37-sonnet | 7.8 | 8.3 | 8.7 | 9.1 | 8.5 |
| claude-3-5-sonnet | 7.7 | 8.3 | 8.7 | 9.1 | 8.4 |
| gpt-4_5 | 8.3 | 8.9 | 8.4 | 7.9 | 8.4 |
| gpt-4o | 7.7 | 8.2 | 8.7 | 8.7 | 8.3 |
| gemini-2-pro | 8.8 | 9.1 | 8.8 | 6.3 | 8.2 |
| deepseek-r1 | 8.3 | 8.8 | 8.5 | 7.4 | 8.2 |
| o1-mini | 8.3 | 8.7 | 8.7 | 6.8 | 8.1 |
| gemini-2-flash | 8.1 | 8.6 | 8.6 | 6.8 | 8.0 |
| claude-3-5-haiku | 6.9 | 7.7 | 8.3 | 9.2 | 8.0 |
| claude-3-opus | 7.1 | 7.7 | 8.4 | 8.7 | 8.0 |
| sky-t1 | 7.0 | 7.6 | 8.1 | 7.7 | 7.6 |
| sugaku-ma1 | 5.7 | 5.9 | 6.9 | 7.2 | 6.4 |
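One simple way to quantify the self-preference visible across the three tables above is to compare, for each judge, the mean Overall score it assigns to models in its own family against the mean it assigns to everyone else. A minimal sketch, with a family grouping of our own that is illustrative rather than part of the evaluation:

```python
# Quantify judge self-preference: mean Overall score for the judge's own model
# family minus the mean for all other models. Family labels are our own grouping.
FAMILY = {
    "gemini-2-pro": "google", "gemini-2-flash": "google",
    "gpt-4_5": "openai", "gpt-4o": "openai",
    "o1": "openai", "o1-mini": "openai", "o3-mini": "openai",
    "claude-37-sonnet": "anthropic", "claude-3-5-sonnet": "anthropic",
    "claude-3-5-haiku": "anthropic", "claude-3-opus": "anthropic",
    "deepseek-r1": "deepseek", "sky-t1": "sky", "sugaku-ma1": "sugaku",
}


def self_preference(judge_family: str, overall: dict[str, float]) -> float:
    """Mean Overall for the judge's own family minus the mean for the rest."""
    own = [s for m, s in overall.items() if FAMILY[m] == judge_family]
    rest = [s for m, s in overall.items() if FAMILY[m] != judge_family]
    return sum(own) / len(own) - sum(rest) / len(rest)


# Overall column from the GPT-4o-as-judge table above:
gpt4o_overall = {
    "gpt-4_5": 8.2, "gpt-4o": 8.1, "o1": 8.1, "gemini-2-pro": 8.0,
    "o1-mini": 7.9, "claude-37-sonnet": 7.9, "claude-3-5-sonnet": 7.8,
    "o3-mini": 7.8, "gemini-2-flash": 7.8, "claude-3-opus": 7.7,
    "claude-3-5-haiku": 7.6, "sky-t1": 7.5, "deepseek-r1": 7.3, "sugaku-ma1": 6.7,
}
print(round(self_preference("openai", gpt4o_overall), 2))  # ~0.43 points higher for its own family
```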
Key Takeaways
The disconnect between what AI benchmarks measure and what mathematicians actually need is significant. Standard benchmarks focus heavily on problem-solving and computation, but real mathematical research requires:
- Literature navigation: Finding relevant papers across a vast and growing body of work
- Interdisciplinary connections: Understanding how concepts apply across different mathematical domains
- Contextual understanding: Knowing not just what is true, but why it matters and how it fits into the broader landscape
- Research guidance: Helping identify promising directions and potential pitfalls
Perhaps most importantly, our evaluation reveals that using AI to judge AI responses introduces systematic biases that human evaluators don't share. This has significant implications for how we develop and evaluate AI systems for mathematical applications.