A baseline workflow that tests a model, or an ensemble of models, on its ability to navigate an agentic system. The tests cover tasks such as classification, function calling, and intent detection, along with other workflow behaviors, to evaluate overall performance quality.
Agentic systems favor smaller, faster models. We escalate to larger models or deeper reasoning only when a lower-parameter or lower-reasoning configuration fails and there is enough latency headroom.
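That escalation policy can be sketched as follows. This is a minimal illustration, not the actual harness: `run_small`, `run_large`, and the latency budget are hypothetical placeholders.

```python
import time

def run_with_escalation(task, run_small, run_large, latency_budget_s=4.0):
    """Try the smaller/faster model first; escalate to the larger model
    only on failure, and only if enough latency headroom remains."""
    start = time.monotonic()
    result = run_small(task)            # cheap, fast attempt first
    if result is not None:              # small model succeeded: done
        return result
    elapsed = time.monotonic() - start
    if elapsed < latency_budget_s:      # headroom left: escalate
        return run_large(task)
    return None                         # fail rather than blow the budget

# Usage: the small model fails (returns None), so the call escalates.
answer = run_with_escalation(
    "classify this message",
    run_small=lambda t: None,
    run_large=lambda t: "route:billing",
)
```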
Setup
The setup for the agent tests the following capabilities of the AI agents:
- Routing - Can the model route the user message to the correct execution path?
- Tool Use - Does it call tools accurately with valid structured arguments?
- Instruction Following - Can it follow instructions in the system prompt correctly?
- Resolving Constraints - Does it understand how to resolve constraints provided in the user's query?
- Multi-turn Conversation - Can the model handle multi-turn conversation and still do all of the above correctly?
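As an illustration of the Tool Use check, a minimal validator for structured tool-call arguments might look like the sketch below. The schema and tool shape are hypothetical examples, not the harness's real tools.

```python
import json

# Hypothetical argument schema for a single tool: name -> required type.
QUERY_DB_SCHEMA = {"table": str, "filters": dict}

def valid_tool_call(raw: str, schema=QUERY_DB_SCHEMA) -> bool:
    """Check that a model's tool call is valid JSON carrying the
    required argument names with the right types (the Tool Use criterion)."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError:
        return False                    # not even parseable JSON
    return all(
        key in args and isinstance(args[key], typ)
        for key, typ in schema.items()
    )

good = valid_tool_call('{"table": "orders", "filters": {"status": "open"}}')
bad = valid_tool_call('{"table": "orders"}')  # missing "filters"
```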
If the model fails any of the above, the workflow fails. We dig into the model's outputs to find the gaps, and we compare our results against those on Artificial Analysis to verify whether published scores reflect the capabilities outlined in these tests.
Given the simplicity of this test, models that benchmark with high scores should pass it without issues.
Architecture
The following is the configured architecture of the workflow that the LLM will follow:
LLM Tasks
These tasks correspond to the numbered LLM nodes in the architecture diagram.
| # | Name | Description |
|---|---|---|
| 1 | routing | Classify the user request and route it to the correct workflow path. |
| 2 | query-database | Generate a valid database query from the request and available context. |
| 3 | routing | Route the chain output to the response stage that will produce user-facing output. |
| 4 | create-artifact | Build a structured artifact from results so the UI can render a useful view. |
| 5 | stream-response | Compose and stream the final response to the user with the generated artifact. |
Results
Summary
The key findings and results are summarized in a single rankings table.
Output Score Guide
| Range | Remark |
|---|---|
| 9.1 - 10.0 | Likely works well for multi-turn agentic chat. |
| 8.1 - 9.0 | Can partially work, or be made to work well with some prompt engineering. |
| 0.0 - 8.0 | Should not be used for agentic systems. |
Latency Feel Guide
| Range | Remark |
|---|---|
| 8.1 - 10.0 | Feels fast and pleasant to use. |
| 6.1 - 8.0 | Acceptable, but slightly slow. |
| 0.0 - 6.0 | Almost unusable. |
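The two guides above amount to simple threshold lookups; a sketch (the function names are ours, not part of the harness):

```python
def output_rating(score: float) -> str:
    """Map an output score (0-10) to the Output Score Guide's remark."""
    if score > 9.0:
        return "Likely works well for multi-turn agentic chat."
    if score > 8.0:
        return "Can partially work, or be made to work well with some prompt engineering."
    return "Should not be used for agentic systems."

def latency_rating(score: float) -> str:
    """Map a latency score (0-10) to the Latency Feel Guide's remark."""
    if score > 8.0:
        return "Feels fast and pleasant to use."
    if score > 6.0:
        return "Acceptable, but slightly slow."
    return "Almost unusable."
```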
| # | Model | Overall Score | Output Score | Inference Provider | Latency Score | Turns Scored | Quality |
|---|---|---|---|---|---|---|---|
| 1 | Mistral Suite | 9.7 | 9.7 | Mistral | 9.7 | 3 output, 3 latency | 97% |
| 2 | Minimax 2.5 | 4.1 | 5.3 | Fireworks AI | 1.3 | 2 output, 2 latency | 41% |
| 3 | Gemini 3 | 9.2 | 10.0 | Google AI Studio | 7.3 | 3 output, 3 latency | 92% |
| 4 | Gemini 3.1 | 9.6 | 10.0 | Google AI Studio | 8.7 | 3 output, 3 latency | 96% |
| 5 | GLM 5 | 8.7 | 10.0 | Fireworks AI | 5.7 | 3 output, 3 latency | 87% |
| 6 | Kimi K2.5 | 7.7 | 8.7 | Fireworks AI | 5.3 | 3 output, 3 latency | 77% |
| 7 | Grok 4.1 Fast | 3.4 | 3.3 | xAI | 3.7 | 2 output, 2 latency | 34% |
| 8 | Grok Code Fast 1 Suite | 8.5 | 10.0 | xAI | 5.0 | 3 output, 3 latency | 85% |
| 9 | GPT OSS 120b | 6.5 | 6.7 | Groq | 6.0 | 2 output, 2 latency | 65% |
| 10 | GPT 5 Mini | 8.9 | 10.0 | OpenAI | 6.3 | 3 output, 3 latency | 89% |
| 11 | Nemotron 3 Super 120b | 2.2 | 2.0 | Nebius | 2.7 | 1 output, 1 latency | 22% |
| 12 | Qwen 3.5 | 7.8 | 10.0 | Together AI | 2.7 | 3 output, 3 latency | 78% |
| 13 | DeepSeek 3.2 | 4.9 | 6.0 | Fireworks AI | 2.3 | 2 output, 2 latency | 49% |
| 14 | Mistral Small 4 | 9.4 | 9.3 | Mistral | 9.7 | 3 output, 3 latency | 94% |
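Every row in the table is consistent with the Overall Score being a 70/30 weighted average of the Output and Latency scores, rounded to one decimal, with Quality equal to the Overall Score expressed as a percentage. This is our inference from the rows, not a formula stated by the harness:

```python
def overall(output_score: float, latency_score: float) -> float:
    """Reconstructed weighting: each table row matches
    0.7 * output + 0.3 * latency, rounded to one decimal."""
    return round(0.7 * output_score + 0.3 * latency_score, 1)

# Spot-checks against rows from the table above:
# (output, latency, published overall)
rows = [
    (9.7, 9.7, 9.7),    # Mistral Suite
    (10.0, 7.3, 9.2),   # Gemini 3
    (3.3, 3.7, 3.4),    # Grok 4.1 Fast
    (6.0, 2.3, 4.9),    # DeepSeek 3.2
]
checks = [overall(o, l) == expected for o, l, expected in rows]
```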