Simple Tama Agentic Workflow - Q1 2026 | Upmaru

This is a baseline workflow that tests a model, or an ensemble of models, on its ability to navigate an agentic system. The tests cover tasks such as classification, function calling, intent detection, and other workflow behaviors to evaluate overall performance quality.

Agentic systems favor smaller, faster models. We escalate to larger models or deeper reasoning only when a lower-parameter or lower-reasoning configuration fails and there is enough latency headroom.
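A sketch of that escalation policy (the tier names, latency estimates, and `run` callback below are hypothetical illustrations, not Tama's actual implementation):

```python
from dataclasses import dataclass

@dataclass
class Tier:
    """One model configuration, ordered from smallest/fastest to largest."""
    name: str
    expected_latency_s: float

def run_with_escalation(task, tiers, latency_budget_s, run):
    """Try the smallest tier first; escalate to a bigger model only when
    the current one fails AND there is latency headroom left.
    `run(task, tier)` returns a (succeeded, result) pair."""
    spent = 0.0
    result = None
    for tier in tiers:
        if spent + tier.expected_latency_s > latency_budget_s:
            break  # no headroom for a bigger model; return best effort
        ok, result = run(task, tier)
        spent += tier.expected_latency_s
        if ok:
            return result
    return result
```

The point of the budget check is that escalation is opportunistic: a failure alone is not enough to justify a slower model if the user-facing latency target would be blown.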

Setup

The agent setup tests the following capabilities of the AI agents:

  • Routing - Can the model route the user message to the correct execution path?
  • Tool Use - Does it call tools accurately with valid structured arguments?
  • Instruction Following - Can it follow instructions in the system prompt correctly?
  • Resolving Constraints - Does it understand how to resolve constraints provided in the user's query?
  • Multi-turn Conversation - Can the model handle multi-turn conversation and still do all of the above correctly?
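In practice, the routing and tool-use checks reduce to validating the model's structured output against an expected path and argument schema. A minimal sketch (the route names, tool schema, and JSON shape here are hypothetical illustrations, not the actual test harness):

```python
import json

# Hypothetical routes and tool argument schema, for illustration only.
ROUTES = {"query-database", "create-artifact", "stream-response"}
TOOL_ARGS = {"query_database": {"table", "filters"}}

def check_turn(model_output: str) -> list[str]:
    """Score one turn: the model must emit JSON with a known route and,
    if it calls a tool, arguments that match the tool's schema."""
    try:
        msg = json.loads(model_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    failures = []
    if msg.get("route") not in ROUTES:
        failures.append(f"unknown route: {msg.get('route')!r}")
    call = msg.get("tool_call")
    if call:
        expected = TOOL_ARGS.get(call.get("name"), set())
        if set(call.get("arguments", {})) != expected:
            failures.append("tool arguments do not match schema")
    return failures

# A well-formed turn produces no failures:
ok = check_turn('{"route": "query-database", '
                '"tool_call": {"name": "query_database", '
                '"arguments": {"table": "users", "filters": {}}}}')
```

The multi-turn variant simply runs this check on every assistant turn in the conversation, so one malformed turn fails the whole workflow.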

If the model fails any of the above, the workflow fails. We then dig into the model's outputs to find the gaps. We also compare our results against those on Artificial Analysis to verify whether the published scores reflect the capabilities outlined in these tests.

Given the simplicity of this test, models benchmarked with high scores should be able to pass it without any issues.

Architecture

The following is the configured architecture of the workflow that the LLM will follow:

LLM Tasks

These tasks correspond to the numbered LLM nodes in the architecture diagram.

#  Name             Description
1  routing          Classify the user request and route it to the correct workflow path.
2  query-database   Generate a valid database query from the request and available context.
3  routing          Route the chain output to the response stage that will produce user-facing output.
4  create-artifact  Build a structured artifact from results so the UI can render a useful view.
5  stream-response  Compose and stream the final response to the user with the generated artifact.
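The five tasks above chain into a simple pipeline. A sketch, where `llm(task, payload)` stands in for one model call per numbered node (the function names and database placeholder are ours, not Tama's API):

```python
def run_workflow(user_message: str, llm) -> str:
    """Chain the five LLM tasks from the architecture diagram.
    `llm(task, payload)` performs one model call for the named node."""
    path = llm("routing", user_message)              # 1. pick the execution path
    if path == "query-database":
        query = llm("query-database", user_message)  # 2. build a database query
        results = execute_query(query)               #    (non-LLM step)
        stage = llm("routing", results)              # 3. route the chain output
        artifact = llm("create-artifact", results)   # 4. structure results for the UI
        return llm("stream-response",                # 5. stream the final answer
                   {"artifact": artifact, "stage": stage})
    return llm("stream-response", user_message)

def execute_query(query):
    # Placeholder: in the real workflow this would hit the database.
    return {"rows": [], "query": query}
```

Note that a single bad hop (for example, a wrong route at node 1 or malformed arguments at node 2) breaks every stage downstream, which is why the workflow is scored pass/fail per turn.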

Results

Summary

Key findings and results summary in a single rankings table.

Output Score Guide

Rating Range   Remark
9.1 - 10.0     Likely works well for multi-turn agentic chat.
8.1 - 9.0      Can partially work, or be made to work well with some prompt engineering.
0.0 - 8.0      Should not be used for agentic systems.

Latency Feel Guide

Rating Range   Remark
8.1 - 10.0     Feels fast and pleasant to use.
6.1 - 8.0      Acceptable, but slightly slow.
0.0 - 6.0      Almost unusable.
Model Test

#   Model                   Overall  Output  Inference Provider  Latency  Turns Scored         Quality
1   Mistral Suite           9.7      9.7     Mistral             9.7      3 output, 3 latency  97%
2   Minimax 2.5             4.1      5.3     Fireworks AI        1.3      2 output, 2 latency  41%
3   Gemini 3                9.2      10.0    Google AI Studio    7.3      3 output, 3 latency  92%
4   Gemini 3.1 Flash Lite   9.6      10.0    Google AI Studio    8.7      3 output, 3 latency  96%
5   GLM 5                   8.7      10.0    Fireworks AI        5.7      3 output, 3 latency  87%
6   Kimi K2.5               7.7      8.7     Fireworks AI        5.3      3 output, 3 latency  77%
7   Grok 4.1 Fast           3.4      3.3     xAI                 3.7      2 output, 2 latency  34%
8   Grok Code Fast 1 Suite  8.5      10.0    xAI                 5.0      3 output, 3 latency  85%
9   GPT OSS 120b            6.5      6.7     Groq                6.0      2 output, 2 latency  65%
10  GPT 5 Mini              8.9      10.0    OpenAI              6.3      3 output, 3 latency  89%
11  Nemotron 3 Super 120b   2.2      2.0     Nebius              2.7      1 output, 1 latency  22%
12  Qwen 3.5                7.8      10.0    Together AI         2.7      3 output, 3 latency  78%
13  DeepSeek 3.2            4.9      6.0     Fireworks AI        2.3      2 output, 2 latency  49%
14  Mistral Small 4         9.4      9.3     Mistral             9.7      3 output, 3 latency  94%

(Overall, Output, and Latency are scores out of 10.)
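One consistency check worth noting: every overall score in the rankings matches a 0.7/0.3 weighted average of the output and latency scores, rounded to one decimal, and the quality percentage is the overall score times ten. The article does not state this formula anywhere; it is inferred from the published numbers, so treat the weights as an observation rather than documented methodology:

```python
def overall_score(output: float, latency: float,
                  w_output: float = 0.7, w_latency: float = 0.3) -> float:
    """Weighted overall score. The 0.7/0.3 split is inferred from the
    published table, not stated by the article."""
    return round(w_output * output + w_latency * latency, 1)

def quality_pct(overall: float) -> int:
    """The quality column appears to be the overall score as a percentage."""
    return round(overall * 10)

# Spot checks against table rows:
print(overall_score(10.0, 7.3))  # Gemini 3 -> 9.2
print(overall_score(8.7, 5.3))   # Kimi K2.5 -> 7.7
print(quality_pct(9.2))          # -> 92
```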
