A baseline workflow that tests a model, or an ensemble of models, on its ability to navigate an agentic system. The tests cover tasks such as classification, function calling, and intent detection, along with other workflow behaviors, to evaluate overall performance quality.
Agentic systems favor smaller, faster models. We escalate to larger models or deeper reasoning only when a lower-parameter or lower-reasoning configuration fails and there is enough latency headroom.
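That escalation policy can be sketched as follows. This is a minimal illustration, not the actual harness: `run_small`, `run_large`, and the latency budget are hypothetical placeholders.

```python
import time

def run_with_escalation(task, run_small, run_large, latency_budget_s=4.0):
    """Try the smaller/faster model first; escalate to the larger model
    only on failure, and only if enough latency headroom remains."""
    start = time.monotonic()
    result = run_small(task)            # cheap, fast attempt first
    if result is not None:              # small model succeeded: done
        return result
    elapsed = time.monotonic() - start
    if elapsed < latency_budget_s:      # headroom left: escalate
        return run_large(task)
    return None                         # fail rather than blow the budget

# Usage: the small model fails (returns None), so the call escalates.
answer = run_with_escalation(
    "classify this message",
    run_small=lambda t: None,
    run_large=lambda t: "route:billing",
)
```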
Setup
The setup for the agent tests the following capabilities of the AI agents:
- Routing - Can the model route the user message to the correct execution path?
- Tool Use - Does it call tools accurately with valid structured arguments?
- Instruction Following - Can it follow instructions in the system prompt correctly?
- Resolving Constraints - Does it understand how to resolve constraints provided in the user's query?
- Multi-turn Conversation - Can the model handle multi-turn conversation and still do all of the above correctly?
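As an illustration of the Tool Use check, a minimal validator for structured tool-call arguments might look like the sketch below. The schema and tool shape are hypothetical examples, not the harness's real tools.

```python
import json

# Hypothetical argument schema for a single tool: name -> required type.
QUERY_DB_SCHEMA = {"table": str, "filters": dict}

def valid_tool_call(raw: str, schema=QUERY_DB_SCHEMA) -> bool:
    """Check that a model's tool call is valid JSON carrying the
    required argument names with the right types (the Tool Use criterion)."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError:
        return False                    # not even parseable JSON
    return all(
        key in args and isinstance(args[key], typ)
        for key, typ in schema.items()
    )

good = valid_tool_call('{"table": "orders", "filters": {"status": "open"}}')
bad = valid_tool_call('{"table": "orders"}')  # missing "filters"
```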
If the model fails any of the above, the workflow fails. We dig into the model's outputs to find the gaps, and we compare our results against those on Artificial Analysis to verify whether published scores reflect the capabilities outlined in these tests.
Given the simplicity of this test, models that benchmark with high scores should pass it without issues.
Architecture
The following is the configured architecture of the workflow that the LLM will follow:
LLM Tasks
These tasks correspond to the numbered LLM nodes in the architecture diagram.
| # | Name | Description |
|---|---|---|
| 1 | routing | Classify the user request and route it to the correct workflow path. |
| 2 | query-database | Generate a valid database query from the request and available context. |
| 3 | routing | Route the chain output to the response stage that will produce user-facing output. |
| 4 | create-artifact | Build a structured artifact from results so the UI can render a useful view. |
| 5 | stream-response | Compose and stream the final response to the user with the generated artifact. |
Results
Summary
The key findings and results are summarized in a single rankings table.
Output Score Guide
| Range | Remark |
|---|---|
| 9.1 - 10.0 | Likely works well for multi-turn agentic chat. |
| 8.1 - 9.0 | Can partially work, or be made to work well with some prompt engineering. |
| 0.0 - 8.0 | Should not be used for agentic systems. |
Latency Feel Guide
| Range | Remark |
|---|---|
| 8.1 - 10.0 | Feels fast and pleasant to use. |
| 6.1 - 8.0 | Acceptable, but slightly slow. |
| 0.0 - 6.0 | Almost unusable. |
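The two guides above amount to simple threshold lookups; a sketch (the function names are ours, not part of the harness):

```python
def output_rating(score: float) -> str:
    """Map an output score (0-10) to the Output Score Guide's remark."""
    if score > 9.0:
        return "Likely works well for multi-turn agentic chat."
    if score > 8.0:
        return "Can partially work, or be made to work well with some prompt engineering."
    return "Should not be used for agentic systems."

def latency_rating(score: float) -> str:
    """Map a latency score (0-10) to the Latency Feel Guide's remark."""
    if score > 8.0:
        return "Feels fast and pleasant to use."
    if score > 6.0:
        return "Acceptable, but slightly slow."
    return "Almost unusable."
```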
| # | Model | Overall Score | Output Score | Inference Provider | Latency Score | Turns Scored | Quality |
|---|---|---|---|---|---|---|---|
| 1 | Mistral Suite | 9.7 | 9.7 | Mistral | 9.7 | 3 output, 3 latency | 97% |
| 2 | Minimax 2.5 | 4.1 | 5.3 | Fireworks AI | 1.3 | 2 output, 2 latency | 41% |
| 3 | Gemini 3 | 9.2 | 10.0 | Google AI Studio | 7.3 | 3 output, 3 latency | 92% |
| 4 | Gemini 3.1 | 9.6 | 10.0 | Google AI Studio | 8.7 | 3 output, 3 latency | 96% |
| 5 | GLM 5 | 8.7 | 10.0 | Fireworks AI | 5.7 | 3 output, 3 latency | 87% |
| 6 | Kimi K2.5 | 7.7 | 8.7 | Fireworks AI | 5.3 | 3 output, 3 latency | 77% |
| 7 | Grok 4.1 Fast | 3.4 | 3.3 | xAI | 3.7 | 2 output, 2 latency | 34% |
| 8 | Grok Code Fast 1 Suite | 8.5 | 10.0 | xAI | 5.0 | 3 output, 3 latency | 85% |
| 9 | GPT OSS 120b | 6.5 | 6.7 | Groq | 6.0 | 2 output, 2 latency | 65% |
| 10 | GPT 5 Mini | 8.9 | 10.0 | OpenAI | 6.3 | 3 output, 3 latency | 89% |
| 11 | Nemotron 3 Super 120b | 2.2 | 2.0 | Nebius | 2.7 | 1 output, 1 latency | 22% |
| 12 | Qwen 3.5 | 7.8 | 10.0 | Together AI | 2.7 | 3 output, 3 latency | 78% |
| 13 | DeepSeek 3.2 | 4.9 | 6.0 | Fireworks AI | 2.3 | 2 output, 2 latency | 49% |
| 14 | Mistral Small 4 | 9.4 | 9.3 | Mistral | 9.7 | 3 output, 3 latency | 94% |
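Every row in the table is consistent with the Overall Score being a 70/30 weighted average of the Output and Latency scores, rounded to one decimal, with Quality equal to the Overall Score expressed as a percentage. This is our inference from the rows, not a formula stated by the harness:

```python
def overall(output_score: float, latency_score: float) -> float:
    """Reconstructed weighting: each table row matches
    0.7 * output + 0.3 * latency, rounded to one decimal."""
    return round(0.7 * output_score + 0.3 * latency_score, 1)

# Spot-checks against rows from the table above:
# (output, latency, published overall)
rows = [
    (9.7, 9.7, 9.7),    # Mistral Suite
    (10.0, 7.3, 9.2),   # Gemini 3
    (3.3, 3.7, 3.4),    # Grok 4.1 Fast
    (6.0, 2.3, 4.9),    # DeepSeek 3.2
]
checks = [overall(o, l) == expected for o, l, expected in rows]
```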