Happy Friday! As usual if you don’t care about my LLM ramblings I’ve included a brief life update at the bottom of this article.
For the last year, I've dedicated a serious amount of time to building multi-step LLM agents at Lookbk and for other projects around search, chat, and other workflows where non-linear, non-deterministic processes thrive.
Orchestration is one of the most "magic-based" processes still left in building LLM-based systems. Despite all the frameworks out there to build agents, I've found that the easiest method is just to create your own structure because every task is so unique.
I define "agents" as a single LLM function with a unique system prompt and set of tools defined for a specific purpose. “Agentic” workflows consist of multiple agents compiled together.
Common patterns I use (or have seen used) for building agents break into four main categories:
Sequential Processing
Process each step in a predefined order - closest to classical programming but using LLMs as evaluators instead of static if statements
Routing
A single LLM or traditional classifier makes a decision about which agent to use.
Orchestrator-Worker
A primary LLM coordinates various specialized worker agents—similar to routing, but orchestrators manage overall context while delegating specific instructions to multiple sub-agents.
Evaluator-Optimizer
A dedicated optimization or evaluation step is added in which an agent assesses intermediary results before deciding whether to loop the workflow or continue.
While the evaluator-optimizer is looped by default, most systems add looping to the first three categories as well - often with some sort of evaluation system as an intermediary step, or a tool call that signals when to "break" the loop.
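To make these concrete, here's a minimal sketch of the routing and evaluator-optimizer patterns. `call_llm` is a hypothetical stand-in for whichever provider SDK you use, and the agent names are purely illustrative:

```python
# Minimal sketch of the routing and evaluator-optimizer patterns.
# `call_llm` is a hypothetical wrapper around your LLM provider of choice.

def call_llm(system_prompt: str, user_input: str) -> str:
    """Stand-in for your provider SDK (OpenAI, Anthropic, etc.)."""
    raise NotImplementedError

AGENTS = {
    "search": lambda x: call_llm("You are a search agent.", x),
    "edit": lambda x: call_llm("You are an editing agent.", x),
    "question": lambda x: call_llm("You are a Q&A agent.", x),
}

def route(user_input: str) -> str:
    """Routing: one classifier call decides which specialized agent runs."""
    label = call_llm(
        "Classify the request as one of: search, edit, question. Reply with one word.",
        user_input,
    ).strip().lower()
    return AGENTS.get(label, AGENTS["question"])(user_input)  # safe fallback

def evaluate_and_optimize(task: str, max_loops: int = 3) -> str:
    """Evaluator-optimizer: loop until an evaluator approves, with a hard cap."""
    draft = call_llm("You are a worker agent. Complete the task.", task)
    for _ in range(max_loops):
        verdict = call_llm(
            "You are an evaluator. Reply PASS if acceptable, else give feedback.",
            f"Task: {task}\n\nDraft: {draft}",
        )
        if verdict.strip().upper().startswith("PASS"):
            break  # the evaluator is what "breaks" the loop
        draft = call_llm(
            "Revise the draft to address the evaluator's feedback.",
            f"Task: {task}\n\nDraft: {draft}\n\nFeedback: {verdict}",
        )
    return draft
```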
While these are great building blocks, you still face several core problems:
If your agent is user facing, how does it interact with the user?
Does each individual agent interact with the user, do you have a user interaction layer above, or is there some other pattern you use to split "thinking time" and response time for users?
How much visibility should the user have into the agent lifecycle (tool calls, handoffs, iterations, etc.)?
How do you manage data and validate state throughout an agent lifecycle?
Do you use tools to create properly formatted data components, then collate them at the end?
How is data passed into system prompts and chats to ensure each agent has access to the same data?
If running agents or tools in parallel, how do you handle concurrency and CRUD conflicts?
How do you evaluate and determine which pattern(s) work best for your specific use-case?
Do you care about the cleanliness of intermediary steps (which often correlates with speed), or only the raw output?
While there are infinite variations on the above, after lots of experimenting, evaluating, and pain, I've come to the following process to determine the right makeup of an agentic system.
Here are some concepts I've found helpful for distinguishing and building AI systems.
Decide customer-facing vs. internal before building
This is one of the prime determinants of how you build the system. At Lookbk, we use two different agents: one that's customer-facing and handles chat (mostly low-latency operations that queue events), and another internal agent that processes the actual search.
When building customer-facing agents, speed (or the appearance of speed) is critical. Key elements include: streaming responses, showing handoffs when using sub-agents, implementing strong error handling (so users can see when things go wrong), and setting hard limits on recursion depth.
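As a rough sketch of what that looks like in practice (assuming hypothetical `stream_llm` and `run_tool` helpers), the customer-facing loop streams tokens immediately, surfaces handoffs as readable status lines, and hard-caps recursion depth:

```python
# Sketch of a customer-facing agent loop: stream tokens immediately,
# surface tool handoffs to the user, and enforce a hard recursion limit.
from typing import AsyncIterator

MAX_DEPTH = 5  # hard limit so a confused agent can't loop forever

async def stream_llm(messages: list[dict]) -> AsyncIterator[dict]:
    """Hypothetical streaming wrapper around your provider's SDK."""
    raise NotImplementedError
    yield {}  # marks this as an async generator

async def run_tool(name: str, args: dict) -> str:
    """Hypothetical tool dispatcher."""
    raise NotImplementedError

async def respond(messages: list[dict], depth: int = 0) -> AsyncIterator[dict]:
    if depth >= MAX_DEPTH:
        # strong error handling: tell the user instead of silently hanging
        yield {"type": "error", "text": "I got stuck - mind rephrasing that?"}
        return
    async for event in stream_llm(messages):
        if event["type"] == "token":
            yield {"type": "token", "text": event["text"]}  # stream immediately
        elif event["type"] == "tool_call":
            # show the handoff so the wait feels intentional, not broken
            yield {"type": "status", "text": f"Working on it ({event['name']})..."}
            result = await run_tool(event["name"], event["args"])
            messages.append({"role": "tool", "content": result})
            async for downstream in respond(messages, depth + 1):
                yield downstream
```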
Internal agents tend to focus on self-recovery, traceability, and a strong evaluator that can handle task completion autonomously. I've found it helpful to give these agents lots of tools (and lots of sub-agents) so they can take the time to figure out issues on their own, taking as many recursive loops as needed.
Keep Agents Stateless
I've experimented with almost every type of data / state management option there is for agents - from internal context tracking to external data mutation and more. The best approach I've found is to have tools act as "validators" when mutating data components, with your client handling database synchronization through POST requests or direct DB interaction.
For example, in a form builder you may give different agents tools to edit / add components, steps, and routing logic. Each of these tools should have parameters that outline the data schema - e.g. `component_id`, `component_type`, etc. - and return the validated data.
To construct your full form schema, you can either parse the streams as you receive them server-side and emit them as "state" events of the form configuration, or rely on your client to parse them. Regardless of your approach, I've found it's always easier to manage persistence of your form state on the frontend, so it's visually tied to what the user is seeing.
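Here's a hedged sketch of the validator idea using pydantic - the component types and fields are illustrative, not a real schema:

```python
# Sketch of a "tool as validator" for the form-builder example.
# The component types and fields here are made up for illustration.
from typing import Literal
from pydantic import BaseModel, ValidationError

class Component(BaseModel):
    component_id: str
    component_type: Literal["text", "select", "checkbox"]
    label: str
    required: bool = False

def edit_component(args: dict) -> dict:
    """Tool handler: validate the agent's arguments before any mutation."""
    try:
        component = Component(**args)
    except ValidationError as e:
        # Returning the error lets the agent retry with corrected arguments
        return {"ok": False, "error": str(e)}
    # Hand the validated data back; the client persists it (POST / DB write)
    return {"ok": True, "component": component.model_dump()}
```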
Test early and often
Test-driven development is instrumental in evaluating architecture types, deciding which (and how many) tools to provide, and measuring model performance more scientifically and accurately. Often, I'll build end-to-end tests before writing any implementation code.
I tend to write two types of tests: end-to-end tests and tool tests.
End-to-end tests focus on user flows. Start by thinking through every possible type of conversation a user can have (including edge cases, conversations without tool calls, questions, etc.), then have an LLM generate 5-6 variations of each conversation. If the output is some data mutation, I'll test the raw tool outputs - ensuring the number of tool calls, general state, etc. When testing conversational responses, you can check for keywords or semantic meaning (embed and calculate cosine similarity between the actual and expected responses).
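For the semantic check, a minimal sketch looks like the following - `embed` is a hypothetical wrapper around your embedding provider, and the 0.8 threshold is a starting point to tune, not a rule:

```python
# Sketch of a semantic assertion for conversational responses.
import math

def embed(text: str) -> list[float]:
    """Hypothetical call to your embedding provider."""
    raise NotImplementedError

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def assert_semantically_close(actual: str, expected: str, threshold: float = 0.8) -> None:
    score = cosine_similarity(embed(actual), embed(expected))
    assert score >= threshold, f"response drifted: similarity {score:.2f} < {threshold}"
```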
Tool tests focus on edits / mutations to existing data, targeting the calls and edge cases of a single tool within a single sub-agent. These help you figure out your tool structure early and determine how to split up your agents.
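A tool test in this style might look like the sketch below - `run_agent` and the specific assertions are hypothetical, but the shape (assert on raw tool calls, not prose) is the point:

```python
# Sketch of a tool test: drive one sub-agent through a single edge case
# and assert on the raw tool calls. `run_agent` is a hypothetical runner.
def run_agent(prompt: str) -> list[dict]:
    """Hypothetical: runs the sub-agent and returns its tool calls."""
    raise NotImplementedError

def test_edit_targets_existing_component():
    calls = run_agent("Change the email field's label to 'Work email'")
    edits = [c for c in calls if c["tool"] == "edit_component"]
    assert len(edits) == 1, "expected exactly one mutation"
    assert edits[0]["args"]["component_id"] == "email"  # edit, don't re-create
```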
Treat each pattern as a “step” in the process
The common patterns I cited at the beginning of this article are simply building blocks and will often run in parallel. You may have a few sequential steps that process required information, then an evaluator-driven loop that runs multiple optional parallel sequences to ensure all operations are run in the correct order.
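Sketched out with hypothetical `gather_context`, `plan`, `run_sequence`, and `evaluate` steps, that composition might look like:

```python
# Sketch of composing patterns as steps: two sequential stages feeding an
# evaluator-gated loop that fans optional sequences out in parallel.
import asyncio

async def gather_context(task: str) -> dict:
    raise NotImplementedError  # hypothetical sequential step

async def plan(task: str, context: dict) -> list[dict]:
    raise NotImplementedError  # hypothetical sequential step

async def run_sequence(step: dict, context: dict) -> dict:
    raise NotImplementedError  # hypothetical parallel worker

async def evaluate(task: str, results: list[dict]) -> dict:
    raise NotImplementedError  # hypothetical evaluator

async def run_workflow(task: str, max_loops: int = 3) -> dict:
    context = await gather_context(task)  # sequential: required info first
    steps = await plan(task, context)
    results: list[dict] = []
    for _ in range(max_loops):  # evaluator-driven loop, hard-capped
        results = await asyncio.gather(*(run_sequence(s, context) for s in steps))
        verdict = await evaluate(task, results)
        if verdict["done"]:
            return verdict
        steps = verdict["retry_steps"]  # the evaluator picks what to redo
    return {"done": False, "results": results}
```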
Stream agent progress to the client
As I mentioned before, I prefer to handle data persistence on the client so it can act as the ultimate state manager tied to the UI. This means I'll often stream back intermediary responses as the agent operates - responses that may be slightly removed from the agent's actual inner workings, but that act as user-facing states with easy-to-read descriptions you can show in the chat.
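One way to do this (a sketch - the event shapes and labels are made up) is a thin mapping layer that converts internal agent events into user-facing state events, emitted over SSE or a websocket:

```python
# Sketch of mapping internal agent events to user-facing state events.
import json

STATUS_LABELS = {
    "search_products": "Searching the catalog...",
    "edit_component": "Updating your form...",
}

def to_client_event(internal: dict) -> str | None:
    """Convert an internal event to an SSE frame, or None to keep it hidden."""
    if internal["type"] == "token":
        payload = {"type": "token", "text": internal["text"]}
    elif internal["type"] == "tool_call" and internal["name"] in STATUS_LABELS:
        payload = {"type": "status", "text": STATUS_LABELS[internal["name"]]}
    else:
        return None  # unmapped internals stay invisible to the user
    return f"data: {json.dumps(payload)}\n\n"
```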
Build developer tooling around testing on live data
My favorite tool at Lookbk was a test utility that could pull in live conversations with an AI agent (tool calls and all), cache them locally, and construct a test suite around the response and various tool calls. We did this for both our client-facing and backend search agents.
This is partially why building thorough tracing, internal visibility, and separation of concerns is critical: as you scale these systems in production and inevitably catch edge cases, you want to make it as fast as possible to reproduce bugs, adjust prompts / infrastructure, and push fixes to production.
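A hedged sketch of the core idea - fetch a traced conversation once, cache it locally, and replay it as a fixture - might look like this, with `fetch_trace` standing in for whatever your tracing backend exposes:

```python
# Sketch of caching live conversations as local test fixtures.
import json
from pathlib import Path

CACHE_DIR = Path("tests/fixtures/conversations")

def fetch_trace(conversation_id: str) -> dict:
    """Hypothetical: pull messages, tool calls, and results from tracing."""
    raise NotImplementedError

def load_conversation(conversation_id: str) -> dict:
    """Fetch once, then replay from the local cache on every later run."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path = CACHE_DIR / f"{conversation_id}.json"
    if path.exists():
        return json.loads(path.read_text())
    trace = fetch_trace(conversation_id)
    path.write_text(json.dumps(trace, indent=2))
    return trace
```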
Test at least 4-5 models for each LLM call
Once you have a base testing suite, you can automate (or manually perform) model swaps - you may be surprised by the differences in performance, especially on a task-by-task basis.
For example, we found o3 and other reasoning models terrible at tool use but great at orchestration, while open-source models were cheap and fast for small changes (think JSON rewrites) but not great at tool use.
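With a base suite in place, the swap can be as simple as a parametrized test - the model list and `run_agent` runner below are illustrative:

```python
# Sketch of sweeping several models over the same test with pytest.
import pytest

MODELS = ["gpt-4o", "o3-mini", "claude-sonnet-4", "llama-3.1-70b"]  # examples

def run_agent(model: str, prompt: str) -> dict:
    """Hypothetical runner that executes the agent with a given model."""
    raise NotImplementedError

@pytest.mark.parametrize("model", MODELS)
def test_json_rewrite(model: str):
    result = run_agent(model, "Rename the 'email' field to 'contact_email'")
    assert any(c["tool"] == "edit_component" for c in result["tool_calls"])
    assert "contact_email" in result["state"]["fields"]
```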
Every AI system will be different and there’s lots of performance to squeeze out of this final layer.
Building effective multi-step LLM agents requires thoughtful orchestration and a deep understanding of your use case. The patterns I've outlined—Sequential Processing, Routing, Orchestrator-Worker, and Evaluator-Optimizer—are building blocks that can be combined and adapted to your needs.
Key takeaways:
Choose architecture based on whether agents are customer-facing or internal
Keep agents stateless and handle persistence separately
Invest heavily in testing infrastructure early
Stream responses for better user experience
Test multiple models for each component
While frameworks exist, creating your own structure often yields the best results because every task has unique requirements. The "magic" lies not in any single pattern, but in understanding how to compose these patterns effectively for your specific use case.
Every day feels like heat training when it's 78 degrees and humid at 7 AM, but I've been putting in some mileage - thinking about doing a cross country 5k as my first race of the season in September.
In case you haven't seen, I've been quite active on TikTok / Instagram - roast my content / reply pls :)
Discovered my newest food obsession - rice cooker chicken. It's super simple - rice, chicken stock, ginger, garlic, scallions, and pre-salted / seasoned chicken thighs - and I've been eating it for almost every meal this week.
Happy Friday and hope you all have a wonderful weekend.
