When you design a system, especially a distributed one, you eventually run into the question: should this be choreographed or orchestrated?
Most engineers hear those words in passing and nod along. They sound abstract and academic. But your choice here defines how your system behaves in production, how easy it is to debug, and how flexible it will be as your product evolves.
This article is for anyone who’s had to glue services together, add retry logic, or figure out how to run workflows offline on a plane. I’m going to walk through orchestration and choreography from first principles, with real-world tooling and advice. I’ll also link back to a previous post I wrote about making things easy to delete — a topic that overlaps more than you might think.
Orchestration and Choreography: Two Ways to Get Work Done
Choreography is when each part of the system listens for signals and acts when it sees something it cares about. There’s no central authority. Think Kafka, EventBridge, or a queue of events where services just do their part.
Orchestration is when a central brain tells each part of the system what to do, when and how. It knows the steps, the order and who does what. Think Temporal, AWS Step Functions, Airflow, or Hatchet.
Orchestration is easier to debug. Choreography is more flexible. Orchestration makes change feel safe. Choreography lets you scale without asking permission.
There’s no silver bullet. Most production systems use both.
It’s common to hear an engineer say, “we need a message-based architecture” and then go all-in on choreography. Kafka enters the picture and suddenly everything looks like an event. The hammer is here and every problem becomes a nail. But reality catches up. You need to coordinate workflows. You need consistent error handling and retries. You need to understand what step failed and why. That’s when orchestration reenters the conversation. The best systems aren’t dogmatic — they mix both approaches where they make sense.
When to Use Which
Use orchestration when:
- You need to guarantee the order of steps
- You want to retry failed tasks in a structured way
- You want a clear picture of what’s happening and where things are stuck
Use choreography when:
- You want each service to evolve independently
- You’re building something inherently event-driven
- You want new components to “just work” by listening to existing events
An e-commerce checkout is often orchestrated. A logging or metrics pipeline is often choreographed.
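To make the contrast concrete, here is a toy sketch in plain Python, not tied to any workflow engine; the step and event names are hypothetical. The orchestrated version knows the whole flow; the choreographed version only knows which handler reacts to which event.

```python
# Toy illustration of the two styles. The step functions are
# hypothetical placeholders, not a real checkout implementation.

def reserve_inventory(order):
    order["reserved"] = True
    return order

def charge_payment(order):
    order["charged"] = True
    return order

def send_confirmation(order):
    order["confirmed"] = True
    return order

# Orchestration: one function knows the steps and their order.
def checkout_orchestrated(order):
    # The "central brain": explicit ordering, one place to add
    # retries and error handling.
    for step in (reserve_inventory, charge_payment, send_confirmation):
        order = step(order)
    return order

# Choreography: services react to events, with no central brain.
HANDLERS = {
    "order_placed": [reserve_inventory],
    "inventory_reserved": [charge_payment],
    "payment_charged": [send_confirmation],
}

def emit(event, order):
    # Each handler does its part when it sees an event it cares about.
    for handler in HANDLERS.get(event, []):
        order = handler(order)
    return order
```

Notice that adding a fourth step to the orchestrated flow means editing the central function, while in the choreographed flow a new service just subscribes to an existing event.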
Circuit Breakers and Reliability
No matter which you choose, parts of your system will fail. Circuit breakers are how you stop one failure from becoming ten.
In orchestration, this means catching exceptions and marking workflows as failed (or retryable). In choreography, it means not blindly processing every event forever.
Tools like Resilience4j (Java), pybreaker (Python) and Envoy/Istio at the network layer help.
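As an illustration, here is a minimal hand-rolled circuit breaker in Python. It is a sketch of the core idea, not a substitute for Resilience4j or pybreaker, and the thresholds are arbitrary:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures,
    then reject calls until a cooldown period has passed."""

    def __init__(self, fail_max=3, reset_timeout=30.0):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of hammering a broken dependency.
                raise RuntimeError("circuit open: call rejected")
            # Cooldown elapsed: half-open, allow one trial call through.
            self.opened_at = None
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.fail_max:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

The real libraries add half-open probing policies, metrics, and thread safety, but the state machine is essentially this.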
But circuit breakers are only useful if you can trace the failure. So let’s talk observability.
Tracing and Debugging
When you have a request flowing through dozens of services, logs aren’t enough.
OpenTelemetry gives you vendor-neutral hooks. Plug it into Datadog, Honeycomb, or Jaeger. Add a trace ID at the edge of your system and keep it in context through every function and message.
That said, you don’t have to root everything in OpenTelemetry. Tools like Datadog can handle tracing natively, often with less setup and without the additional overhead of integrating the full OpenTelemetry stack. Especially for smaller teams or legacy systems, native tracing can be faster to roll out and easier to maintain.
When debugging:
- Start with the trace view
- Correlate to logs by trace ID
- Drill down into timings and errors
You should also design your systems with this in mind. A trace ID should be passed in every task, every queue, every DB write. If you’re not doing that, you’re flying blind.
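One lightweight way to do this in Python, shown here as a sketch rather than a full OpenTelemetry setup, is to carry the trace ID in a contextvars slot and attach it to every log line and outgoing message:

```python
import contextvars
import uuid

# A context-local slot for the current trace ID. In a real system
# OpenTelemetry's context propagation does this for you; this is
# just the core idea.
trace_id_var = contextvars.ContextVar("trace_id", default=None)

def start_trace(incoming_id=None):
    # At the edge of the system: accept an incoming ID or mint one.
    tid = incoming_id or uuid.uuid4().hex
    trace_id_var.set(tid)
    return tid

def log(message):
    # Every log line carries the trace ID so it can be correlated.
    print(f"trace_id={trace_id_var.get()} {message}")

def enqueue(task_payload):
    # Attach the trace ID to every queued task (and DB write).
    return {"trace_id": trace_id_var.get(), "payload": task_payload}
```

Because the ID rides along in context, handlers deep in the call stack never need an explicit trace_id parameter.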
Queue Management and Flow Control
Any system that does orchestration will eventually bump into queue management problems. You have workflows piling up, tasks waiting to be executed and workers pulling jobs at different rates.
This is where queue theory comes in. Concepts like backpressure, bounded queues and rate limiting aren’t just academic — they directly affect system stability. If your orchestrator pulls too many tasks too quickly, downstream systems collapse. If your queues back up without alerting, you won’t find out until something breaks.
You don’t need to go deep into math to see the impact. If the task rate in exceeds the processing rate out, the queue grows. If the queue grows too long, you need to decide: drop work, alert someone, or scale up.
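That decision point can be demonstrated with a bounded queue. This is an illustrative sketch using Python's standard library; the queue size is arbitrary:

```python
import queue

# A bounded queue: once it's full, the producer must decide what to
# do (block, drop, alert, or scale) instead of growing without bound.
jobs = queue.Queue(maxsize=3)

def try_submit(job):
    try:
        jobs.put_nowait(job)
        return "accepted"
    except queue.Full:
        # Backpressure: refuse the work rather than letting the
        # backlog grow silently. A real system might alert here.
        return "rejected"
```

With an unbounded queue the same producer would "succeed" every time, and you would only discover the backlog when latency or memory blew up.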
Orchestration systems like Temporal or Hatchet can hide some of this with built-in rate control and retries, but understanding the flow at a high level is still critical.
This article won’t go deep into queue theory, but we’ll come back to it in a future post. For now, know that orchestration means you’re in the business of flow control.
Event Replay and Time Travel
Sometimes a system needs to rewind. Maybe to reprocess bad data, or because you found a bug.
In orchestration systems like Temporal, this is built-in. You can reset a workflow and replay it from a point in time.
In choreography, you’ll need to record events (Kafka, NATS, etc.) and manage offsets yourself. This adds overhead but gives you raw power. Some teams even store events forever just for this reason.
If you want to make your system easy to debug, test and reprocess, build with replay in mind.
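The mechanics can be sketched with an in-memory event log and consumer offsets. Real systems persist this durably (Kafka retains a partitioned log; you track offsets per consumer group), but the idea is the same:

```python
# Toy append-only event log with replay from an offset. Purely
# illustrative; real logs are durable and partitioned.
events = []

def append(event):
    events.append(event)
    return len(events) - 1  # the event's offset

def replay(handler, from_offset=0):
    # Reprocess everything from a point in time, e.g. after
    # fixing a bug in the handler.
    for event in events[from_offset:]:
        handler(event)
```

Replay only works if handlers are idempotent, which is a design constraint worth accepting early.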
Local and Offline Development
You shouldn’t need cloud credentials just to run your system.
Some systems support what I call “airplane mode”:
- Celery with task_always_eager=True
- LocalStack to fake AWS services
- Kafka with a local broker
- Docker Compose to wire everything up
This isn’t about mocking. It’s about running the real thing, just locally. If your architecture requires three VPNs and five cloud accounts to test a feature, you’re going to have a bad time.
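As a sketch of the same idea without Celery, here is a hypothetical submit helper that runs tasks inline when an AIRPLANE_MODE flag is set; the function names are made up for illustration:

```python
import os

def submit(task, *args):
    # Hypothetical sketch: run tasks synchronously when offline,
    # hand them to the real broker otherwise. Celery's
    # task_always_eager setting does essentially this for you.
    if os.environ.get("AIRPLANE_MODE"):
        return task(*args)  # run inline, no broker needed
    return enqueue_remote(task, *args)

def enqueue_remote(task, *args):
    # Placeholder for the real enqueue path (e.g. Celery's .delay()).
    raise RuntimeError("no broker available offline")
```

The point is that it is the real task code running, just without the broker in between.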
Developer experience matters. Offline systems are easier to understand, test and change. And this isn’t just about convenience — in many companies, the biggest line item is engineering time. Every hour lost to broken tooling, confusing environments, or context switching is money burned. Improving developer experience means less time spent on toil and more time building actual value (future article?).
Returning Results from Background Tasks
Offloading work to the background is one of the best ways to speed up a request/response cycle. You don’t need to generate a PDF or send an email before you respond to a user.
But sometimes you do want to show them the result. That’s where WebSockets and Server-Sent Events (SSE) come in.
WebSockets let you push updates in real time. SSE is simpler and uses plain HTTP, but only supports one-way communication. SSE has also been around for a long time and usually requires less code and less setup to get working reliably. You don’t need to worry about connection negotiation or complex protocol handling — just push an event over HTTP and you’re done.
From an infrastructure perspective, both approaches require careful handling of open connections. Long-lived connections increase memory usage, introduce challenges with load balancers and proxies and make retry logic on the frontend more complex. WebSockets amplify this because the bidirectional nature often means custom routing, more state to manage and higher expectations for low latency.
Use SSE when:
- You only need to push from server to client
- You’re working in a constrained environment (e.g., older browsers or server infra)
- You want the simplest working solution with minimal code
Use WebSockets when:
- You need two-way comms
- You want fast, low-latency updates
- You can afford the extra complexity and infrastructure
Both work well to tell the user when a background task finishes. They’re better than polling and keep your UI reactive. But remember that the cost isn’t just in performance — it’s in maintenance and infrastructure as well.
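SSE's simplicity shows up in the wire format itself. Here is a framework-agnostic sketch of a helper that formats one SSE frame (the function name is mine, not from any library):

```python
import json

def sse_message(data, event=None):
    """Format one Server-Sent Events frame: an optional event name,
    a data line, and a blank line that terminates the frame."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"
```

Stream these from any HTTP handler that keeps the response open with Content-Type: text/event-stream, and the browser's built-in EventSource handles parsing and reconnection for you.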
Tooling: Language Specific vs Language Agnostic
Some tools are deeply integrated with a language:
- Temporal has SDKs for Go, Java, Python and TypeScript
- Celery is Python-native
- Sidekiq is built for Ruby
- Hatchet supports Python and TypeScript with a developer-first workflow engine
Others work across stacks:
- Kafka
- NATS
- Redis Streams
Language-specific tools give you nicer DX (developer experience). Language-agnostic tools give you better system-level interoperability. Choose based on your team’s needs and your system boundaries.
How to Think About These Systems
You should design your system around change, failure and deletion.
Choreographed systems are easier to extend and harder to debug.
Orchestrated systems are easier to understand but can become bottlenecks.
Both need:
- Good traceability
- Clear retry semantics
- A way to reason about what happens when something breaks
If you keep that in mind, you’ll make better calls at the edges.
Also: make things easy to delete. If you haven’t already, read my earlier post on that topic.
Wrap-Up
You don’t have to pick orchestration or choreography for your whole system. Most systems use both.
Pick orchestration where it helps you debug and test. Pick choreography where it helps you scale and adapt.
And always keep the developer experience in mind. A system that’s painful to run locally or impossible to trace will rot faster than you expect.
Got a workflow that keeps failing? Try tracing it.
Got a slow request? Try offloading it.
Rather watch a video about how this works?
I have a great video that I refer people to for this subject — it really highlights the real-world applications of how this can all work together. Thank you Robbie Kohler!