A year ago, shipping an LLM feature felt manageable. Not easy. Not perfect. But familiar.
A slow response? You added latency metrics.
A failure? You scanned logs.
A weird output? You copied the prompt into a notebook, tweaked a few lines, and fixed it.
It wasn’t great engineering. But it worked.
Then AI agents happened. And suddenly none of your debugging instincts apply.
Your traces look complete, but they don’t explain why the agent chose that action. Your dashboards are green, but your bill looks like someone lit a budget on fire. Your logs show things happened but not whether the tool call was wrong, retrieval failed, or the model hallucinated with full confidence.
It’s not that observability stopped working.
It’s that your agent is speaking a different language than your monitoring system.
And if you’re still trying to monitor agent workflows using classic HTTP service semantics, you’re basically trying to debug a financial audit with just CPU graphs.
A service fails when a dependency fails. When a timeout triggers. When a deploy goes bad. When a request hits an edge case you didn’t test.
You get a clean failure signal and if you’re lucky, you can locate it quickly.
Agents fail differently.
Because an agent isn’t really a “service.” It’s a decision loop.
It’s non-deterministic by design. It branches. It retries. It uses tools. It consults memory. It makes judgments based on partial information. Sometimes it does something dumb. Sometimes it does something brilliant. Sometimes it does something expensive.
Most importantly: agent execution is entangled with your highest-cardinality, most sensitive data—prompts, tool arguments, retrieved documents, conversation history.
So when an agent fails, the question you’re asking changes.
It’s no longer:
Which endpoint failed?
It becomes:
Which decision went wrong and what was the model thinking when it made it?
That’s the conceptual mismatch.
Here’s an analogy that becomes painfully accurate in production.
Traditional observability for microservices is like tracking packages.
Distributed traces behave like shipping labels:
the trace_id is the shipment
each span_id is a stop along the route
service.name is the warehouse
latency is delivery time
This works because services are predictable. A trace is a map of a route, and a route tells you a lot.
But an agent is not a package.
An agent is a person running errands.
They’re walking around with a to-do list, making judgment calls:
“I’ll search the knowledge base first.”
“Then I’ll call Stripe.”
“If nothing matches, I’ll query the DB.”
“If confidence is low, I’ll escalate to a human.”
Now imagine trying to debug that person with only:
Person entered store
Person exited store
Time spent: 32 seconds
That’s what your current dashboards actually see.
They can prove that calls happened. They can show time passed. They can highlight an error. But they can’t tell you what the agent was trying to do, what it believed, which decision it made, what tool arguments it used, how much the decision cost, or whether the final output was even correct.
That’s why end-to-end tracing can feel useless for agent systems: complete in the networking sense, incoherent in the decision sense.
Let’s talk about OpenTelemetry GenAI semantic conventions.
Semantic conventions are a shared vocabulary. A grammar for telemetry. Not how you collect data but what your data means. Without shared meaning, instrumentation is performative. You can have perfect coverage and still build nothing reliable on top of it.
Think about how absurd this is:
Several teams emit spans called POST /chat.
One means “OpenAI completion.”
One means “RAG retrieval step.”
One is instrumenting a gateway proxy.
One includes retries.
One doesn’t.
Everything looks legitimate. And none of it is interoperable.
So dashboards drift. Queries lie. Teams stop trusting their own telemetry.
That’s why OpenTelemetry introduced GenAI semantic conventions: to standardize the meaning of agent and LLM operations.
If you instrument an agent correctly, instead of a trace reading like a pile of network calls, it reads like a story.
You can literally watch the agent think:
invoke_agent → chat → execute_tool(search_kb) → embeddings → chat → execute_tool(create_ticket)
With that shape:
tool errors show up as tool errors
retrieval failures show up as retrieval failures
token usage becomes attributable to a specific decision step
you can trace regressions back to an operation type, not just “some HTTP blob got slower”
It becomes debuggable as a workflow.
OpenTelemetry GenAI semantic conventions give you three layers of visibility that work together:
First, a storyline.
Spans stop being generic HTTP operations and become agent-native operations: model inference, embeddings, tool execution, agent invocation. Instead of service B called service C, you get agent invoked → model called → tool executed → model called again.
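As a rough sketch, here's what producing that shape can look like with manual instrumentation in the OpenTelemetry Python SDK. The attribute keys (gen_ai.operation.name, gen_ai.agent.name, gen_ai.request.model, gen_ai.tool.name, gen_ai.usage.*) follow the GenAI semantic conventions as currently published, but those conventions are still experimental, so verify them against the version you target; the agent, model, and tool names are placeholders.

```python
from opentelemetry import trace

# Assumes a TracerProvider is already configured at application startup.
tracer = trace.get_tracer("support-agent")  # instrumentation scope name (placeholder)

def handle_ticket(question: str) -> str:
    # Root span for the whole decision loop: the "storyline" starts here.
    with tracer.start_as_current_span(
        "invoke_agent support_agent",
        attributes={"gen_ai.operation.name": "invoke_agent",
                    "gen_ai.agent.name": "support_agent"},
    ):
        # Model inference step.
        with tracer.start_as_current_span(
            "chat gpt-4o",
            attributes={"gen_ai.operation.name": "chat",
                        "gen_ai.request.model": "gpt-4o"},
        ) as chat_span:
            # Stand-in for your LLM call; real code would parse the provider response.
            plan = {"query": question, "input_tokens": 812, "output_tokens": 64}
            chat_span.set_attribute("gen_ai.usage.input_tokens", plan["input_tokens"])
            chat_span.set_attribute("gen_ai.usage.output_tokens", plan["output_tokens"])

        # Tool execution step: failures here surface as tool errors, not HTTP noise.
        with tracer.start_as_current_span(
            "execute_tool search_kb",
            attributes={"gen_ai.operation.name": "execute_tool",
                        "gen_ai.tool.name": "search_kb"},
        ):
            results = f"kb results for: {plan['query']}"  # stand-in for your retrieval layer

        return results
```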
Second, a scoreboard.
In LLM production, cost and performance are fused. You can’t treat latency like an SRE-only metric anymore because it directly reflects UX and model behavior. And tokens aren’t usage, they’re unit economics. If you track only latency, you’ll miss the budget explosion. If you track only tokens, you’ll miss the operational regression that caused the spike.
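One way to wire up that scoreboard, sketched with the OpenTelemetry metrics API: record token counts as a histogram keyed by model, provider, operation, and token type. The metric name gen_ai.client.token.usage and the gen_ai.token.type attribute come from the current draft of the GenAI conventions; the dimension values below are placeholders.

```python
from opentelemetry import metrics

meter = metrics.get_meter("support-agent")

# Token usage histogram keyed by model, provider, operation, and token type.
token_usage = meter.create_histogram(
    name="gen_ai.client.token.usage",
    unit="{token}",
    description="Input and output tokens per GenAI operation",
)

def record_usage(model: str, provider: str, operation: str,
                 input_tokens: int, output_tokens: int) -> None:
    base = {
        "gen_ai.request.model": model,
        "gen_ai.provider.name": provider,
        "gen_ai.operation.name": operation,
    }
    # Recording input and output separately keeps unit economics queryable:
    # a spend spike becomes attributable to a model and operation,
    # not just "the bill went up".
    token_usage.record(input_tokens, {**base, "gen_ai.token.type": "input"})
    token_usage.record(output_tokens, {**base, "gen_ai.token.type": "output"})
```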
Third, an evidence box.
When you truly need to know what the model saw and said, OTel supports opt-in structured events for inference details. Not because it loves verbosity. Because span attributes have size limits, structured fields behave differently across languages, and prompt capture needs governance.
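A minimal sketch of that opt-in behavior: content capture stays off by default and, when enabled, goes into a structured event rather than span attributes. The conventions model this as log-based events, and the exact event and attribute names have shifted across spec versions, so the event name, the CAPTURE_CONTENT flag, and the environment variable below are assumptions, not spec.

```python
import os
from opentelemetry import trace

# Opt-in switch: content capture is a governance decision, not a default.
CAPTURE_CONTENT = os.getenv("OTEL_GENAI_CAPTURE_CONTENT", "false") == "true"

def record_inference_details(prompt: str, completion: str) -> None:
    if not CAPTURE_CONTENT:
        return  # default: emit no sensitive content at all
    span = trace.get_current_span()
    # Structured event on the active span; the spec models this as a log-based
    # event, which also lets a collector route or redact it separately.
    span.add_event(
        "gen_ai.inference.details",  # illustrative event name
        attributes={
            "gen_ai.prompt": prompt[:4096],          # truncate: attributes have size limits
            "gen_ai.completion": completion[:4096],
        },
    )
```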
The bigger fragmentation isn’t observability. It’s economics. Even if your traces are perfect, you’re only seeing one layer of the stack, and the missing layer is cost.
Because AI cost doesn’t live in one place.
If you zoom out, your agent spend spans:
LLM API bills (tokens, modalities, caching, tiered pricing)
infra (GPUs, clusters, gateways, networking)
data platforms (warehouses, ETL, vector DB clusters, storage, egress)
shared services and platform overhead
These costs flow through different pipelines. They’re owned by different teams. And they often don’t share a join key.
So cost attribution stays fragmented for structural reasons.
Token accounting, for example, is messier than a simple input/output split. Providers charge differently for cached tokens, reasoning tokens, audio and image tokens, and other modalities: dimensions that aren’t fully standardized yet.
Cost itself is also not universally standardized as a first-class metric in the GenAI conventions. Some gateways emit cost directly (as an extension); other platforms compute it via pricing tables. Those methods can both be correct locally and still disagree globally.
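Here’s the pricing-table approach in miniature: join token counts from telemetry against a locally maintained price sheet. The models and prices below are hypothetical; a real table has to cover cached, reasoning, and multimodal token classes and track provider repricing, which is exactly how two locally correct methods end up disagreeing globally.

```python
# Hypothetical per-million-token prices; a real table must also track cached,
# reasoning, and multimodal token classes, plus price changes over time.
PRICE_PER_MTOK = {
    ("openai", "gpt-4o"):      {"input": 2.50, "output": 10.00},
    ("anthropic", "claude-x"): {"input": 3.00, "output": 15.00},
}

def estimate_cost_usd(provider: str, model: str,
                      input_tokens: int, output_tokens: int) -> float:
    prices = PRICE_PER_MTOK[(provider, model)]
    return (input_tokens * prices["input"]
            + output_tokens * prices["output"]) / 1_000_000
```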
Meanwhile, infra cost (GPUs, clusters, gateways) flows through billing exports and allocation tooling, not GenAI spans. It lives in finance-shaped data, not tracing-shaped data.
And then there’s the part nobody wants to admit out loud:
Business attribution isn’t standardized at all.
If you don’t attach stable identifiers like tenant, use case, team, and environment, your cost conversations will devolve into politics. When cost attribution works, the experience is night-and-day. You can answer:
“Why did this feature cost $14K last month?”
in under a minute, broken down by model, provider, use case, and owning team.
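Concretely, that answer can be a groupby rather than a meeting, assuming your spans carry both gen_ai.* and business attributes and you flatten them into rows with a precomputed cost column. The schema below is a made-up example of such an export, not a standard.

```python
import pandas as pd

# Hypothetical flattened span export: one row per model call, cost precomputed.
spans = pd.DataFrame([
    {"use_case": "support_bot", "team": "cx",       "provider": "openai",
     "model": "gpt-4o",   "cost_usd": 0.042},
    {"use_case": "doc_search",  "team": "platform", "provider": "anthropic",
     "model": "claude-x", "cost_usd": 0.031},
    # ... thousands more rows from your telemetry pipeline
])

# "Why did this feature cost $14K last month?" becomes a query, not a debate.
breakdown = (spans
             .groupby(["use_case", "team", "provider", "model"])["cost_usd"]
             .sum()
             .sort_values(ascending=False))
print(breakdown)
```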
When it doesn’t, you’re in a spreadsheet war with finance, arguing about tags, ownership, and whose budget this belongs to. That is what enterprise AI cost really feels like. Because once agents hit production, organizations need answers that cut across engineering and finance:
Which model/provider actually served traffic?
What token spend looked like by use case and owner?
Which tool decisions caused failures?
Which regressions correlate with model rollouts?
Who owns the economics of this workflow?
These aren’t purely engineering questions anymore.
They’re CIO/CFO questions.
Semantic conventions are the bridge. They make telemetry legible across the org. They turn engineering traces into something leadership can trust.
If your telemetry can tell you the agent was healthy but can’t tell you why it made the wrong decision, is it really observability or just logging?
Start with four moves that pay off almost immediately:
Instrument the workflow, not just the SDK. You want traces that tell the story: agent invoke → model call → tool call → retrieval → model call.
Treat tokens as your unit cost. Track them by model, provider, operation, and business dimensions. Even if your token breakdown isn't perfect yet, input/output is enough to build attribution muscle.
Prompt capture is a security decision. Capture selectively. Redact at the collector. Prefer external storage with references when you need deep debugging in production. Don't turn sensitive content into default telemetry.
Attach business context early. If you want cost attribution later, you need use case, tenant, team, and environment now. Otherwise every cost conversation becomes guesswork and politics.
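A minimal sketch of that last move: attach the identifiers once, on the root span of the workflow, so every downstream token and cost query can group by them. The app.* keys are illustrative conventions you’d define yourself; only deployment.environment.name echoes a general OpenTelemetry convention, and none of this is part of the GenAI spec.

```python
from opentelemetry import trace

tracer = trace.get_tracer("support-agent")

# Illustrative business-context keys you would standardize internally;
# the app.* names are not part of any spec.
BUSINESS_CONTEXT = {
    "app.tenant.id": "acme-corp",
    "app.use_case": "support_bot",
    "app.team": "cx",
    "deployment.environment.name": "production",
}

def handle_request(question: str) -> str:
    # Attach business context on the root span so every downstream token and
    # cost query can be grouped by tenant, use case, team, and environment.
    with tracer.start_as_current_span(
        "invoke_agent support_agent",
        attributes={"gen_ai.operation.name": "invoke_agent", **BUSINESS_CONTEXT},
    ):
        return "agent answer"  # stand-in for the actual decision loop
```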
