AI Agents Run To Completion

The Agentic Web has two requirements that have to work simultaneously: trust and observability. Trust without observability is recklessness. Observability without trust is security theater. Most organizations are discovering they have neither, certainly not at scale, which is why agent pilots aren’t graduating to production.

The question CTOs are asking is: how do we give agents write access to production systems without creating a career-ending incident? The answer everyone seems to want is “better prompts” or “fine-tuning” or “guardrails.” But the answer that actually works is treating agent reasoning as infrastructure. So how do we build this?

Production-ready infrastructure doesn’t just ask ‘what did the agent do,’ but ‘what was it thinking when it did it.’

Google Cloud automatically enabled OpenTelemetry ingestion endpoints for all projects on Wednesday (March 4). This matters because modern observability infrastructure treats telemetry as first-class code, with automated agents using this data to perform self-healing deployments. The shift is from humans debugging what agents did to agents debugging themselves using the same telemetry infrastructure. So maybe it’s actually O11y 3.0.

Crucial to this is what your implicit model of ‘coding’ is, architecturally. Consider GitHub Agents, the somewhat anticipated product released in late January by what is presumptively a top-notch product engineering team, but which treats coding as a series of chat sessions rather than as a distributed systems problem, as Spotify’s Honk, for example, does. (Protip: you can’t solve a layer 2 problem with a layer 1 tool.)

OpenHands (formerly OpenDevin) hit v1.3.0 in February with full support for Agent Client Protocol (ACP), making it compatible with almost every modern IDE and CI/CD pipeline. The project has arguably become the most important open-source initiative in software engineering for 2026. Unlike closed agents where reasoning is opaque, you can see the “Thought Content” in OpenHands. The reasoning chain is visible, traceable, auditable. When an agent makes a decision, you can see why. When it fails, you can debug the reasoning.

This visible reasoning chain points to something bigger: when you pipe an agent’s chain into a modern traceability pipeline (Honeycomb, Chronosphere, Grafana Cloud), you treat AI reasoning like a first-class execution trace. Not as metadata or logs, but as spans in a distributed trace where you can see the decision chain that led to every action. You no longer use system-level events as a proxy to infer an explanation.

This is what production-ready agentic infrastructure looks like. When an agent makes a decision at 2am that takes down a service, you don’t reconstruct what happened from logs. You have the full reasoning trace showing: what context it had, what options it considered, why it chose what it did, what it expected to happen. The same infrastructure you use to debug distributed systems now debugs distributed intelligence.

The architectural requirement is straightforward, but most organizations haven’t built it: every agent action must generate a trace that captures both execution and reasoning. Thought Content provides a reasoning chain that can be captured via the OpenHands event stream and exported to OpenTelemetry as custom span attributes or logs. Without this, you’re deploying black boxes to production and hoping nothing breaks in ways you can’t explain to compliance. Which is to say, hoping nothing breaks at all.
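To make “a trace that captures both execution and reasoning” concrete, here is a minimal sketch in plain Python. `ReasoningSpan` is a hypothetical stand-in for a tracing span, not an OpenTelemetry class, and the `agent.*` attribute names are illustrative assumptions rather than a published semantic convention. In a real pipeline the same key/value pairs would presumably be set on an actual span and exported.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReasoningSpan:
    """Hypothetical stand-in for a tracing span (not an OpenTelemetry type)."""
    name: str
    trace_id: str
    attributes: dict = field(default_factory=dict)
    start: float = field(default_factory=time.time)
    end: Optional[float] = None

    def set_attribute(self, key: str, value) -> None:
        self.attributes[key] = value


def record_action(trace_id, action, context, options, chosen, rationale, expected):
    """Capture what the agent did and why it did it in a single span."""
    span = ReasoningSpan(name=f"agent.action.{action}", trace_id=trace_id)
    # Attribute names are assumptions made for this sketch.
    span.set_attribute("agent.context", json.dumps(context, sort_keys=True))
    span.set_attribute("agent.options_considered", list(options))
    span.set_attribute("agent.option_chosen", chosen)
    span.set_attribute("agent.rationale", rationale)
    span.set_attribute("agent.expected_outcome", expected)
    span.end = time.time()
    return span


span = record_action(
    trace_id="trace-001",
    action="rollback",
    context={"service": "api", "error_rate": 0.31},
    options=["restart", "rollback"],
    chosen="rollback",
    rationale="error rate rose right after the last deploy",
    expected="error rate returns to baseline",
)
```

The point of the shape is that the four questions from the 2am scenario (context, options, choice, expectation) each get their own queryable field instead of being buried in a log line.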

Organizations treating agent memory as ‘just a bigger context window’ are solving the wrong problem.

Snowflake Cortex fully integrated AI functions into standard SQL in February, allowing real-time text-to-SQL and unstructured data analysis without leaving the warehouse. MongoDB’s Voyage 4 models set new standards for retrieval accuracy in RAG. But the real shift is episodic memory: platforms introducing ‘state management’ for agents directly on the data tier.

This is the convergence of the agentic data stack. Agents don’t just query databases. They maintain long-term memory of previous transactions and interactions stored directly in the data layer. When an agent needs to understand what happened last week or why a previous decision was made, that’s not in a prompt or a context window. It’s not even in a database per se, but committed to the agent’s knowledge graph: versioned, queryable, auditable, and, in principle, content-addressable.
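What “versioned, queryable, auditable, and content-addressable” could mean in practice is easier to see in code. The sketch below is an assumption-laden toy, not any vendor’s design: `EpisodicMemory` and its methods are hypothetical names, and the store is an in-memory dict where a real system would use the data tier.

```python
import hashlib
import json


class EpisodicMemory:
    """Toy append-only memory where each episode is addressed by its hash."""

    def __init__(self):
        self._objects = {}  # content hash -> serialized record
        self._log = []      # commit order, oldest first

    def commit(self, episode: dict) -> str:
        # Chaining to the parent hash makes the history tamper-evident.
        parent = self._log[-1] if self._log else None
        record = {"parent": parent, "episode": episode}
        blob = json.dumps(record, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(blob).hexdigest()
        self._objects[digest] = blob
        self._log.append(digest)
        return digest

    def get(self, digest: str) -> dict:
        return json.loads(self._objects[digest])["episode"]

    def history(self) -> list:
        return [self.get(d) for d in self._log]

    def verify(self) -> bool:
        # Audit: every stored blob must still hash to its own address.
        return all(
            hashlib.sha256(blob).hexdigest() == digest
            for digest, blob in self._objects.items()
        )


memory = EpisodicMemory()
first = memory.commit({"decision": "rollback", "reason": "bad deploy"})
second = memory.commit({"decision": "scale_up", "reason": "queue depth"})
```

Asking “why was that decision made last week” then becomes a lookup by hash or a walk over `history()`, not a hope that the answer still fits in a context window.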

Mastra’s Datasets (18 Feb) exemplify this: versioned test cases with native JSON schema validation and SCD-2 versioning (the data warehousing technique of using surrogate keys for immutable updates). What’s interesting here is that an ostensible application framework is taking on deployment responsibilities. The primary use case isn’t running the agent; it’s providing a fully hydrated test harness for every agent release. Their Observational Memory (OM) hit 94.87% on LongMemEval with GPT-5-mini, which matters less as a benchmark than as a pattern: agents need memory systems that persist across sessions, survive restarts, and provide context without consuming the entire context window. Organizations treating agent memory as “just use a bigger context window” (or even a better managed one) are solving the wrong problem. This is how the AI-native SDLC in general, and reasoning as infrastructure in particular, collapse the stack to solve the problem of state compression.
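For readers who haven’t met SCD-2 outside a data warehouse, here is a rough sketch of the general technique. This is not Mastra’s implementation; `Scd2Table`, `Row`, and the field names are hypothetical. An update never overwrites a payload: it closes the current version’s validity window and appends a new row under a fresh surrogate key.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from itertools import count
from typing import Optional


@dataclass(frozen=True)
class Row:
    surrogate_key: int           # unique per version
    natural_key: str             # stable identity of the test case
    payload: dict
    valid_from: datetime
    valid_to: Optional[datetime]  # None means "current version"


class Scd2Table:
    def __init__(self):
        self._rows = []
        self._next_key = count(1)

    def upsert(self, natural_key, payload, now=None):
        now = now or datetime.now(timezone.utc)
        for i, row in enumerate(self._rows):
            if row.natural_key == natural_key and row.valid_to is None:
                # Close the old version; its payload is left untouched.
                self._rows[i] = replace(row, valid_to=now)
        new_row = Row(next(self._next_key), natural_key, payload, now, None)
        self._rows.append(new_row)
        return new_row

    def current(self, natural_key):
        for row in self._rows:
            if row.natural_key == natural_key and row.valid_to is None:
                return row
        return None

    def as_of(self, natural_key, when):
        # Point-in-time lookup: which version was live at `when`?
        for row in self._rows:
            if (row.natural_key == natural_key and row.valid_from <= when
                    and (row.valid_to is None or when < row.valid_to)):
                return row
        return None


table = Scd2Table()
feb_1 = datetime(2026, 2, 1, tzinfo=timezone.utc)
feb_18 = datetime(2026, 2, 18, tzinfo=timezone.utc)
table.upsert("case-1", {"expected": "refund issued"}, now=feb_1)
table.upsert("case-1", {"expected": "refund escalated"}, now=feb_18)
```

The `as_of` query is what makes this useful as a test harness: each agent release can be evaluated against the dataset exactly as it stood on a given date.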

The new standard isn’t ‘alerting’; it’s the collapse of the stack, where observability data becomes the direct input for autonomous self-healing

The breakthrough is closed-loop systems where measurement, action, and learning happen in the same stroke. Google Cloud’s OTEL announcement isn’t so much about better dashboards, though that’s a welcome side effect. The bigger impact is on agents using telemetry to automatically remediate issues without human intervention. Not ‘alert the on-call engineer.’ Execute the fix, document what was done, update the runbook. It’s what puts the ‘great’ in The Great Replacement.

This only works if the observability infrastructure captures agent reasoning. When an agent makes a breaking change at 2am, the telemetry needs to show: what failure it detected, what remediation options it considered, why it chose this specific fix, what it expected the outcome to be. If something goes wrong, you’re not reconstructing from logs. You’re replaying the reasoning trace alongside the (CQRS) commands to understand what the agent got wrong.
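A closed loop of this kind can be sketched in a few lines. Everything named here is hypothetical: the playbook format, the `remediate` and `replay` helpers, and the “take the first option” policy are placeholders standing in for whatever the agent actually does. What the sketch does show is the reasoning and the issued commands being recorded together, so they can be replayed side by side.

```python
from dataclasses import dataclass, field


@dataclass
class Remediation:
    failure: str
    options: list
    chosen: str
    rationale: str
    expected: str
    commands: list = field(default_factory=list)


def remediate(failure, playbook, execute):
    """Pick a fix, record the reasoning, then issue the commands."""
    options = playbook.get(failure, [])
    if not options:
        raise LookupError(f"no remediation known for {failure!r}")
    chosen = options[0]  # placeholder policy: first listed option wins
    record = Remediation(
        failure=failure,
        options=[o["name"] for o in options],
        chosen=chosen["name"],
        rationale=chosen["why"],
        expected=chosen["expect"],
    )
    for command in chosen["commands"]:
        record.commands.append(command)  # log the command before running it
        execute(command)
    return record


def replay(record):
    """Interleave the reasoning with the commands it produced, in order."""
    steps = [
        f"detected: {record.failure}",
        f"considered: {', '.join(record.options)}",
        f"chose: {record.chosen} because {record.rationale}",
        f"expected: {record.expected}",
    ]
    steps += [f"command: {command}" for command in record.commands]
    return steps


playbook = {
    "error_rate_spike": [
        {"name": "rollback", "why": "spike began after the last deploy",
         "expect": "error rate back to baseline", "commands": ["rollback api"]},
        {"name": "restart", "why": "cheap to try",
         "expect": "transient fault clears", "commands": ["restart api"]},
    ]
}
executed = []
record = remediate("error_rate_spike", playbook, executed.append)
```

Calling `replay(record)` afterwards yields the detection, the options, the choice, the expectation, and then each command, which is the 2am postmortem in five lines.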

Companies actually shipping this aren’t just treating observability as monitoring infrastructure, despite the name. They’re also treating it as the trust layer that makes autonomous execution possible. Without it, you have agents operating in production with no way to explain their decisions. With it, you have auditable, reproducible, debuggable autonomy.

The trust model that works isn’t “limit what agents can do”; it’s “sandbox everything and make all reasoning auditable.”

The stark reality is that you can’t prevent agents from trying things you didn’t anticipate. (Just ask OpenClaw.) The trust model that works isn’t “limit what agents can do” (they’ll find ways around it). It’s “sandbox everything and make all reasoning auditable.” When agents generate code, you need runtime guarantees that even incorrect code can’t violate memory safety or create persistent vulnerabilities. When agents make decisions, you need telemetry that captures the reasoning chain. When agents maintain state, you need (vector) databases that version everything immutably.

This is why Rust rewrites matter, and moreover why Monty is written in Rust. Memory safety isn’t optional when you’re executing arbitrary code generated by AI. Following 2025’s regulatory pushes, 2026 has seen massive infrastructure rewrites. Major tech companies are completing critical kernel and middleware rewrites in Rust to comply with new global memory-safety standards. The “memory-safe mandate” is forcing architectural decisions that seemed theoretical last year.

The architecture is: sandboxed execution + reasoning traces + episodic memory + self-healing remediation. Remove any component and production deployment becomes reckless. The majority of orgs have at most two of the four. The gap between those with all four and those still workshopping governance is the space where CTOs, both full-time and fractional, need to put on their CDAO hat and conduct a gap analysis.

Immediate (30 days): Implement reasoning traces for every agent in pilot. If you can’t reconstruct why an agent made a decision, you can’t deploy it. Use OpenTelemetry with agent-specific spans. Store traces for 90 days minimum. Build a dashboard showing agent decision paths for the last 1000 actions.
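The two numbers in that checklist (90 days of retention, the last 1000 actions on the dashboard) translate into very little code. A minimal sketch, assuming an in-memory list where a real deployment would use a trace backend; `DecisionLog` and its methods are hypothetical names:

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)  # minimum trace retention from the checklist
WINDOW = 1000                   # number of actions shown on the dashboard


class DecisionLog:
    def __init__(self):
        self._records = []  # (timestamp, agent, decision_path), oldest first

    def record(self, timestamp, agent, decision_path):
        self._records.append((timestamp, agent, tuple(decision_path)))

    def prune(self, now):
        # Drop only what has aged past the retention window.
        cutoff = now - RETENTION
        self._records = [r for r in self._records if r[0] >= cutoff]

    def dashboard_rows(self):
        return self._records[-WINDOW:]

    def __len__(self):
        return len(self._records)


now = datetime(2026, 3, 4, tzinfo=timezone.utc)
log = DecisionLog()
log.record(now - timedelta(days=120), "agent-a", ["detect", "rollback"])
for i in range(1005):
    log.record(now - timedelta(minutes=1005 - i), "agent-a", ["detect", "fix"])
log.prune(now)
```

After `prune`, the 120-day-old record is gone, all 1005 recent ones remain stored, and `dashboard_rows()` returns only the newest 1000.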

Near-term (60 days): Deploy episodic memory infrastructure. Agents need persistent state that survives restarts. This isn’t “use a vector database.” It’s versioned, auditable state management with immutable updates. Take a look at Mastra’s Datasets approach, or build an equivalent. The test harness for agents needs to be as rigorous as your production CI/CD pipeline.

Strategic (90 days): Migrate observability infrastructure to O11y 2.0. Not better logging. Telemetry as first-class code where agents can read their own traces and self-remediate. This is the unlock for autonomous operations. Without it, you’re stuck in human-in-the-loop forever.

The Agentic Web doesn’t wait for organizations to be ready. The question is whether your infrastructure can support what agents actually do, or whether you’re still treating them as chatbots with API access. The gap between these two approaches is the difference between production deployment and expensive demos that never ship.

Next time: Where is MCP headed? How to tell what’s momentum and what’s inertia.
