One Stores Things; The Other Thinks


There is a moment in every conversation where a pause stops being natural and starts being wrong. Between humans, that threshold is about 300 milliseconds. Below it, the rhythm feels normal. Above it, we shorten our sentences, and the conversation becomes a series of transmissions. At 100ms, it feels instant. At 300ms, you notice.

At 500ms, you’ve connected a state-of-the-art multimodal AI to a walkie-talkie.

We are building an AI platform, and have a voice interface on the roadmap. Our users are all over the world, on unreliable cellular and WiFi, in San Francisco, London, Dubai, Sydney. Our origin server is in Oregon. The physics of that distance, through that network weather, makes real-time voice impossible on a traditional client-server architecture.

Agentic AI changes what counts as latency. In a multi-agent system, every tool call, safety check, retrieval hop, and model invocation (some local, some at the edge, some in a hyperscaler datacenter) becomes part of the user-visible serial path. The architecture that connects them becomes the bottleneck.

You can’t magically bolt 50ms latency onto a legacy architecture, negotiate with the speed of light, or fix the last mile.

The only thing left is to change the architecture.

This essay is about why we had to, what we changed it to, and why the pattern we landed on feels to me like a fundamental shift in how software (and agentic systems) will be built on the internet.

(It’s out of scope for this essay, but while low latency is nice for UX, for embodied agents it is a prerequisite for functional safety.)

To understand why the architecture had to change, we need to understand what actually happens to your packets on a bad network.

When your user is on a congested WiFi network, packets drop. That is the physical reality of radio waves in a crowded room. When a packet drops, TCP (the foundational protocol of the internet) assumes the broader network is congested. To prevent a meltdown, it drastically cuts its sending rate, and if retransmissions keep timing out, it doubles its wait before each retry. That retry behavior is called exponential backoff; the whole self-preservation mechanism is congestion control.

To rebuild that speed, the client and server have to successfully pass packets and acknowledgments (ACKs) back and forth to prove the network is clear.

Here is where the physics bites: the time it takes to complete that recovery cycle is dictated by the physical distance between the user and your server—the Round Trip Time (RTT). Each back-and-forth takes time, and if your server is in Oregon and your user is in London, every acknowledgment takes a 150-millisecond round trip.

Because it takes so long to acknowledge packets over that physical distance, the recovery is brutally slow. Each dropped packet forces you to pay that 150ms penalty, just to try again. As a result, your achievable throughput is inversely proportional to both your RTT and the square root of your packet loss rate.

(Network engineers call this the Mathis equation. We just need to remember that distance multiplies the penalty of a bad connection.)
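To put rough numbers on that relationship, here is a minimal sketch of the Mathis approximation, rate ≈ (MSS / RTT) × (C / √p). The segment size, the 2% loss rate, and the 20ms edge RTT are illustrative assumptions, not measurements from our platform; only the 150ms figure comes from the London-to-Oregon example above.

```python
import math


def mathis_throughput_bps(mss_bytes: float, rtt_s: float, loss_rate: float) -> float:
    """Approximate steady-state TCP throughput in bits per second.

    Mathis approximation: rate ~= (MSS / RTT) * (C / sqrt(p)),
    with C = sqrt(3/2) for a simple periodic-loss model.
    """
    if rtt_s <= 0 or not 0 < loss_rate <= 1:
        raise ValueError("rtt_s must be > 0 and loss_rate must be in (0, 1]")
    c = math.sqrt(3 / 2)
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


# Illustrative inputs (assumptions, not measurements).
MSS = 1460          # bytes, a common Ethernet-sized TCP segment
LOSS = 0.02         # assumed 2% packet loss on a congested last mile
ORIGIN_RTT = 0.150  # seconds, the London-to-Oregon figure from the text
EDGE_RTT = 0.020    # seconds, assumed RTT to a nearby edge node

if __name__ == "__main__":
    for label, rtt in (("origin", ORIGIN_RTT), ("edge", EDGE_RTT)):
        mbps = mathis_throughput_bps(MSS, rtt, LOSS) / 1e6
        print(f"{label}: ~{mbps:.2f} Mbit/s")
```

Whatever loss rate you plug in, the ratio between the two cases is just the ratio of the RTTs: at the same loss, the 20ms path sustains 7.5 times the throughput of the 150ms path.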

A 150ms RTT to Oregon over bad WiFi quickly cascades into a 3-to-8-second page load. A 15-second timeout. A user standing there awkwardly at a spinner.

When proposing edge compute to solve this, you might encounter this piece of industry skepticism: If the last mile network is congested and unpredictable, the user is going to have a bad experience anyway. Even after optimizing the transit, the latency will still be high enough to cause a bad UX, so why bother?

The first part of that is right: you can’t magically guarantee a pristine 20ms latency on a saturated connection. If you have a massively congested connection, you will have packet loss, which adds effective latency.

But the conclusion misses a fundamental mechanic of network recovery. The goal isn’t just to move compute closer; the goal is TCP termination right at the edge.

Because the last mile is heavily congested and unreliable, you don’t want to keep connections open for long periods while packets traverse the rest of the network back to the origin. By terminating TCP and ACKing packets at the edge, your connections only have to cross the local last mile.

This round-trip time is much shorter, so bandwidth recovers faster, and users suffer fewer TCP exponential backoffs and restarts. We can’t fix the last-mile, but we can isolate the penalty to the last mile instead of multiplying it across a cross-country (or under-sea) transit.
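A back-of-the-envelope sketch of why the shorter loop recovers faster. It assumes a deliberately simplified Reno-style sender: the window is halved on a loss, then grows by one segment per round trip. Real stacks (CUBIC, BBR) behave differently, and the window size and the 20ms edge RTT here are assumptions for illustration.

```python
def recovery_time_s(cwnd_segments: int, rtt_s: float) -> float:
    """Seconds for a Reno-style sender to regrow its window after one loss.

    Simplified model: the window is halved on loss, then grows by one
    segment per round trip until it is back to its previous size.
    """
    if cwnd_segments < 2 or rtt_s <= 0:
        raise ValueError("need cwnd_segments >= 2 and rtt_s > 0")
    halved = cwnd_segments // 2
    round_trips = cwnd_segments - halved  # one segment regained per RTT
    return round_trips * rtt_s


if __name__ == "__main__":
    window = 40  # segments in flight before the loss (assumed)
    print("origin-terminated:", recovery_time_s(window, 0.150), "s")
    print("edge-terminated:  ", recovery_time_s(window, 0.020), "s")
```

In this model the number of round trips needed is the same either way; terminating at the edge only changes how long each one takes, which is the whole point.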

Edge skepticism usually boils down to outdated heuristics.

Not so long ago, it was impossible to start up a zero-latency worker that had any real capability. Historically, edge functions were dumb, highly constrained, and expensive. It made sense to dismiss them as “just a CDN for caching.”

Today, that heuristic is completely wrong.

Modern platforms give you a full V8 runtime, distributed SQL, message queues, and durable objects. The category name (“Edge”) stayed the same, but the capability completely transformed. We are no longer just caching static assets; we are executing relational queries and AI security classifiers.

One stores things; the other thinks.

About once a decade, there is a fundamental swing in how we build software.

Mainframes to minicomputers.

Client-server to the web.

On-prem to cloud.

Each transition followed the same arc: the new paradigm looked like a toy, the old guard dismissed it with legitimate-sounding objections rooted in real experience, and then the economics and the physics made the outcome inevitable.

What makes this particular transition different is that it is not just about where compute lives. It is about what the network itself is becoming, and why AI (in the myriad physical and ethereal forms it’ll take) requires this particular evolution.

It’s not just about caching, and we didn’t just add a CDN. We shifted to a fundamentally different, architecturally simpler pattern: the Origin-Edge-Client model.

We deployed a Cloudflare Worker that sits between the mobile app (Client) and our Express backend (Origin). It is a distributed compute layer running:

  • Edge SQL (D1) for chat history with cursor-based sync.*

  • KV Cache for user profiles and auth tokens.

  • JWT Verification executed entirely at the edge against cached JWKS.

  • Message Queues for fire-and-forget writes.

  • A Rust Security Scanner compiled to WASM.
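For readers who haven't met cursor-based sync, here is a minimal sketch of the idea. It uses Python's standard-library `sqlite3` as a stand-in for the edge database (D1 is SQLite-based, but the Worker would use its own binding rather than this module), and the table, column, and function names are hypothetical rather than our actual schema.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with a hypothetical chat-history table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " chat_id TEXT NOT NULL,"
        " body TEXT NOT NULL)"
    )
    return conn


def append_message(conn: sqlite3.Connection, chat_id: str, body: str) -> int:
    """Insert a message and return its monotonically increasing id."""
    cur = conn.execute(
        "INSERT INTO messages (chat_id, body) VALUES (?, ?)", (chat_id, body)
    )
    conn.commit()
    return cur.lastrowid


def sync_since(conn: sqlite3.Connection, chat_id: str, cursor: int, limit: int = 50):
    """Return messages newer than `cursor`, plus the cursor for the next call."""
    rows = conn.execute(
        "SELECT id, body FROM messages"
        " WHERE chat_id = ? AND id > ? ORDER BY id LIMIT ?",
        (chat_id, cursor, limit),
    ).fetchall()
    next_cursor = rows[-1][0] if rows else cursor
    return rows, next_cursor
```

The client stores only the last id it has seen and sends it back on the next request, so a reconnect after a dropped connection fetches just the gap instead of the whole history.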

Did that bot just defame someone?

Because the security scanner can run inside an edge worker, every single prompt gets scanned before it ever touches our backend. We can check everything, always, without adding a network round trip. Without edge compute, every single security scan would require a round-trip to Oregon. One of those truisms of security is that it will get disabled if it degrades the user experience. A fundamentally faster architecture allows us to spend more resources on security, without compromising the user experience.

The power and elegance of this pattern has this stubborn Python missionary falling for WASM and V8.

* And brushing up on CAP theorem. 🤷🏾‍♂️

The win is not that the edge removes tradeoffs; it is that it lets you choose them deliberately: strong consistency where correctness demands it, and global low-latency everywhere else.

Once you move compute and data to the edge, the geography of your application inverts. Your users are no longer far from your logic. They are never more than the last mile away from it, and that last mile is the same one they’d have to cross to reach anything on the internet.

Depending on your cellular provider or ISP, the time to reach the edge will vary. It will probably be less than 30ms for most places, and certainly less than 5ms with home fiber. You won’t be able to save the trip to the origin 100% of the time, but for a massive percentage of requests, you can compute and cache right there.

Edge doesn’t make ASR, reasoning, or speech synthesis magically fast; it removes avoidable network seriality so the latency budget gets spent on inference rather than on dragging packets across continents.

You gain a new compute and storage tier with globally consistent latency to clients, without having to manage the deployment and scaling of the underlying services. This is a managed service where the juice is worth the squeeze.

If you have to serve end users all over the world in diverse network weather, why would you ever provision a small ‘n’ of servers in a handful of datacenters when you could instead be at every major internet peering point… for the same or lower cost?

This brings us to the final benefit of the Origin-Edge-Client pattern: it allows you to aggressively specialize your backend.

For years, the application backend was a tangled mess of web routing, business logic, auth, and database syncing. But when the edge handles your global routing, cached reads, auth, and absorbs all the malicious traffic, your origin no longer needs to act like a traditional web server.

The backend becomes an AI factory.

Because the edge layer acts as a massive shield and router (increasingly running accelerated inference on hardware like Apple Silicon, mobile TPUs, NVIDIA Jetson, or IGX), the origin only ever sees clean, authenticated, highly specific workloads. You don’t need to waste expensive, high-bandwidth compute cycles on parsing JWTs or filtering junk traffic.

In my prototype stack, pushing the web-tier complexity to the edge meant my Origin could focus on multi-model AI orchestration and managing a vector store. At enterprise scale, by stripping away the web-tier complexity, your origin is freed to do the one thing it is built for. When the edge absorbs the chaos of global distribution, the origin becomes a pure, uninterrupted engine of massive intelligence.

Many folks assumed that any call to an LLM would always be slow and never considered latency a solvable problem. At Kimono, we took the other bet, and treat performance as a first-class design goal for our agentic AI platform. We bet that agentic systems would be bottlenecked by how fast we could move information between agents and parallelize them. That’s why we built our security service in Rust to scan absolutely everything, in (imperceptible, even when chained) microseconds. When agents start calling other agents, each sequential hop is a serial fraction. The difference between a sluggish agentic system and a fast one reduces to: how many of those hops we can eliminate or parallelize, and how fast we can move context between them.
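A toy illustration of the serial-hop point, using `asyncio`. The hop names and the 50ms latencies are made up, and `call_agent` is a hypothetical stand-in for a real tool call or model invocation; the only thing the sketch shows is that independent hops cost their sum when chained and roughly their maximum when run concurrently.

```python
import asyncio
import time


async def call_agent(name: str, latency_s: float) -> str:
    """Hypothetical stand-in for a tool call, safety check, or model call."""
    await asyncio.sleep(latency_s)
    return name


async def run_sequential(hops):
    """Await each hop in turn: total time is roughly the sum of latencies."""
    start = time.perf_counter()
    results = [await call_agent(name, latency) for name, latency in hops]
    return results, time.perf_counter() - start


async def run_parallel(hops):
    """Run independent hops concurrently: total time is roughly the slowest hop."""
    start = time.perf_counter()
    results = await asyncio.gather(
        *(call_agent(name, latency) for name, latency in hops)
    )
    return list(results), time.perf_counter() - start


# Assumed, illustrative hops; none of these numbers are measurements.
HOPS = [("safety_check", 0.05), ("retrieval", 0.05), ("tool_call", 0.05)]

if __name__ == "__main__":
    _, sequential = asyncio.run(run_sequential(HOPS))
    _, parallel = asyncio.run(run_parallel(HOPS))
    print(f"sequential: {sequential * 1000:.0f} ms, parallel: {parallel * 1000:.0f} ms")
```

This only helps for hops that don't depend on each other's output. Anything that does stays on the serial path, which is why making each remaining hop cheap still matters.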

We’re just copying Nature

Biological or artificial, the hard limit on intelligence is the bandwidth between memory and compute.

On chip, it’s the bus between cache and ALU.

In a datacenter, it’s the fabric between GPUs and between GPUs and storage.

In the brain, it is the density of synaptic connections between neurons. The corpus callosum (the largest white matter structure in the human brain) exists to provide massive bandwidth between the left and right hemispheres. It contains roughly 200 million axonal fibers, and when it’s damaged or severed, neither hemisphere loses its own processing power. Intelligence degrades as the bandwidth between them degrades.

The speed of thought is gated by how fast you can move information to the thing that processes it. It’s why (if there was a “why”) neurons do both, I reckon?

The Vera CPU announcement at GTC this week made memory bandwidth an explicit design priority. It felt like a confirmation of something we had already discovered the hard way at the application layer. NVLink exists for the same reason. And on GPUs, those export-controlled wonders of brute compute, while each generation’s new SMs take top billing with transistor counts and TOPS figures, most of the energy is spent getting the bits to and from those beasts.

Now extend that principle beyond the chip, beyond the rack, to the network itself.

There is a useful way to understand two fundamentally different philosophies of building systems.

Intel was, for decades, the Moore’s Law company. Make the transistor smaller, faster, and cheaper, and trust that software would absorb the gains. You didn’t rethink the architecture; the code just ran faster as the hardware got upgraded.

NVIDIA has always been an Amdahl’s Law company, and they wear it on their sleeve. Jensen has said it plainly: the idea that you would use general-purpose computing and just keep adding transistors is “so dead.” Amdahl’s Law says the maximum speedup of a system is limited by the fraction of work that cannot be parallelized. You cannot optimize your way out of a bottleneck by making the non-bottlenecked parts faster.

You have to restructure the system to eliminate the constraint. As Jensen frames it: you want to accelerate every single step, because only in doing the whole thing can you materially improve the cycle.
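Amdahl’s Law is compact enough to write down. If a fraction p of the work can be sped up by a factor s, the overall speedup is 1 / ((1 − p) + p / s). A minimal sketch, with function names of my own choosing:

```python
def amdahl_speedup(improvable_fraction: float, factor: float) -> float:
    """Overall speedup when only part of the work gets faster.

    S = 1 / ((1 - p) + p / s), where p is the fraction of the work that
    can be improved and s is how much faster that fraction becomes.
    """
    if not 0 <= improvable_fraction <= 1 or factor <= 0:
        raise ValueError("need 0 <= improvable_fraction <= 1 and factor > 0")
    untouched = 1 - improvable_fraction
    return 1 / (untouched + improvable_fraction / factor)


def amdahl_limit(improvable_fraction: float) -> float:
    """Upper bound on speedup as the improved part becomes infinitely fast."""
    if not 0 <= improvable_fraction < 1:
        raise ValueError("need 0 <= improvable_fraction < 1")
    return 1 / (1 - improvable_fraction)


if __name__ == "__main__":
    # Even if 90% of the work becomes 10x faster, the result is well under 10x.
    print(f"{amdahl_speedup(0.9, 10):.2f}x, ceiling {amdahl_limit(0.9):.1f}x")
```

The untouched 10% sets a ceiling of 10x no matter how much hardware you throw at the other 90%.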

This is why they don’t just ship faster GPUs. They build NVLink, DGX, and InfiniBand because at every scale there is a bandwidth wall between compute and the next piece of compute or storage, and the work is finding those walls and tearing them down.

We kept arriving at the same conclusion from the application layer. The bottleneck is never the compute. It is always the distance between things: between the user and the logic, between memory and compute, and the serial hops that connect them.

When Dean and Ghemawat published the MapReduce paper in 2004, the insight was not "use more servers." It was that search could be restructured as a parallelizable operation across commodity hardware, turning the coordination layer into the product.

Before Google, DEC’s AltaVista was the undisputed champion of search. DEC was also the maker of the legendary Alpha 21064 processor. It was among the first 64-bit RISC microprocessors, and one of the fastest processors on earth when it launched. I wanted a DEC Alpha t-shirt when many kids my age had Countach posters.

To show off the Alpha, DEC created the search engine AltaVista. It quickly beat all the other search engines on the market. The raw power of the Alpha gave AltaVista an advantage. It didn't matter, because the serial constraint everyone else accepted (you need a bigger, faster machine) was reframed by Google as an architectural choice that could be eliminated. The most powerful machine in the old architecture inevitably gets replaced by simpler things that look like toys. This is the Innovator’s Dilemma of systems design.

Intelligence runs in parallel. Intelligent systems need to be seen through this lens to be optimized for it.

Moore’s Law is about making the component better. Amdahl’s Law is about making the system faster. Making your origin server faster is Moore’s Law: improve the component, hope the system benefits. The gains are real and free, but they’re available to everyone else, too.

When your users are scattered across the globe on unreliable networks, the origin is not the bottleneck. The network transit is the serial fraction, and no amount of origin performance will fix the physics of a 150ms round trip through a congested last mile. The only way through is to restructure the system so the bottleneck itself disappears.

The modern network is not a bunch of dumb pipes. It is practically all software-defined. Every hop is a programmable computer. With AI workloads flowing through it, the network is becoming something closer to a living, adaptive system—routing, caching, and computing at every tier based on the demands of the moment.

And the demands are about to get dramatically more intense. We are heading toward a world where your phone is simultaneously pushing work to a local on-device model for instant inference, an edge node twenty miles away for low-latency orchestration, and a hyperscaler origin for heavy multi-billion-parameter reasoning. AR applications, real-time voice agents, and spatial computing will require all three tiers working in concert. 5G and edge compute are prerequisites. The edge may even move closer to cellular base stations to shave off the last few milliseconds, and perhaps even add yet another tier (modern CPUs have three cache tiers).

Your phone will not be making one connection to one server. It will be routing requests to the same company’s services across multiple physical tiers (edge, origin, maybe multiple origins) simultaneously. A single user action might fan out across different compute layers, different geographies, different hardware classes, all stitched together by an intelligent network that knows where to send what.
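One way to picture "knows where to send what" is a router that picks the closest tier capable of serving a request. This is a toy sketch, not a description of any real system: the tier names follow the device, edge, and origin split above, but every latency and capacity figure is an assumption invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    rtt_ms: float              # assumed network round trip to reach the tier
    max_model_params_b: float  # assumed largest model it can serve, in billions


# Hypothetical tiers; all numbers are illustrative assumptions.
TIERS = [
    Tier("device", 0.0, 3.0),
    Tier("edge", 20.0, 30.0),
    Tier("origin", 150.0, 1000.0),
]


def pick_tier(required_params_b: float, tiers=TIERS) -> Tier:
    """Return the lowest-latency tier able to serve the requested model size."""
    capable = [t for t in tiers if t.max_model_params_b >= required_params_b]
    if not capable:
        raise ValueError("no tier can serve a model this large")
    return min(capable, key=lambda t: t.rtt_ms)


if __name__ == "__main__":
    for size in (1.0, 8.0, 400.0):
        print(f"{size}B parameters -> {pick_tier(size).name}")
```

A real router would weigh more than model size (battery, privacy, cost, current load), but the shape of the decision is the same: go only as far from the user as the work requires.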

We are not just adding a caching layer to the internet. The network itself is becoming a tiered compute architecture, with intelligence distributed across every layer. When I reason about it, the right analogy feels closer to neurons and myelin than pipes and pumps.

Abstractions are helpful, but they leak, and the world evolves. Without a first-principles understanding, how can you know the abstraction serves you, and not just whoever is renting it to you? Jensen has talked about why NVIDIA participates in every vertical it serves: AI software is evolving so fast that it changes the hardware requirements, which in turn changes the capabilities and architectures of the software.

This era of rapid, reflexive evolution up and down the stack will confer structural advantages to teams with the ability to operate from first principles and reimagine entire systems.

The network has always been the lifeblood of intelligent systems, and was never going to stay static while everything above and below it transformed. AI is adding new workloads to existing infrastructure, and it is also fundamentally changing what infrastructure needs to be.

At every transition, the new paradigm gets framed as a toy. Edge is just a CDN. It’s just for auth. We heard this about cloud in 2009 and containers in 2014. I try to be honest about this instinct in myself, because I have been wrong at every one of these junctures. The cloud felt deeply wrong when I was running bare metal. Containers felt wrong when I had mastered VMs. Serverless felt wrong because Lambda never felt fast enough. Each time, my discomfort was real, and my intuition was outdated.

The teams who navigate these transitions best are the ones who understand their own stack deeply enough to recognize which principles are permanent and which are artifacts of a temporary constraint.

The speed of light is permanent. TCP congestion control is permanent. The principle that proximity between memory and compute determines the speed of thought is permanent.

The assumption that your origin server needs to be a fortress, that edge functions are toys, that you need a fleet of managed services to serve a global audience: those were artifacts. They were true once, but aren’t anymore.

“It ain’t what you don’t know that gets you into trouble — it’s what you know for sure that just ain’t so.” The assumptions we carried from the last decade of cloud architecture are that kind of knowing.

When hardware and software co-evolve this rapidly, the architectures that dominate are the ones that ruthlessly redesign to eliminate bottlenecks. Things that were physically impossible for small teams five years ago are now radically accessible. Today, a small team that understands the physics can build a globally distributed AI platform—sub-millisecond edge security, local model routing, and a frictionless origin—that would have required fifty engineers and a seven-figure infrastructure budget not long ago.

The Origin-Edge-Client model is not a neat trick for bypassing bad venue WiFi. It is the inevitable architecture for an era of distributed intelligence. It separates the fast, unpredictable, physical reality of user interaction from the massive, centralized reasoning of your AI infrastructure, and lets a programmable, accelerated network mediate between them.

Sutha Kamal is Chief AI Officer at Kimono, where he leads the design of agentic AI systems that distribute intelligence from origin to edge to device. His career spans nearly three decades of real-time systems: neural networks and computer graphics, VR, mobile games, wireless data platforms, and distributed cloud infrastructure.