There is a temporal trap that catches almost every scaling engineering org. When it comes time to re-evaluate system architecture, the most experienced engineers—the ones who have run systems at scale, survived catastrophic outages, and earned their battle scars—naturally lean heavily on the architectures that survived the last decade.
But when the underlying primitives of technology change, the “safe” way of doing things can quietly become a liability. A proposed architecture doesn’t match the mental model we battle-tested in 2014, so it instinctively feels wrong.
Whenever a team debates moving to an edge-first architecture, a familiar friction emerges. It’s a fascinating look at how our industry evolves, how network physics actually work, and why a specific architectural pattern is fundamentally changing how we build consumer apps.
We’re building a global AI platform. Today, our users are at conferences, on patchy venue WiFi, trying to pull up their digital business cards to show the person standing right in front of them. They are in San Francisco, London, Dubai, Sydney.
Our origin server is in Oregon.
When your user is on a congested conference WiFi network, packets drop. That is just the physical reality of radio waves in a crowded room. When a packet drops, TCP—the foundational protocol of the internet—assumes the broader network is hopelessly congested. To prevent a meltdown, it panics and drastically slashes your connection speed: congestion control cuts the sending window on each loss, and repeated losses trigger exponential backoff of the retry timers.
To rebuild that speed, the client and server have to successfully pass packets and acknowledgments (ACKs) back and forth to prove the network is clear.
Here is where the physics trap you: the time it takes to complete that recovery cycle is dictated entirely by the physical distance between the user and your server—the Round Trip Time (RTT). Every single back-and-forth takes time. If your server is in Oregon and your user is in London, every acknowledgment takes a 150-millisecond round trip.
Because it takes so long to acknowledge packets over that physical distance, your recovery is painstakingly slow. Every dropped packet forces you to pay that 150ms penalty just to try again. This means your achievable throughput is inversely proportional to your RTT and to the square root of your packet-loss rate.
(Network engineers call this the Mathis equation. We just need to remember this: distance multiplies the penalty of a bad connection.)
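To see how hard distance multiplies the penalty, here is a back-of-envelope sketch using the simplified Mathis model. The MSS, loss rate, and RTTs are illustrative assumptions, not measurements:

```python
# Back-of-envelope Mathis model: throughput <= (MSS / RTT) * (C / sqrt(p)).
# C is a constant (~1.22 for standard TCP). All numbers below are
# illustrative assumptions, not measurements.
import math

def mathis_throughput_mbps(mss_bytes: float, rtt_s: float, loss_rate: float,
                           c: float = 1.22) -> float:
    """Upper bound on steady-state TCP throughput, in megabits per second."""
    bytes_per_s = (mss_bytes / rtt_s) * (c / math.sqrt(loss_rate))
    return bytes_per_s * 8 / 1e6

mss = 1460    # typical Ethernet MSS, in bytes
loss = 0.02   # assumed 2% loss on congested conference WiFi

to_oregon = mathis_throughput_mbps(mss, 0.150, loss)  # 150 ms RTT to origin
to_edge   = mathis_throughput_mbps(mss, 0.005, loss)  # 5 ms RTT to a nearby PoP

print(f"Origin (150 ms RTT): {to_oregon:.2f} Mbps")
print(f"Edge   (  5 ms RTT): {to_edge:.2f} Mbps")
# Same loss rate, 30x shorter RTT, 30x the throughput ceiling.
```

At the same 2% loss, the throughput ceiling to the distant origin is well under 1 Mbps, while the ceiling over a 5ms path is 30x higher. The loss didn't change; only the distance did.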
A 150ms RTT to Oregon over bad WiFi quickly cascades into a 3-to-8-second page load. A 15-second timeout. A user standing there, waiting awkwardly, while the system spins.
When proposing edge compute to solve this, you will often encounter a very rational piece of industry skepticism: If the last-mile network is heavily congested and unpredictable, the user is going to have a bad experience anyway. Even after optimizing the transit, the latency will still be high enough to cause a bad UX—so why bother?
The first part of that assumption is correct: you cannot magically guarantee a pristine 20ms latency on a saturated connection. If you have a massively congested connection, you will have packet failures, which inherently create effective latency.
But the conclusion misses a fundamental mechanic of network recovery. The goal isn’t just to move compute closer; the goal is TCP termination right at the edge.
Because the last mile is heavily congested and unreliable, you do not want to keep connections open for long periods while packets traverse the rest of the network back to the origin. By terminating TCP and ACKing packets at the edge, your connections only have to cross the local last mile.
The round-trip time is drastically shorter, which means your bandwidth recovers faster. You suffer fewer TCP exponential backoffs and restarts. You aren’t “fixing” the bad WiFi, but you are isolating the backoff penalty to the last mile instead of multiplying it across a cross-country transit.
And to the extent you can move large amounts of data and logic to the edge and drop trips from edge to origin, the last mile is your only mile. Which means the only users who won't see better performance are ones who are somehow closer to your origin datacenter than to the nearest edge. And while I guess that's theoretically possible, I doubt there is any case where a normal human end user has a faster connection to a cloud datacenter than to their local Cloudflare PoP.
Right now, for a visual UI, a 3-second delay is an embarrassing annoyance. But looking at our product roadmap, our next major feature is a conversational voice AI agent.
Voice interfaces live and die in the gap between 50ms and 300ms. At 100ms, users notice a delay. At 300ms, they subconsciously compensate—shortening sentences, pausing to let the system “catch up.” Anything slower stops feeling like a conversation and starts feeling like a walkie-talkie.
You cannot bolt 50ms real-time latency onto a legacy architecture after the fact. If we didn’t solve the network physics today, our future flagship feature would be dead on arrival. We can’t negotiate with the speed of light, or fix the venue WiFi, so we changed the architecture.
We didn’t just add a CDN. We shifted to a fundamentally different, architecturally simpler pattern: the Origin-Edge-Client model.
We deployed a Cloudflare Worker that sits between the mobile app (Client) and our Express backend (Origin). It is a distributed compute layer running:
Edge SQL (D1) for chat history with cursor-based sync.
KV Cache for user profiles, business cards, and auth tokens.
JWT Verification executed entirely at the edge against cached JWKS.
Message Queues for fire-and-forget writes (analytics and onboarding return a 202 Accepted instantly).
A Security Scanner compiled to WASM, running prompt-injection detection in under 1 millisecond.
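In rough Python-flavored pseudocode (every name here, `kv`, `queue`, `forward_to_origin`, and the toy `verify_jwt`, is a hypothetical stand-in for a platform binding; this is a sketch of the pattern, not the actual Worker code), the request path looks something like this:

```python
# Illustrative sketch of the edge request path. kv, queue, and
# forward_to_origin are hypothetical stand-ins for platform bindings
# (KV cache, Queues, fetch-to-origin); not the actual Worker code.

def verify_jwt(token, jwks):
    """Toy stand-in: a real worker verifies the signature against cached JWKS."""
    return {"sub": token} if token and jwks else None

def handle_request(request, kv, queue, forward_to_origin):
    # 1. Auth entirely at the edge, against a cached JWKS.
    claims = verify_jwt(request.get("token"), kv.get("jwks"))
    if claims is None:
        return {"status": 401}

    # 2. Fire-and-forget writes: enqueue, then acknowledge instantly.
    if request["path"] == "/analytics":
        queue.append(request.get("body"))
        return {"status": 202}

    # 3. Cached reads (profiles, business cards) served straight from KV.
    cached = kv.get(request["path"])
    if cached is not None:
        return {"status": 200, "body": cached}

    # 4. Only clean, authenticated cache misses ever reach the origin.
    return forward_to_origin(request)
```

The shape is the point: three of the four branches never leave the edge, so the 150ms transit is paid only on the final fall-through.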
The security scanner illustrates this shift perfectly.
The WASM edge version runs in under a millisecond. Every single prompt gets scanned before it ever touches our backend. We check everything, always, for an effectively zero latency penalty.
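As a toy illustration of the idea (plain Python rather than Rust compiled to WASM, and these patterns are invented examples, not our actual rule set):

```python
# Toy illustration of edge-side prompt screening. The real scanner is Rust
# compiled to WASM; these patterns are invented examples, not the real rules.
import re

INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"you are now (in )?developer mode", re.IGNORECASE),
    re.compile(r"reveal your system prompt", re.IGNORECASE),
]

def scan_prompt(prompt: str) -> bool:
    """Return True if the prompt looks safe, False if it trips a rule."""
    return not any(p.search(prompt) for p in INJECTION_PATTERNS)

# Because this runs before the request ever leaves the edge, every prompt
# is screened without adding a round trip to the origin.
```

The real classifier is far more sophisticated, but the placement is what matters: the check happens at the PoP, so the cost of always-on scanning is measured in microseconds, not round trips.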
The power and elegance of this pattern is so undeniable that it made a stubborn Python bigot like me embrace WASM and V8 wholeheartedly. 🤷🏾‍♂️
The other part of this equation—and the reason it is so transformative—is that you are never more than the distance to the edge away from logic and data.
Depending on your cellular provider or ISP, the time to reach the edge will vary. It will probably be less than 30ms for most places, and certainly less than 5ms with home fiber. You won’t be able to save the trip to the origin 100% of the time, but for a massive percentage of requests, you can compute and cache right there.
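A back-of-envelope sketch of what that means for a typical request mix (the hit rate and both RTTs are assumed numbers for illustration):

```python
# Back-of-envelope: expected request latency when some fraction of requests
# is served entirely at the edge. All numbers are illustrative assumptions.

def expected_latency_ms(edge_rtt: float, origin_rtt: float,
                        hit_rate: float) -> float:
    """Every request pays one edge RTT; misses also pay the edge->origin hop."""
    return edge_rtt + (1 - hit_rate) * origin_rtt

edge_rtt = 20     # ms, client to nearby PoP (assumed)
origin_rtt = 130  # ms, PoP to Oregon origin (assumed)

for hit_rate in (0.0, 0.5, 0.9):
    latency = expected_latency_ms(edge_rtt, origin_rtt, hit_rate)
    print(f"edge hit rate {hit_rate:.0%}: ~{latency:.0f} ms expected")
```

Even without touching the worst case, every point of edge hit rate shaves the origin hop off a corresponding slice of traffic, which is why the average collapses long before the cache is perfect.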
You gain a new compute and storage tier with globally consistent latency to clients, without having to manage the deployment and scaling of the underlying services. This is a managed service where the juice is actually worth the squeeze.
It leads to what I think is the simplest, most profound insight of this pattern:
If you have to serve end users all over the world in diverse network weather, why would you ever provision a small ‘n’ of servers in a handful of datacenters when you could instead be at every major internet peering point… for the same or lower cost?
The skepticism this pattern faces usually boils down to outdated heuristics.
People remember when it was impossible to start up a zero-latency worker that had any real capability. Historically, edge functions were dumb, highly constrained, and expensive. It made sense to dismiss them as “just a CDN for caching.”
Today, that heuristic is completely wrong. Modern platforms give you a full V8 runtime, distributed SQL, message queues, and durable objects. The category name (“Edge”) stayed the same, but the capability completely transformed. We are no longer just caching static assets; we are executing relational queries and AI security classifiers. One stores things; the other thinks.
This brings us to the final benefit of the Origin-Edge-Client pattern: it allows you to aggressively simplify your backend.
For years, paying a massive premium to a major cloud provider for a managed, multi-region database made total sense. The alternative was hiring a dedicated DBA to handle patching and failovers across continents.
But when the Edge handles your global routing, cached reads, auth, and absorbs all the malicious traffic, your Origin takes a fraction of the load. It doesn’t need to be a massive, over-engineered Kubernetes cluster. The origin can become incredibly dumb and heavily commoditized. It only ever sees clean, authenticated, sanitized traffic.
In my prototype stack, I use a $30/month Hetzner VPS running PostgreSQL in a Docker container (with automated backups) acting as the single origin behind the edge layer. Paired with the edge handling the frontlines, this commodity origin flawlessly handles a vector store, a 16-layer security stack, and multi-model AI orchestration. Total cost: less than what a traditional startup pays for one day of managed database hosting. The point is not the price tag—it is the architectural pattern. When the edge absorbs the complexity of global distribution, the origin becomes radically simpler regardless of scale.
About once a decade, there is a fundamental swing in how we build software. Mainframes to minicomputers. Client-server to the web. On-prem to cloud. Each transition followed the same arc: the new paradigm looked like a toy, the old guard dismissed it with legitimate-sounding objections rooted in real experience, and then the economics and the physics made the outcome inevitable.
What makes this particular transition different is that it is not just about where compute lives. It is about what the network itself is becoming, and why AI (in the myriad physical and ethereal forms it’ll take) requires this particular evolution.
This pattern extends beyond network architecture. If the defining insight of the Origin-Edge-Client model is that proximity between data and compute determines performance, then that same principle should govern how we design the software that runs on it—including AI systems.
At Kimono, we took a bet that most teams haven’t: we treated performance as a first-class design goal for our agentic AI platform. Many folks naively assumed that any call to an LLM would be slow and never considered latency a solvable problem. We realized that agentic systems would be bottlenecked not by model inference, but by how fast we could move information between agents and parallelize them. So we built our security service in Rust—to scan absolutely everything, in microseconds that remain imperceptible even when calls are chained.
This is not a quirk of our stack. It is a reflection of something deeper.
Intelligence—biological or artificial—is fundamentally about the bandwidth between memory and compute. This is true at every level of the hierarchy. On a chip, it is the bus between cache and ALU. In a datacenter, it is the fabric between GPU and storage. In the brain, it is the density of synaptic connections between neurons. The corpus callosum—the largest white matter structure in the human brain—exists for no other reason than to provide massive bandwidth between the left and right hemispheres. It contains roughly 200 million axonal fibers, and when it is damaged or severed, you do not lose memory or processing power in either hemisphere. You lose the ability to coordinate them. Intelligence degrades not because the compute is gone, but because the bandwidth is.
The pattern is always the same: the speed of thought is gated by how fast you can move information to the thing that processes it. It’s why, I reckon (if there were a “why”), neurons do both storage and compute.
One of the more interesting announcements during NVIDIA’s GTC keynote this week was the new Vera CPU, where memory bandwidth was an explicit design consideration. Of course it was. It was always going to occur to the HPC, graphics, and AI company that agentic systems would be bottlenecked by bandwidth. This is the company that built NVLink for exactly this reason. And on GPUs—those export-controlled wonders of brute compute—each generation’s new SMs take top billing with transistor counts and TOPS figures, but what takes up all the power? Getting the bits to and from those beasts.
One of the most useful patterns in computer science is caching—adding a faster, smaller tier of storage closer to compute. Look at the recent gains in CPU performance driven by liberally increasing on-chip caches. The pattern works because proximity is speed.
Now extend that principle beyond the chip, beyond the rack, to the network itself.
There is a useful way to understand two fundamentally different philosophies of building systems.
Intel was, for decades, a Moore’s Law company. Make the transistor smaller, faster, and cheaper, and trust that software would absorb the gains. You didn’t need to rethink your architecture. You waited eighteen months and your same code ran faster.
NVIDIA has always been an Amdahl’s Law company, and they wear it on their sleeve. Jensen said it: the idea that you would use general-purpose computing and just keep adding transistors is “so dead.” Amdahl’s Law says the maximum speedup of a system is limited by the fraction of work that cannot be parallelized. You cannot optimize your way out of a bottleneck by making the non-bottlenecked parts faster. You have to restructure the system to eliminate the constraint. As Jensen frames it: you want to accelerate every single step, because only in doing the whole thing can you materially improve the cycle.
This is why they don’t just ship faster GPUs. They build NVLink, DGX, and InfiniBand because at and across every scale there is a bandwidth wall between compute and the next piece of compute, and their entire business is finding those walls and tearing them down.
The most consequential technology companies tend to think this way. When Dean and Ghemawat published the MapReduce paper in 2004, the insight was not “use more servers.” It was that search could be restructured as a parallelizable operation across commodity hardware, turning the coordination layer into the product. The serial constraint everyone else accepted (you need a bigger, faster machine) was reframed as an architectural choice that could be eliminated. The breakthrough is not a better component. It is a refusal to accept the existing serial constraint as permanent. It’s a strategic advantage conferred on teams who insist on clearly understanding everything below every abstraction, so the abstraction works for them, not for whoever is renting it to them.
Intelligence runs in parallel. Intelligent systems need to be seen through this lens to be optimized for it.
Moore’s Law is about making the component better. Amdahl’s Law is about making the system faster, and the biggest gains come from fixing the part you weren’t looking at. This is the most useful lens I know for systems architecture, and it maps directly onto the Origin-Edge-Client pattern. Making your origin server faster is Moore’s Law thinking: improve the component, hope the system benefits. But when your users are scattered across the globe on unreliable networks, the origin is not the bottleneck. The network transit is the serial fraction. No amount of origin performance will fix the physics of a 150ms round trip through a congested last mile. The only way through is to restructure the system so the bottleneck itself disappears.
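Amdahl's Law makes this concrete. Here is a sketch with an assumed split of 150ms of transit and 50ms of origin compute per request:

```python
# Amdahl's Law: overall speedup = 1 / ((1 - p) + p / s), where p is the
# fraction of total time you can accelerate and s is how much you
# accelerate it. The 150 ms / 50 ms split below is an assumed example.

def amdahl_speedup(p: float, s: float) -> float:
    return 1.0 / ((1.0 - p) + p / s)

total = 200.0                  # ms end-to-end (assumed)
origin_fraction = 50 / total   # the slice a faster origin can touch
transit_fraction = 150 / total # the slice edge termination can touch

# Moore's Law thinking: make the origin 10x faster.
faster_origin = amdahl_speedup(origin_fraction, 10)

# Amdahl thinking: restructure so transit shrinks 30x (edge termination).
shorter_transit = amdahl_speedup(transit_fraction, 30)

print(f"10x faster origin:   {faster_origin:.2f}x end-to-end")
print(f"30x shorter transit: {shorter_transit:.2f}x end-to-end")
```

With those assumed numbers, a 10x faster origin buys you barely 1.3x end-to-end, while attacking the transit fraction buys well over 3x. The biggest gains come from the part you weren't looking at.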
The modern network is not a bunch of dumb pipes. It is practically all software-defined. Every hop is a programmable computer. With AI workloads flowing through it, the network is becoming something closer to a living, adaptive system—routing, caching, and computing at every tier based on the demands of the moment.
And the demands are about to get dramatically more intense. We are heading toward a world where your phone is simultaneously pushing work to a local on-device model for instant inference, to an edge node twenty miles away for low-latency orchestration, and to a hyperscaler origin for the heavy multi-billion-parameter reasoning. AR applications, real-time voice agents, and spatial computing will require all three tiers working in concert. 5G and edge compute are not nice-to-haves for these experiences—they are physical prerequisites. The edge may even migrate closer to cellular base stations to shave off the last few milliseconds.
Think about what that means for a single device. Your phone will not be making one connection to one server. It will be routing requests to the same company’s services across multiple physical tiers—edge, origin, maybe multiple origins—simultaneously. A single user action might fan out across different compute layers, different geographies, different hardware classes, all stitched together by an intelligent network that knows where to send what.
This is why I think it pays to deeply understand all the parts of the stack. We are not just adding a caching layer to the internet. The network itself is becoming a tiered compute architecture, with intelligence distributed across every layer. More neurons and myelin than pipes and tunnels.
Jensen said NVIDIA participates in every vertical it serves—robotics, self-driving, enterprise AI—because AI software is evolving so fast that it changes the hardware requirements, which in turn changes the capabilities and architectures of the software. I believe this era of rapid flux up and down the stack will confer significant structural advantages to teams with the ability to operate from first principles and reimagine entire systems.
The network is the lifeblood of intelligent systems. It was never going to stay static while everything above and below it transformed. AI did not just add new workloads to existing infrastructure—it changed what infrastructure needs to be.
At every transition, the new paradigm gets framed as a toy. Edge is just a CDN. It’s just for auth. These are the exact same dismissals we heard about the cloud in 2009 and containers in 2014. I try to be honest about this instinct in myself, because I have been wrong at every one of these junctures. The cloud felt deeply wrong when I was running bare metal. Containers felt wrong when I had mastered VMs. Serverless felt wrong because Lambda never felt fast enough. Each time, my discomfort was real, and my intuition was outdated.
The hardest skill in engineering is not mastering a new tool. It is recognizing when your hard-earned intuition has become a liability.
But there is a corollary to that, and it is the thing I want to leave you with: the engineers who navigate these transitions best are not the ones who abandon everything they know. They are the ones who understand their own stack deeply enough to recognize which principles are permanent and which are artifacts of a temporary constraint.
The speed of light is permanent. TCP congestion control is permanent. The principle that proximity between memory and compute determines the speed of thought—that is permanent.
The assumption that your origin server needs to be a fortress, that edge functions are toys, that you need a fleet of managed services to serve a global audience—those were artifacts. They were true once. They are not true now.
Things that were physically impossible for small teams five years ago are now radically accessible. Today, a small team that understands the physics can build a globally distributed AI platform—sub-millisecond security scanning, edge-resident data, commodity origin—that would have required fifty engineers and a seven-figure infrastructure budget not long ago. The Origin-Edge-Client model is not a neat trick for bypassing bad venue WiFi. It is a profoundly simpler pattern for modern software: separate the fast, messy, global reality of user interaction from the slow, durable platform of your core data, and let the intelligent network mediate between them.
That gap between what is now possible and what most teams still believe is possible—that is where the next generation of transformative systems will be built.
Sutha Kamal is Chief AI Officer at Kimono, where he architects multi-model AI systems on edge-first infrastructure. His career spans nearly three decades at the intersection of compute and real-time systems—from neural networks and computer graphics in the late ’90s, through VR, mobile games, and wireless data platforms, to distributed cloud infrastructure at scale. Today he builds agentic AI platforms where security runs in Rust at microsecond latency and intelligence is distributed from origin to edge to device. He writes about what happens when the architectural assumptions of the last era meet the physics of the next one.
