Building a global deployment platform is hard – a tour of the trade-offs we took

45 points by nicoche 2 years ago · 13 comments


There's actually no such thing as a multi-region anything. There also isn't such a thing as a single-region anything.

Cloud infrastructure is built and managed as a sort of bizarre multi-layer cake, that is actually several cakes. It's like Escher's Cake. Multiple cakes cut into slices and pushed together, where one cake bleeds into another cake seamlessly. Every layer is actually more than one layer, and not.

The way this works is pretty simple: cloud architecture is built on many different services. Those services may operate either on a "global" level, or "regional", or "zonal". By combining different services together into one solution, your solution will be affected by global changes, and by regional and zonal changes. So there's no way to completely isolate it from other areas, even at the zonal level. At some point, a change to a global, regional, or zonal service will affect whatever zone/region/multi-region thing you have, no matter how you architect it. In the cloud, anyway.

(There would be a way if every single service in a cloud provider could be zonally segregated, but they can't be. If you make an IAM change, that's global; there is no way to make it affect only one zone at a time, and no way to ensure it is applied in every zone. DNS is intended to be global; you could technically run your own nameservers and target specific ones in specific zones for a particular update. But even then you could not guarantee an update goes out to all zones, or even one zone.)
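A minimal sketch of that argument, assuming a workload is simply modeled as a list of the services it depends on. The `Service` type, the `exposure` helper, and the "load-balancer" and "vm" entries are hypothetical names for illustration; only IAM and DNS being global comes from the comment above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    scope: str  # one of "global", "regional", "zonal"


def exposure(dependencies):
    """Return every scope at which a change can reach the workload."""
    return {svc.scope for svc in dependencies}


# Hypothetical dependency list for a workload pinned to a single zone.
single_zone_app = [
    Service("iam", "global"),
    Service("dns", "global"),
    Service("load-balancer", "regional"),  # assumed scope, for illustration
    Service("vm", "zonal"),
]

scopes = exposure(single_zone_app)
print(sorted(scopes))
# Even though the workload runs in one zone, it is not zonally isolated.
print("zonally isolated:", scopes == {"zonal"})
```

The point is only that the workload's blast radius is the union of its dependencies' scopes, so one global dependency is enough to break zonal isolation.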

If you built an entire cloud provider from scratch, you could try to build literally every service to be zonal, and expand from there. But no cloud provider I'm aware of has that capability, and I'm not aware of anybody who has DIY'd their own PaaS that does it either. So you can claim you have a global platform, or a regional or zonal platform, but in practice, it is almost always an architectural Escher Cake and subject to the random whims of the universe.

nicocheOP 2 years ago

I completely agree with you. I would even go further: even if you can, you probably do not want to build a set of fully zonal services. As you said, there are some features that you want to provide which should affect "global" entities, like IAM or billing. For this kind of stuff, I believe that it makes sense to have some simple, globally unique components, and to carefully plan around their failure scenarios.

__turbobrew__ 2 years ago

You can get a poor man's multi-topology (DC, region, AZ, rack, etc.) system by splitting a Kubernetes control plane and etcd cluster across that topology. For example, if you have 5 DCs you can run a control plane and etcd instance in each DC and be impervious to 2 of the 5 DCs failing.
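The 2-of-5 figure follows from majority quorum, which etcd (via Raft) requires for writes. A minimal sketch of the arithmetic, with hypothetical helper names:

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


for n in (3, 4, 5, 7):
    print(f"{n} members: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failures")
```

Note that an even member count buys nothing: 4 members tolerate the same single failure as 3, which is why odd cluster sizes are the usual choice.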

The performance of running etcd clusters across high latency links (100-200ms) is not as bad as I would have originally thought and scales into tens of thousands of pods without too much trouble.

As long as you label kubelets with topology information, deploying selectively in different topologies is as simple as specifying the correct node selector. Workload high availability can be ensured with topology spread constraints, which make sure not too many pods of a single workload run in the same failure zone.
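A rough sketch of what that looks like in a Deployment manifest, built here as a Python dict and printed as JSON (which `kubectl apply -f` accepts). The `build_deployment` helper, the app name, image, and region value are hypothetical; the `nodeSelector` and `topologySpreadConstraints` fields and the `topology.kubernetes.io/region` and `topology.kubernetes.io/zone` label keys are standard Kubernetes, assuming your nodes actually carry those labels.

```python
import json


def build_deployment(name: str, image: str, region: str, replicas: int) -> dict:
    """Build a Deployment pinned to one region and spread evenly across its zones."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Only schedule onto nodes labeled with this region.
                    "nodeSelector": {"topology.kubernetes.io/region": region},
                    # Keep the per-zone pod count within 1 of every other zone.
                    "topologySpreadConstraints": [
                        {
                            "maxSkew": 1,
                            "topologyKey": "topology.kubernetes.io/zone",
                            "whenUnsatisfiable": "DoNotSchedule",
                            "labelSelector": {"matchLabels": labels},
                        }
                    ],
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }


manifest = build_deployment("web", "nginx:1.27", "eu-west-1", replicas=6)
print(json.dumps(manifest, indent=2))
```

Using `DoNotSchedule` favors strict spreading over availability; `ScheduleAnyway` is the softer alternative if you would rather have unevenly placed pods than pending ones.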

If you can go k8s native with your infra this stuff becomes easier, but by no means easy.

Sparkyte 2 years ago

Many major companies still don't have a global deployment platform. A lot of stuff is still very centralized because the complexity of managing infrastructure multiplies with the complexity of the existing infrastructure.

ushakov 2 years ago

Do they actually need one? With replicated deployments you will want a globally replicated database as well. This makes operating your app n times as expensive and time-consuming, where n is the number of locations.
- Sparkyte 2 years ago
  
  Correct. That's what most places did when I was working for a pretty global company: they'd keep their central stuff centralized, and only where something needed to be closer to the client, like servers handling video and voice streams, would they deploy it there.

ushakov 2 years ago

Why would I use this instead of Fly.io or AWS Global Accelerator?

nicocheOP 2 years ago

I think that we provide a higher-level experience than Fly. Regarding AWS Global Accelerator, I haven't tested the product, but from what I see we also provide a higher-level experience: we take a GitHub repo or a container image as input, so we can abstract away the VM layer (EC2). We directly run "applications", not virtual machines.
- ushakov 2 years ago
  
  Experience is subjective. People seem to be fine with what Fly offers already. If the high-level experience were a requirement, nobody would be using Fly today. And enterprise customers who need multi-region already have a solution or seek one based on their own cloud (AWS, Azure, and GCP all have multi-region offerings).
  - nicocheOP 2 years ago
    
    Maybe experience is not the right term; "offering" would probably be better (or "more managed experience"). Although some people are perfectly happy with what Fly offers, it's probably less accessible to some others, and hence they do not use it.
    imo Fly targets more techy users than Koyeb. For example, to set up continuous deployments, you are expected to set up a GitHub Action (https://fly.io/docs/app-guides/continuous-deployment-with-gi...). If you know what you are doing, it's fine. We want to provide a more managed experience by embedding these kinds of capabilities in the platform.
    So I would say that Fly is more flexible while we are more accessible.

bithavoc 2 years ago

I feel like Koyeb is bound to make the same mistake as Fly.io, which is moving away from global CDN-like app deployments to regular 12-factor app hosting.

ushakov 2 years ago

It’s not a mistake, it’s what the customer base wants.
The problem with Koyeb is that their business is essentially a niche feature (multi-region). The main selling point of Fly and Cloudflare Workers is not that they’re multi-region, but that they are cheap and easy for developers. It’s a very bloody business to be in, if you ask me…
- bithavoc 2 years ago
  
  Agree with most of what you say. To clarify, I think it is a mistake: Fly.io already discovered that the TAM of multi-region and edge container platforms is tiny compared to a regular 12-factor PaaS, which already implies easy but not necessarily cheap. Look at how Render started cheaper than Heroku and now basically matches Heroku pricing, or could even be more expensive in some scenarios, especially when you consider that Heroku apps feature a soft limit of 2TB of bandwidth vs 100GB on Render and Fly.io.
  And yes, it is not an easy business these days.
