Managing Machines at Spotify
labs.spotify.com

What happened to your DNS data? Did you switch to dynamic DNS based on database data? You talk about how much of a burden the manual DNS curation was, but then you don't specify how you actually solved it with "automation." Is it all dynamic? Does everything use SRV records with TTLs, or are records added and removed?
Sorry for so many questions, but you made a big deal about how manual "DNS curation" was a bad thing, then glossed over the solution.
Manual DNS is a PITA, but there are a lot of providers that make it scriptable with an API. For example, I use LogicBoxes.
The greatest public DNS feature since sliced bread is Joyent's new CNS. Tag instances and they are available instantly through a CNAME. It's like the public equivalent of running Hashicorp's Consul. Freaking fantastic and makes me really glad I've stuck with the JPC for my infrastructure.
Look at the section "DNS Pushes".
Post author here. A bunch of stuff was glossed over as the post was more focused on the stack's history and evolution than specific technical details.
Ideally we hope to provide some followup posts that go deeper into technical detail about key pieces of the stack (DNS, initramfs framework, job broker, GCP usage, etc).
It's my hunch that they fixed "curating DNS by hand" with another method of service discovery. If you're just using DNS to name internal servers, big whoop. But if you're using those DNS names for service discovery, you've got a big problem. And putting those things in a DB doesn't magically solve it.
It sounds like they went to another method for service discovery, then created DNS entries from a DB, either dynamically by registering them in a zone or via a trigger fired on DB update. Either way, it sounds like they moved the scary stuff to another level/service in the stack.
Also, linters exist for DNS and can be automated even with manual edits. Jenkins + Gerrit make easy work of this.
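Purely as an illustration of that kind of pipeline (nothing here comes from Spotify's post; the inventory schema, zone name and file layout are assumptions), a DB-driven generator plus an automated lint step can be very small:

    # Hypothetical sketch: render A records from a host inventory DB and
    # validate the assembled zone with BIND's named-checkzone, e.g. as a
    # Jenkins/Gerrit verification job. Schema and paths are made up.
    import sqlite3
    import subprocess
    import sys

    ORIGIN = "example.internal."
    ZONE_FILE = "zones/db.example.internal"
    FRAGMENT = "zones/hosts.generated"

    def render_zone_fragment(db_path, ttl=60):
        # Short TTLs keep records effectively dynamic without dynamic DNS.
        conn = sqlite3.connect(db_path)
        rows = conn.execute("SELECT name, ip FROM hosts ORDER BY name").fetchall()
        conn.close()
        return "".join(f"{name}.{ORIGIN} {ttl} IN A {ip}\n" for name, ip in rows)

    def lint_zone():
        # named-checkzone exits non-zero on syntax/consistency errors,
        # which is enough to fail the CI job before the change merges.
        return subprocess.run(
            ["named-checkzone", ORIGIN.rstrip("."), ZONE_FILE]
        ).returncode

    if __name__ == "__main__":
        # Assumes ZONE_FILE carries the SOA/NS records and $INCLUDEs FRAGMENT.
        with open(FRAGMENT, "w") as f:
            f.write(render_zone_fragment("inventory.db"))
        sys.exit(lint_zone())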
> Spotify has historically opted to run our core infrastructure on our own private fleet of physical servers (aka machines) rather than leveraging a public cloud
One has to wonder why they would opt for this. The entire story is a textbook example of where using a cloud would have been immensely better. Instead of leveraging mature public cloud offerings, they chose a path that evidently required huge amounts of developer time and caused a tremendous amount of pain and wasted time for downstream developers, only to scrap it in the end when they finally realized there's no point in trying to re-implement AWS/GCE. Think clouds are expensive? I'd love to quantify the developer-hours wasted by the decision to run on physical servers and see how they would stack up against even a very expensive AWS bill.
Depending on their workloads, running their own datacenter(s) might save them millions a month. I know it does in my own case. That said you give up flexibility for the $ savings. They may think the additional flexibility is worth the cost differential at this point.
Would be good to hear their perspective on this.
I believe we expect moving to the cloud to be more expensive than running our own DCs, as you suggest, but I don't believe that estimate takes any 'wasted developer time' into account.
I believe we started building this platform when AWS was very new, and we hadn't seen a compelling reason to transition to the cloud until now. There are a couple of posts with more details behind our decision to go to GCP, but primarily it was to leverage their data tooling.
Network costs. Bandwidth is where the cloud stops making sense. Even Netflix hypes up their cloud use, but they don't serve videos from there.
> While we heavily utilise Helios for container-based continuous integration and deployment (CI/CD) each machine typically has a single role – i.e. most machines run a single instance of a microservice.
It's strange to me that this is still so common. My theory is that the "one machine, one port" philosophy is still built into a lot of software (monitoring, the ELB, etc.). Another theory is that this is simply the philosophy we've always known.
Take a look at Kubernetes. Everything is accessible via localhost:<some port>. That breaks most home-built and enterprise orchestration and monitoring tools spectacularly, even though it's a much simpler model (everything is a port, not an IP/port combo).
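In that model a co-located agent needs nothing but a port list rather than a service-discovery lookup. A minimal sketch (the port list and /health path are made-up examples, not Kubernetes or Spotify specifics):

    # Illustrative only: health-check everything colocated on this machine/pod
    # purely by port. The ports and the /health endpoint are assumptions.
    import urllib.request

    LOCAL_PORTS = [8080, 8081, 9090]

    def check_local_services(timeout=1.0):
        statuses = {}
        for port in LOCAL_PORTS:
            url = f"http://localhost:{port}/health"
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    statuses[port] = resp.status
            except OSError as exc:  # URLError/HTTPError are OSError subclasses
                statuses[port] = str(exc)
        return statuses

    if __name__ == "__main__":
        print(check_local_services())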
Density is much easier to accomplish on larger machines with more cores, which are elastic in the face of bursty residents. They are also generally cheaper per unit of compute/memory.
All of those things are doing gymnastics with ports because nobody can be bothered to ship IPv6. If you can bring v6 up, you can assign every process an IP and start assuming ports (80 is the service via HTTP, 443 via TLS, 8080 via HTTP/2 gRPC, 9000 for monitoring, and so on; a sketch of that convention follows at the end of this comment). It's way cleaner than all the workarounds for ports in the current state of the art, and it means you can Just Use DNS in a number of scenarios. There are whole systems built around ports in pretty much every orchestration system, and it's such an antipattern. Half of Docker's networking stack, a bunch of Kubernetes logic, Flannel: all of it becomes unnecessary, because those pieces are attempts to jam the right model into the limited IP and address-table space of the infrastructure.
IPv6 is practically built for containers, and, to Kubernetes's credit, they architected with that in mind. (Learned from BNS.) Weirdly, what I'm saying here was the original idea behind ports in the first place. There just aren't enough of them, particularly when half your space is shared with client sockets.
I want a world where v4 is pretty much just my control plane into the v6 cluster, since I'll die before IPv4 does. Google, and far more importantly Amazon, need to come up with a v6 story in their cloud offerings already. AWS has had a decade. This isn't just blind advocacy any more; the orchestration and software side is starting to rebuild entire parts of the OSI stack because the network side of our industry is stuck, with no sign of moving, no matter how dire the v4 situation.
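To make the per-process-IP convention concrete (the service name, the port map and the AAAA-only lookup here are purely illustrative assumptions, not anything from the post or a specific orchestrator):

    # Sketch of "one IPv6 address per process, well-known ports": discovery is
    # just an AAAA lookup and the port is implied by the protocol.
    import socket

    WELL_KNOWN = {"http": 80, "tls": 443, "grpc": 8080, "metrics": 9000}

    def connect(service_name, proto="metrics", timeout=2.0):
        port = WELL_KNOWN[proto]
        # AAAA-only resolution: each process has its own v6 address, so no
        # port remapping or host/port discovery layer is needed.
        infos = socket.getaddrinfo(service_name, port,
                                   socket.AF_INET6, socket.SOCK_STREAM)
        family, socktype, proto_num, _, sockaddr = infos[0]
        sock = socket.socket(family, socktype, proto_num)
        sock.settimeout(timeout)
        sock.connect(sockaddr)
        return sock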
It's strange (or perhaps rather unfortunate) to me as an SRE at Spotify as well. Helios is in many ways similar to Kube, so it was our hope that eventually it would lead us to scheduling multiple service containers per physical machine. We certainly have the service discovery framework to support that model.
However given Spotify's business position our priority has yet to shift from providing engineers compute capacity as fast as possible to optimising our usage of said compute capacity. It's all now somewhat of a moot point as we move away from our own hardware into Google's cloud.
But less predictable in terms of overall performance due to all the shared components.
It's also harder to separate two or more processes that 'grew up' together in the same container/machine/vm.
Also, this whole solution sounds like a Linux clone of Microsoft's Automated Deployment Services -- way ahead of its time and under-appreciated in 2004.
One more: How is relying on a random Python library from OpenStack better than relying on a UNIX command line tool that's used by 100x as many people?
"We also assigned each server a static unique identifier in the form of a woman’s name – a shrinking namespace with thousands of servers."
Let me fix that for you; stop gendering your servers.
If they'd instead said "We assigned each server a static ID in the form of a female computer scientist's name", we'd be here praising them for their forward-thinking inclusion. Maybe let's not see everything as offense-worthy?
FWIW, a team of largely men using women's names for servers always felt a bit icky to me personally, for reasons I couldn't quite articulate. We now use ungendered serial numbers.