Running Databases on Kubernetes
I don't think the upsides are worth all the work.
You can spend a lot of time getting databases and other stateful workloads to work -- mess around with StatefulSet and PVC on top of all the normal Kubernetes concepts, and what do you get in the end? Are you really better off than you would have been if you ran the database in EC2?
Plus, "herds not pets" kind of breaks down once you start using StatefulSets and PVCs. Those things exist to make Kubernetes more like a static environment for workloads that can't handle being run like ephemeral cattle. So why not just keep using your static environment?
If Kubernetes is the only workload management control plane you have, then I guess this makes sense. But if you are already able to deploy your databases with existing tools, and those existing tools don't really suck, it's probably not worth migrating. It would take a lot of time and introduce significant new risks and operational complexity without a compensating payoff.
yeah but if your org has orchestration tooling built around k8s, in a way it becomes much easier to provision a DB with k8s, set up the service, routing, networking, roles, etc. than it would be in Terraform. Especially if you have to repeat this process in like 20 envs (stage, prod) x multiple regions.
This sounds dangerously close to "yeah but if the only tool you know is a hammer..."
If (big if) your org's orchestration supports StatefulSets.
Where I was, the tooling was very focused on disposable API servers.
Can’t you just use this then: https://aws.amazon.com/blogs/containers/aws-controllers-for-...
If you want to use k8s as your dataplane, sure. Though I'd rather entrust that task to Crossplane.
There's always a point at which running things on Kubernetes becomes worth it. RDS is an expensive service, and if you want to enable every developer to boot up a development environment, or several even, it becomes prohibitively expensive to use one process for everything. This is also true about running any workloads on Kubernetes to begin with though, and many companies definitely would be better off with a service like ECS/Fargate/CloudRun/Fly.io. Especially if they don't need the flexibility to build their own addon metrics/secrets/logging stacks.
It's the same RDS though. These controllers enable you to manage the lifecycle of the exact same AWS services you get when using the AWS console or CLI. Kubernetes is the porcelain and AWS offerings are still the plumbing.
+1 Sometimes just because you can does not mean you should.
What if you have finance customers who don't like commingled data, and you want to sell them a service and tell them with a straight face that their tenanted database isn't one bad query from serving up their data to someone else?
You can still have separate ACLs, databases, tables and even row level access control even if you share database servers.
You can also have dedicated database servers per customer on bare metal or VMs. It's not always more work than asking your database team to learn Kubernetes...
Speaking as someone who has sold SaaS into banks numerous times, "no shared tenancy" is very frequently an absolute and non-negotiable requirement.
Fair enough; as 'twblalock says, it doesn't have to be more arduous to set up database servers per seat outside of k8s.
I simply wanted to highlight that there are many ways to skin a cat, and "no shared tenancy" is not in itself a valid argument for hosting your DBs in k8s, even though there may be other good reasons.
It's also more resource efficient, especially for non-production or non-critical workloads. VMs only come in discrete configurations and many times even the smallest one is too big, wasting a lot of resources. When you run thousands of instances, thanks to the magic of microservices, the costs add up.
Those are good arguments for ephemeral workloads but they don't make as much sense for databases.
Another turtle.
I've been quite happy with CloudNativePG on k8s. It was simple for me to set it up on a k8s cluster with one primary and two replicas, have another instance become primary if the primary box goes down, deal with connection pooling, and have backups go to a cloud object store. The alternative is dealing with all the replication manually, making sure that your leader election and failover work, making sure you can stand up new PG instances and get things replicated to the new instance, having a service that is checking the health of the database to trigger a failover, etc. It's certainly not impossible or anything like that, but CloudNativePG has been pretty easy. K8s isn't perfect or anything, but it's been a pretty nice experience for me.
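Roughly, the whole setup boils down to a single Cluster manifest along these lines (cluster name, storage size, bucket and the credentials Secret below are placeholders of mine, not anything mandated):

    apiVersion: postgresql.cnpg.io/v1
    kind: Cluster
    metadata:
      name: pg-example              # placeholder cluster name
    spec:
      instances: 3                  # one primary + two replicas; failover is automatic
      storage:
        size: 20Gi
      backup:
        barmanObjectStore:          # base backups and WAL archiving to object storage
          destinationPath: s3://my-backup-bucket/pg-example
          s3Credentials:
            accessKeyId:
              name: backup-creds    # assumes a pre-created Secret with these keys
              key: ACCESS_KEY_ID
            secretAccessKey:
              name: backup-creds
              key: ACCESS_SECRET_KEY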
I've tried other Postgres operators and been disappointed and it did require a little learning, but it's not like getting replication, Patroni, etcd, PGBouncer, HAProxy, and pgBackRest all running for a high-availability Postgres deployment is easy and wouldn't require learning.
As the author says, "[k8s's] operator model allows end users to programmatically manage their workloads by writing code against the core k8s APIs to automatically perform tasks that would previously have to be done manually." To me, that's the benefit. The operator can handle tasks like adding a replica or failing over the primary to one of the replicas. I could presumably do some of that with other tools on bare metal/VMs (I can always shell-script things), but I've had a good experience with CloudNativePG's operator. Likewise, as the author says, making day-2 operations easier is a big thing.
K8s does have some annoying amount of complexity, but it's been nice overall.
This is really the secret - once someone figures out how to tie all the different k8s concepts into a functioning system, you can just copy it and put it on your cluster and it will probably work. Trying to figure it all out the first time is the messy part. If there’s an operator or Helm chart or something that does what you want there’s no shame in using it!
That's kind of the issue with the db question - for your app logic, 'will probably work' is fine. Playing fast and loose with your database is less enticing - you more likely want to understand every bit of the stack between you and your db, or if not, at least have a support line to whinge at if you hit trouble.
EDB provides commercial support for this
Thanks for the info, I had not seen this one before.
That's just a really, really bad write-up of the real problems of running a database on k8s.
You need HA because k8s should already be running with automatic node upgrades.
You need a pod disruption budget to make sure the database keeps running and switches over when a node fails or gets upgraded.
You want to either heavily overprovision memory or look into KEP-2400 (node swap support) so you can fine-tune memory before k8s starts throwing your database out constantly.
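Concretely, the usual first step is giving the database container equal requests and limits so the pod lands in the Guaranteed QoS class and is the last candidate for eviction under node memory pressure - a sketch with placeholder numbers:

    # Container resources for the database pod (placeholder values).
    # requests == limits puts the pod in the Guaranteed QoS class,
    # which is evicted last under node memory pressure.
    resources:
      requests:
        cpu: "4"
        memory: 16Gi
      limits:
        cpu: "4"
        memory: 16Gi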
K8s is not a VM.
If you use k8s and still don't take care of application migration strategies you still don't understand what cloud native means.
There are still other things missing here but still...
Of course excluding hobby people playing with k8s.
Memory and node upgrades are the two biggest issues you will see that disrupt service.
Otherwise k8s is a dream come true.
I would still try to use a managed DB if it's critical.
Additional points: the Zalando Postgres operator is great and shows the real magic of k8s and operators.
Use a helm chart and just bring your own little database for dev, test, and e2e tests.
You can easily use autoscaling for node profiles. No noisy neighbors. If your db is too small for normal nodes you don't have a problem anyway.
> k8s should already be running with automatic node upgrades
This is difficult to impossible to do with databases; even if your database has a built-in recovery method for when a primary is taken offline, in such a way that allows for zero-downtime in theory, the reality is that such mechanisms depend on the secondary staying online until the failover mechanism is complete. If you turn over control of node upgrades to the cluster provider, the node under the secondary will get rebooted in the middle of the failover process, and you will get downtime at best, data loss at worst. What kubernetes teaches us is that databases aren't tied to the literal VM they're running on (which is now cattle), but rather on the availability of that node. If you run databases on kubernetes, you need to have a mechanism to slow down node upgrades.
Source: helped run hundreds of Elasticsearch and Kafka nodes on kubernetes in production at one point in my career
Online lossless zero downtime upgrades?
I've done it with Cassandra...and yeah Kafka can do it I've heard.
But those can be 30-hour operations even with your ducks in a row, and you'd better have backup strategies ready.
Fun story: Amazon said RDS would always be zero-downtime upgrades. But then came a major version upgrade and... surprise, it wasn't.
Add a new RDS replica, wait for it to sync, promote it to master? Zero downtime achieved
I didn't do the upgrade, and I'm not a Postgres/MySQL person, but the best they could do was a third-party tool that dropped it to a couple of minutes.
Your claim about needing the primary online to fail over to the secondary is untrue, at least not for all Postgres operators.
CloudNativePG rebuilds the secondary during failover from the WAL streamed to an S3 endpoint. No primary needed.
Kubernetes does have this capability - Pod Disruption Budgets. They're underutilized and under tested, but at least the default cluster autoscaler respects them and will avoid destroying nodes that would break that constraint.
They are pretty well tested as of today (now that multiple vendors respect it during node upgrade), but now they’re relatively under featured for the next set of problems:
1. No way to signal that the workload is ready to accept traffic but not ready for disruption (such as an async background refill)
2. No way to provide backpressure over a time window, vs just instantaneous (ie during a rolling node upgrade PDB doesn’t prevent you from being moved multiple times)
3. No way for an admin and workload owner to coordinate on the amount of disruption a particular disruption will cause and discriminate between light and heavy (restart on node, recreate on node, move across node)
But yes, everyone with a nominally HA workload should have a PDB.
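A minimal sketch of such a PDB for a three-member database (name and labels here are placeholders):

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: pg-pdb                  # placeholder name
    spec:
      maxUnavailable: 1             # never voluntarily evict more than one member at a time
      selector:
        matchLabels:
          app: pg                   # must match the database pods' labels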
> Use a helm chart and just bring your own little database for dev, test, and e2e tests.
dev, test, and e2e tests should be done against full-size db clones
> dev, test, and e2e tests should be done against full-size db clones
Real customer/sensitive data should not exist outside prod (and backup). So generally no, not full-size clones. I'd argue instrumentation in prod should give information on performance - for some tests/development you might need prod-size fake data.
Couldn't agree more. Having full-sized and fully specced dev and test DBs is wasteful and not realistic across several independent teams.
Monitoring prod closely and understanding what could constitute a costly workload/query which would in turn require a temporary test env with similar sized dataset is the correct approach.
This is too reactive an approach, isn't it?
Imagine a system that could do this in CI/CD pipelines:
1) First, run all the tests on a tiny DB as usual
2) Extract the queries
3) Run them against a full-size DB branch/thin clone (thin provisioning, CoW; PII is not there of course, wiped out for security/compliance) – auto-guessing params is the trickiest part, but assume it's solved
4) Collect all the details about performance, focusing on IO numbers (rows, blocks read/written)
5) If some queries are off – say, you forgot a LIMIT – post a warning, block the change, do not allow deploying it, and let the backend dev fix it.
This would be proactive. And it's becoming possible with modern tools.
> dev, test, and e2e tests should be done against full-size db clones
that's cute, but what is your "full-size"? I don't have 2 days to run a test, and I'm pretty sure every single compliance requirement we are following would get obliterated the second someone hears about us doing that
Agreed on compliance part – of course, in many cases (not in all though) PII must not be in dev/test envs.
Although, cannot agree with the former part. If your tests are running 2 days on a full size clone, and it's an OLTP case, what about users, do they suffer from long query duration too? It sounds like it's time to optimize queries and/or redesign test sets (or both).
If it's bad in testing, it will be bad in prod. That's the idea of testing.
Example: how do you check schema changes?
You can have a tool that keeps the structure of the data but anonymizes it; I've seen this a few times in healthcare-regulated systems.
I laugh each time I hear about this so-called anonymization process/data. Each time, there were ways to link the data back, or things were badly scrubbed.
You think I'm going to clone a multiple petabyte database just to run some tests?
not sure about petabytes (yet), but for dozens of TiB, we are fine – DB branching, thin clones, and CoW come to the rescue.
The key for me is the level of automation that you can reach at a reasonable "development cost". Let me elaborate.
K8s, if anything, is an API. An API that allows you to interact with compute, storage and networks in a way that is abstracted from the actual underlying infrastructure. This is incredibly powerful. You can, essentially, code and automate all your infrastructure.
But this goes beyond deployment, something you could achieve (more or less) with tools like Terraform or Pulumi. Enter "Day 2 operations".
Day 2 operations are essential for any database. And cloud services have done a good job at automating them. Speaking of Postgres, my daily job, things like HA, backups but also minor and major version upgrades are table stakes day 2 operations.
If you want to build these day 2 operations in the cloud (say on VMs), even though you have APIs to do so, a) they don't implement a pattern like Kubernetes' reconciliation cycle; and b) you have a distinct API per cloud. K8s solves both problems, making it way "cheaper" to build such automation. On K8s, a given operator can code these day 2 operations against K8s APIs. Therefore, if you want to build such automation, either you are a cloud provider (and potentially do this only for your own cloud) or you do it on Kubernetes.
This is so true that existing operators have already gone beyond what DBaaS offerings do. Speaking of StackGres [0] (disclaimer: founder), we have implemented day 2 operations (other than the "table stakes" ones that I mentioned before) that no other DBaaS offers as of today, such as vacuums, repacks and even benchmarks (and more day 2 operations will be developed). See [1] for the CRD specs of SGDbOps, our "Day 2 operations", if you are interested.
[0] https://stackgres.io [1] https://stackgres.io/doc/latest/reference/crd/sgdbops/
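As a rough sketch (simplified; see the SGDbOps reference above for the full spec, and exact field names may vary by version), kicking off a vacuum as a day 2 operation looks something like this:

    apiVersion: stackgres.io/v1
    kind: SGDbOps
    metadata:
      name: vacuum-example          # placeholder name
    spec:
      sgCluster: my-cluster         # the target SGCluster (placeholder)
      op: vacuum                    # other ops include repack, benchmark, restarts, upgrades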
This article does a great job describing the investment required to pull this off. At HubSpot, my team is running a large Vitess/MySQL deployment (500+ distinct databases, some sharded, multi-region) atop k8s today and had to learn a lot of those same lessons and primitives. We opted to write our own operator(s) to do it. In the end, the investment has paid off in terms of being able to build self-service functionality for the rest of the business and build the kinds of tools and workflows that allow us to support it with a relatively small team. The value is in the operator pattern itself and being able to manipulate things on a common control plane. Compared to the alternative of managing this with Terraform and Puppet/Ansible/Chef directly on EC2, which I've also done before, it's a better experience and much more maintainable even at the fixed expense of additional training and tooling.
I won't disagree with others that RDS is probably worth it until you need something very specific or have reached a certain scale.
Happy to share tips or pointers for anyone going down this path, specifically with MySQL or database workloads in general.
The first question that comes to my mind is: what are the performance implications of running a database like you do inside k8s vs EC2 vs bare metal? And how did you solve multi-tenancy? Does the operator simply handle the lifecycle of a database per customer, or is it something more complicated?
PS: And how do you deal with migrations? PS: Forgive me if I'm asking for too much!
No worries, happy to share more details. For the databases where performance is a concern, we use constraints and reservation requests to all but guarantee it will be the only tenant on the node, and we actively monitor CPU throttling and will autoscale in cases where it is sustained for a long period of time. We're actually achieving better overall utilization with this setup vs bare metal and aren't dealing with a lot of issues with resource contention.
There is a main operator responsible for all the databases. It handles configuration changes, provisioning pods and slowly rolling out changes. In kube we model this with a custom resource we've defined called a KeyspaceShard which represents a named set of database instances that should participate in replication together. Once provisioned, the pods know how to hook up and detach from Vitess without requiring further involvement from the operator. Vitess handles backups and maintains the replication topology. "Complicated" is an apt description of what it does but not "complex". Evicting a database pod and letting the system reschedule and converge is a routine operation that doesn't cause much concern.
Migrations are done with gh-ost, which has its own custom operator that manages the lifecycle of the migration and ties into self service tooling we provide that is integrated with our build and deploy system.
One more question - did you have a chance to see where ScyllaDB is going?
At their latest conference their CEO said it's all about serverless and virtualization, with Kubernetes doing all the work [1] - "we are doing it automatically for you with our management which is based on using multi-tenant kubernetes deployment". Even more surprising is that instead of using NVMes they want to use S3 for backend storage and NVMe is only a cache [2] :o I am not a database expert, but this is very interesting.
Thanks for answering!
I guess there must be a use case I'm missing here, but RDS is working so well for me, it's hard to imagine why I would not shift most of the operational concerns to this competent vendor.
The only thing I can think of is cost. My usage probably isn't high enough where there is any financial benefit to an alternative... but if it was, maybe I'd be considering this.
I mean, yeah, cost is kind of the problem with AWS, especially for large amounts of data. Do your own cost/benefit of course, but for some of us it's a non-starter.
If you have a small project and you only need one or two RDS databases in production, stick with RDS. The cost isn't that high and you save a ton in aggravation. Yes you need to set up separate monitoring, firewalling, etc. but it's really not a big deal.
When you start to have dozens or hundreds of databases in production, and developers asking "I need Postgres in production, why can't I just click a button and get a Postgres instance for my service in production?" then scaling the monitoring and firewalling gets a little more complicated. Hooking into standard Kubernetes monitoring and service meshes can really help to simplify things.
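For illustration, once an operator is installed, that "button" can be as small as applying a manifest roughly like this one (sketched against the Zalando operator's postgresql CRD; team, names, sizes and version are placeholders):

    apiVersion: acid.zalan.do/v1
    kind: postgresql
    metadata:
      name: acid-orders-db          # placeholder; Zalando convention prefixes the team name
    spec:
      teamId: acid
      numberOfInstances: 2          # one primary + one replica
      volume:
        size: 10Gi
      postgresql:
        version: "15"
      users:
        orders_owner: []            # role to create, no extra privileges
      databases:
        orders: orders_owner        # database name: owning role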
Hi, author here! Over the past 6 months, I've been building a hosted service for a database on top of k8s at QuestDB, and wanted to share some of my thoughts on the topic. I was inspired by the recent twitter discussion led by Kelsey Hightower a few weeks ago. Hope you find it interesting!
I used to work for an org that deployed 3rd party legaltech "apps" on kubernetes which had all batteries included - Postgres, RabbitMQ, Redis, you name it, I have seen it. Running StatefulSets, even with the best operator there, with a team of 4 is nothing short of a nightmare. Couple this with the stability of Rook Ceph.
In 2019, every operator had crazy bugs, and we inherited all of them. You have to solve not just database-level errors but also errors popping up from the operators. If you can avoid databases on kubernetes, you should just do it.
One of the big problems with Kubernetes in general, especially back in 2019, is the alpha quality of almost everything in the ecosystem. Especially service meshes.
It seems to get worse the further down the stack you go. I’ve seen tons of problems with operators, monitoring tools, and CNIs.
It’s somewhat better now but there is still a lot of stuff you can’t depend on. The CNCF seems to endorse pretty much anything even if it’s crap.
I've recently worked with putting Postgres into Kubernetes using the Zalando operator. The impression has been such a mixed bag that it looks like we need to start over with some other operator. When we run into problems, the documentation, error messages and configuration structure are quite cryptic.
Does anyone have any specific recommendations on what to use (like which operator) when setting up a postgres cluster on k8s, specifically for standby replication?
I would look at https://kubedb.com/ - operators are a mixed bag - but a bad operator can be a painful intro to K8s, that's for sure.
What type of issues did you run into?
We've been running Zalando Postgres Operator for all our prod and dev clusters (around 100 in total) and couldn't be happier.
My impression is that when it works it works well but when it doesn't it doesn't help you that much. We have had two main issues:
- Can't set up two separate clusters in the same kubernetes instance because some cluster specific configuration is inexplicably set globally in the operator.
- Documentation and error messages are cryptic. Have to do a lot of trial and error to compensate for that. Maybe the issue here is a lack of experience with the stack used. Like Spilo and Patroni.
the last time I gave the Postgres operator space a serious look was about a year ago, and at the time the Zalando operator was far and away the most feature complete and mature.
We had a couple unusual requirements that the operator wasn't really suited for, so we ultimately ended up writing our own helm chart and forgoing the operator route altogether
Not to diminish the product that QuestDB is working on, but another solution that works very well with Kubernetes is Vitess. Vitess is basically sharded MySQL, but it automatically manages this very well and has built in kubernetes support so it really handles the "pets to cattle" thing well.
> "pets to cattle"
That phrase is inaccurate. With the cloud and K8s, the pets move from being software that is tightly tied to the hardware to being a collection of configurations and software that are tightly tied to each other.
We just make the actual physical hardware anonymous. But from the perspective of the actual stack, there is still a server with its cpus, filesystem, i/o and everything.
"Pets that you can carry" is more like it.
I really like this analogy
> K8s has an extensible Operator pattern that you can use to manage your own Custom Resources (CRs) by writing and deploying a controller
I have seen it fail way too many times. Inspecting a failing deployment that now has some magic Go code someone wrote running on this cluster. I can see using the basic kube building blocks: deployments, pods, config maps, etc.; there are enough guides and tools to help you out. As soon as you start writing code that runs in there, you're now dealing with two problems: your actual thing you're deploying, and now the operator.
Well, and then you need a mesh, and a way to manage certificates, and if it's a database, a way to manage all the volumes. Everything looks good at the architect level - all the boxes and arrows line up - but when it breaks in production it's a nightmare to debug.
StatefulSet and PVCs aren’t sufficient to fully handle all the likely resilience challenges of running a database cluster on K8S. There needs to be some rethinking on how StatefulSet works to make it more appropriate to this use case, such as allowing Pods to be started out of order when recovering from failures.
I worked in this problem space extensively until 2020, and I think that there are paths forward but they require changes in K8S that none of the folks involved seem motivated to make. Realistically to make databases in K8S work well today you need a database built for K8S rather than one adapted for K8S.
The building blocks present today are not fundamentally capable of building a positive UX for adapting existing databases to K8S, but this is something that is worth making possible and I hope the community gets there some day.
Re out of order:
Is https://kubernetes.io/docs/concepts/workloads/controllers/st... unsuitable for that?
Unfortunately, while rolling updates account for some scenarios, they are not sufficient for handling out of order restarts where the order cannot be pre-determined. There’s probably some hack you could build with partitioning to mostly address the cases I am thinking of, but it isn’t elegant or guaranteed correct.
This will be a problem for any database where clustering is synchronous and a specific primary node must start first on a full cluster restart. There are other out of band hacks you can do with reassigning PVCs, but it’s never elegant in the current primitives provided.
During my work in this problem space I became convinced that primitives for stateful applications in K8S were built specifically without considering databases as a valid use case. Everything else is just hacks after the fact to make it “work”.
Since I helped design them, I take some issue with that :). Certainly we never expected they would completely solve problems for the database, but they were definitely intended to provide guarantees that simplify normal consensus operations and prevent accidental confusion with non perfect databases.
If a specific primary must start first, that’s partially what ordinals were intended to allow (0 is your primary, the others are always 1-N, and kube is responsible for ensuring the primary is never reassigned). I’d love to take feedback about places where the primitives are unusable, or ways they can be improved, because there are always new tools to add.
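As a concrete sketch of that convention (image, names and sizes are illustrative), a StatefulSet like the following gives stable identities pg-0, pg-1, pg-2, each with its own PVC that follows it across nodes, and pg-0 can be treated as the bootstrap primary:

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: pg                           # pods become pg-0, pg-1, pg-2
    spec:
      serviceName: pg-headless           # assumes a matching headless Service for stable DNS
      replicas: 3
      podManagementPolicy: OrderedReady  # the default: pg-0 is created first, removed last on scale-down
      selector:
        matchLabels:
          app: pg
      template:
        metadata:
          labels:
            app: pg
        spec:
          containers:
            - name: postgres
              image: postgres:15         # placeholder image
              volumeMounts:
                - name: data
                  mountPath: /var/lib/postgresql/data
      volumeClaimTemplates:
        - metadata:
            name: data
          spec:
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 20Gi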
I am not working in this problem space any longer, but it’s likely we’ve crossed paths. I have previously presented some suggestions to the Storage SIG on this topic. Feel free to reach out to me with the info in my profile and I can get back to you with a more detailed write up on the specific challenges that I would like to see addressed in StatefulSet, however I am currently traveling so my response will be delayed.
>"I have previously presented some suggestions to the Storage SIG on this topic."
Is there any chance you might be able to provide a link to these? I would be curious to take a look.
You are talking about things like the "All Nodes Go Down Without a Proper Shutdown Procedure" example in https://galeracluster.com/library/documentation/crash-recove... right?
To handle this case, some teams may have a manual runbook, some teams may have some automation with Ansible, some teams may have nothing. So, if someone can come up with some hacks and package them into a k8s operator, it is still a win. These seem to be the best primitives we have at the moment.
That is exactly one of the scenarios I am thinking about. Yes, there are hacks with PVCs that make recovering from this possible today with StatefulSet.
In that scenario it looks like members must coordinate to identify the highest committed transaction (identifying the list of valid members) and then bootstrap from that member?
Stateful Sets were designed to standardize two hard problems: being able to identify all the valid members (pods identified by number that are running at most once on any node) and give admins the button to decide a member was never coming back (the force delete pod / force delete pv action). That was true black magic before - everyone did it their own way. So we worked with the ecosystem to enable vendors/communities/individuals to map those primitives into specific solutions, but did it somewhat deliberately as “we have to build this together”.
What I think the gap has been is that there is significant friction in between the three realms of expertise - knowing what kube is providing, knowing how to map that to a specific problem like translating the Galera runbook into operator/script logic, and then communicating that to the teams that will be accountable for reacting. Vendors have incentives to make you pay for that expertise (or may not have it), large organizations hire people to provide it (most large db on kube deployments are also large tech companies), and in between you have a lot of uncertainty and knowledge gaps that doesn’t necessarily transfer, and that is what drives “Kube isn’t great for stateful”.
It’s ironic to me because StatefulSets were intended to take advantage of those incentives to help the ecosystem scale, and the result is “worse is better” in that many more people can do state on HA DBs than were successful on VMs or metal, but it doesn’t mean they’re completely successful and when people hit the rough edges it hurts. We can do better (that’s partially my day job), but there is a lot of pain people have taken so far that was probably unnecessary. You should use managed DBs if you can - and when you don’t, Kube should be the best alternative it can be (which isn’t far - many DB SaaS uses some kube), and that’s what we need to focus on.
>"I worked in this problem space extensively until 2020, and I think that there are paths forward but they require changes in K8S that none of the folks involved seem motivated to make."
Can you say what you see as those possible paths forward and what changes they would require?
As a relative newcomer to k8s I was a bit surprised at the lack of backup tools available, coming from the world of on-prem Veeam which had more features than I knew what to do with. In my current role we had to find a way to back up our Postgres DBs running on k8s. We started using Kanister to actually take the backups but found there wasn't much around to actually manage the backups' lifecycle. I ended up writing Taweret (https://github.com/swissDataScienceCenter/taweret), a small tool which just ends up interacting with the Kanister CRDs to delete backups we no longer require based on a defined backups strategy.
But that's what k8s is. It's not a tool that does a thing, but rather a set of APIs and patterns that let you glue together many tools that will let you do a thing (for better or worse).
Sure, I meant there wasn't really much around which ran on k8s to manage backups.
We ran Zalando Operator for Postgres in k8s for a year, until finally succumbing to its technical debt that leaks out from every bit of its software being.
After switching to the Crunchy Data pg operator v5 on k8s, we've had close to zero problems - one or two times a year the log shipping / HA replication fails and we have to restart it, but it's really neat! I can *warmly* recommend it; it really is CloudSQL in K8S.
kuberDBs seems like an unnecessary complication