Zero downtime migrations at petabyte scale (2024)

112 points by Ozzie_osman 3 months ago · 43 comments

While this is cool and I dig it, I'm really, really thankful for maintenance windows at the current job. In the real world, 99.9% of systems aren't used 24/7/365. Just do the cutoff when everyone is asleep. Then restart everything to be sure.

embedding-shape 2 months ago

> In the real world, 99.9% of systems aren't used 24/7/365. Just do the cutoff when everyone is asleep
"Real world" being something that covers max what, 10 hours of a day? What about things that are used by the entire world? I think there are more of those sorts of services than you realize, underpinning the entire internet and the web and serving a global user base.
- MagicMoonlight 2 months ago
  
  Almost nothing in the world is used globally. You have a handful of things like YouTube and Facebook and the Visa network.
  Nobody is using slopwork’s new CrudX at a global scale.
  - aloha2436 2 months ago
    
    The Visa network is the frontend to a truly staggering number of issuers who also want to maintain a similar level of uptime to support their cardholders wherever they are in the world.
  - sanswork 2 months ago
    
    Basically every large multinational corporation will have a bunch of systems that are used globally. Most advertising companies work on global traffic patterns.
    
    citrin_ru 2 months ago
    
    A large multinational corporation can go a long way by splitting their IT infra into multiple regions and doing maintenance in different regions at different times.
    
    agnivade 2 months ago
    
    This idea sounds nice, but there's a high maintenance cost to this.
    - How will you maintain multiple deployments across multiple regions in the world? Backups, security patches will start to take a toll.
    - How granular is the right split? Not every country has a cloud provider. Then you need to start thinking about regions and office timings and then it starts to get all blurry.
    
    mystifyingpoi 2 months ago
    
    > How will you maintain multiple deployments across multiple regions in the world? Backups, security patches will start to take a toll
    The same way as always - by automating the crap out of it.
    > How granular is the right split? Not every country has a cloud provider.
    Doesn't have to be one deployment for one country, does it? Having like 3 or 4 deployments across the globe already gives you (at least) 3-4 hours of inactivity window, let's say 1 am - 4 am or something.
    
    mystifyingpoi 2 months ago
    
    Exactly, that's how you do it. Having one system for the whole world is risky.
  - lll-o-lll 2 months ago
    
    >> Almost nothing in the world is used globally.
    ??? I’ve worked in this software game for over 20 years. I’m yet to experience this “no need to worry about the globe”. I think you have the fallacy of thinking local experience is general experience.
    There is a very large amount of b2b software out there that is serving multi-nationals of all types. Perhaps it is surprising, but there’s a large number of software solutions that aren’t that big, but still have customers in all the 4 corners.
- mystifyingpoi 2 months ago
  
  > What about things that are used by the entire world?
  Well, for the remaining 0.1% - go ahead and use the fancy hot replication thingy. Sometimes there is no choice, and that's fine. Although that might mean that the system architecture is busted.
ayuhito 2 months ago

> Just do the cutoff when everyone is asleep.
In this age, many smaller companies serve customers across the globe. There is no common “asleep”.

Thaxll 2 months ago

We need more details on 6. This is the hard part: you swap connections from A to B, but if B is not synced properly and you write to it, the two start to diverge and there is no way back.

Say B is slightly out of date (replication-wise) and the service modifies something on B; then a change arrives from A that modifies the same data you just wrote.

How do you ensure that B is up to date without stopping writes to A (no downtime)?
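
One common answer to the question above, sketched as a toy model: briefly hold incoming writes in a buffer instead of rejecting them, let B apply everything A has logged, then send the held writes to B. This only illustrates the general idea. The `Db` and `Router` classes and every name in them are hypothetical, and this is not a description of how Vitess implements it.

```python
from collections import deque


class Db:
    """Toy database: a dict plus an append-only change log."""

    def __init__(self):
        self.rows = {}
        self.log = []  # ordered (key, value) changes

    def write(self, key, value):
        self.rows[key] = value
        self.log.append((key, value))


class Router:
    """Hypothetical proxy sitting between the application and the databases."""

    def __init__(self, source, target):
        self.source, self.target = source, target
        self.primary = source
        self.applied = 0          # source log entries already applied to target
        self.buffering = False
        self.held = deque()       # writes held during the cutover

    def write(self, key, value):
        if self.buffering:
            self.held.append((key, value))   # held briefly, not rejected
        else:
            self.primary.write(key, value)

    def replicate(self, max_entries=None):
        """Apply pending source log entries to the target, in order."""
        end = len(self.source.log)
        if max_entries is not None:
            end = min(end, self.applied + max_entries)
        for key, value in self.source.log[self.applied:end]:
            self.target.rows[key] = value
        self.applied = end

    def begin_cutover(self):
        self.buffering = True     # the source log stops growing from here

    def finish_cutover(self):
        self.replicate()          # drain the now-finite backlog
        assert self.applied == len(self.source.log)
        self.primary = self.target
        self.buffering = False
        while self.held:
            self.target.write(*self.held.popleft())


a, b = Db(), Db()
router = Router(a, b)
router.write("x", 1)
router.write("y", 2)
router.replicate(max_entries=1)   # B lags: it has x but not y yet
router.begin_cutover()
router.write("x", 3)              # arrives mid-cutover, gets buffered
router.finish_cutover()
router.write("z", 4)              # goes straight to B
print(b.rows)
```

The point of the sketch is that writes to A stop only for as long as it takes to drain the remaining backlog, and callers see added latency rather than errors. Long-running transactions that are in flight at the cutover are not modeled here.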

mattlord 2 months ago

It's open source. You can get as many details as you like :)
https://github.com/vitessio/vitess
https://vitess.io/docs/reference/vreplication/
https://vitess.io/docs/reference/features/vtgate-buffering/
Kaliboy 2 months ago

Not sure how they do it, but I would do it like so:
Have old database be master. Let new be a slave. Load in latest db dump, may take as long as it wants.
Then start replication and catch up on the delay.
You would need, depending on the db type, a load balancer/failover manager. PgBouncer and Pgpool-II come to mind, but MySQL has some as well. Let that connect to the master and slave, connect the application to the database through that layer.
Then trigger a failover. That should be it.
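
A minimal sketch of the dump-then-replicate idea above, assuming the dump records the change-log position it corresponds to so that replication can start from exactly that point. `Primary`, `Replica`, and the in-memory log are hypothetical stand-ins, not any particular database's replication API.

```python
import copy


class Primary:
    """Toy primary: current rows plus an ordered change log."""

    def __init__(self):
        self.rows = {}
        self.log = []  # (key, value); a value of None means delete

    def apply(self, key, value):
        if value is None:
            self.rows.pop(key, None)
        else:
            self.rows[key] = value
        self.log.append((key, value))

    def snapshot(self):
        """Return a copy of the rows and the log position that copy matches."""
        return copy.deepcopy(self.rows), len(self.log)


class Replica:
    def __init__(self, rows, position):
        self.rows = rows
        self.position = position  # next log entry to apply

    def catch_up(self, primary):
        """Replay every change logged after this replica's position."""
        for key, value in primary.log[self.position:]:
            if value is None:
                self.rows.pop(key, None)
            else:
                self.rows[key] = value
        self.position = len(primary.log)


old = Primary()
old.apply("a", 1)
old.apply("b", 2)

dump, pos = old.snapshot()        # the dump remembers its log position

# Writes keep landing on the old primary while the dump loads.
old.apply("a", 10)
old.apply("b", None)
old.apply("c", 3)

new = Replica(dump, pos)          # load the dump, however long it takes
new.catch_up(old)                 # then replay everything after `pos`
print(new.rows == old.rows)       # True
```

The changes made while the dump loads are not lost because they stay in the primary's log until the replica replays them. In practice that means the log has to be retained for at least as long as the load takes.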
- Snelius 2 months ago
  
  > Load in latest db dump, may take as long as it wants.
  400TB, that's about a week or more?
  > Then start replication and catch up on the delay.
  Then you have about 1TB of changes accumulated during the delay. That means a few more days of syncing while changes keep coming.
  They said "current requests are buffered", which is impossible, especially for long-running (optionally distributed) transactions that are in progress (they can take hours, or days for analytics).
  Overall this article is BS, or some super custom case that is irrelevant for common systems. You can't migrate w/o downtime, it's physically impossible.
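
Back-of-the-envelope numbers for the estimates above. The 400 TB size, the one-week copy, and the 1 TB backlog come from the comment. The write and apply rates are made-up assumptions, included only to show when catch-up converges.

```python
TB = 10**12  # decimal terabyte, in bytes
SECONDS_PER_DAY = 86_400


def required_throughput(size_bytes, days):
    """Sustained bytes/second needed to copy size_bytes in the given days."""
    return size_bytes / (days * SECONDS_PER_DAY)


def catch_up_days(backlog_bytes, apply_rate, write_rate):
    """Days to drain a replication backlog while new writes keep arriving.

    Returns None when the target applies changes no faster than the source
    produces them, in which case the backlog never shrinks.
    """
    if apply_rate <= write_rate:
        return None
    return backlog_bytes / (apply_rate - write_rate) / SECONDS_PER_DAY


copy_rate = required_throughput(400 * TB, days=7)
print(f"{copy_rate / 10**6:.0f} MB/s sustained to copy 400 TB in 7 days")

# Assumed rates, for illustration only.
write_rate = 2 * 10**6      # 2 MB/s of new changes on the source
apply_rate = 20 * 10**6     # 20 MB/s applied on the target
days = catch_up_days(1 * TB, apply_rate, write_rate)
print(f"{days:.2f} days to drain a 1 TB backlog")
```

Whether the backlog ever drains depends only on the target applying changes faster than the source generates them. The size of the initial copy affects how large the backlog gets, not whether it can be closed.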
  - freakynit 2 months ago
    
    Feels the same to me as well.
    "Take snapshot and begin streaming replication"... like to where? The snapshot isn't even prepared fully yet and definitely hasn't reached the target. Where are you dumping/keeping those replication logs for the time being?
    Secondly, how are you managing database state changes due to realtime update queries? They are definitely going in source table at this point.
    I don't get this. I'm still stuck on point 1... have read it twice already.
    
    mattlord 2 months ago
    
    It's open source. If you want to understand exactly how, you certainly can! :-)
    https://github.com/vitessio/vitess
    https://vitess.io/docs/reference/vreplication/
    https://vitess.io/docs/reference/features/vtgate-buffering/
    
    Snelius 2 months ago
    
    He can't. It's not a reference, just a bunch of CLI examples. Please learn what a reference is. Even the docs are BS, wonderful product. Overall this article is typical advertising and clickbait.
    
    joshuamorton 2 months ago
    
    The code is open source though, you can read it. The cli examples point you towards the relevant bits of the actual database code to read.
    For my own sake, I'm not sure what is so surprising here. "Turn up a hot second replica and fail over to it intentionally behind a global load balancer." Is pretty well trodden ground.
    
    Snelius 2 months ago
    
    > The code is open source though, you can read it
    Thank you! :D
    > Is pretty well trodden ground.
    YES!! But the article claims a 400TB+ migration w/o downtime. This is impossible. That's why it looks like clickbait and advertising for a product.
    
    joshuamorton 2 months ago
    
    I will simply point out that I'm aware of larger zero down-time migrations, like this one: https://www.youtube.com/watch?v=ih97gwNmkRA
    
    Snelius 2 months ago
    
    Thank you for the link, but it's not the same case ;) Google used storage switching, which migrates in mixed mode, i.e. on demand: data is migrated when a user accesses it. The API had a compatibility layer to read/write from/to both storage systems (I built this kind of migration mechanism about a decade ago). And Google spent about 8 years on the migration, which is OK. But the article is about database migration, which can be a periodic process (critical schema changes, for example), and that's what they describe to us: take a snapshot, then race against the changes that pile up while the snapshot is taken, etc. I think we can leave it here. It's not a zero downtime solution because that doesn't exist.
  - mattlord 2 months ago
    
    So you don't understand how something works. That's fine. But to then say the article and/or tech are BS is... a choice.
    This work has been and is being used by some of the largest sites / apps in the world including Uber, Slack, GitHub, Square... But sure, "it's BS, super custom, and irrelevant". Gee, yer super smart! Thank you for the amazing insights. 5 stars.

mattlord 2 months ago

Blog post author here. I'm happy to answer any related questions you may have.

redwood 2 months ago

That 400TB in the image is a large database! I'm guessing that's not the largest in the PlanetScale fleet either. Very impressive and a reminder that you're strongly differentiated against some of the recent database upstarts in terms of battle tested mission critical scale. Out of curiosity how many of these large clusters are using your true managed 'as a service' offering or are they mostly in the bring your own cloud mode? Do you offer zero downtime migrations from bring your own cloud to true as a service?
- mattlord 2 months ago
  
  That particular cluster has grown significantly since the post was written, and yes there are now quite a few others that are challenging it for the "largest" claim. :-)
  These larger ones are fully using the PlanetScale SaaS, but they are using Managed -- meaning that there are resources dedicated to and owned by them. You can read more about that here: https://planetscale.com/docs/vitess/managed
  All of the PlanetScale features, including imports and online schema migrations or deployment requests (https://planetscale.com/docs/vitess/schema-changes/deploy-re...) are fully supported with PlanetScale Managed.
  - redwood 2 months ago
    
    Understood: that's great for your customers' EDP negotiations with their cloud providers!
willquack 2 months ago

> you can run an initial VDiff, and then resume that one as you get closer to the cutover point.
VDiff (v2) only compares the source and destination at a specific point in time with resume only comparing rows with PK higher than the last one compared before it was paused. I assume this means:
1. VDiff doesn't catch updates to rows with PK lower than the point it was paused which could have become corrupt, and
2. VDiff doesn't continuously validate CDC changes, meaning (unless you enforce extra downtime to run / resume a VDiff) you can never be 100% sure your data is valid before SwitchTraffic
I'm curious if this is something customers even care about, or is point-in-time data validation sufficient to catch any issues that could occur during migrations?
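
A toy model of the concern in point 1, assuming (as described above) that a resumed diff only looks at rows with a PK above the last one it compared. This is not VDiff's code. `diff_rows` and the dict-based tables are hypothetical stand-ins.

```python
def diff_rows(source, target, after_pk=None):
    """Compare rows in PK order, optionally only those with PK > after_pk.

    Returns (mismatched_pks, last_pk_compared).
    """
    mismatched = []
    last_pk = after_pk
    for pk in sorted(set(source) | set(target)):
        if after_pk is not None and pk <= after_pk:
            continue  # already covered by the earlier run
        if source.get(pk) != target.get(pk):
            mismatched.append(pk)
        last_pk = pk
    return mismatched, last_pk


source = {1: "a", 2: "b", 3: "c"}
target = {1: "a", 2: "b", 3: "c"}

# Initial diff: everything matches, and we remember where we stopped.
mismatched, last_pk = diff_rows(source, target)

# Later: a new row arrives, and an already-compared row diverges on the target.
source[4] = "d"
target[4] = "d"
target[2] = "corrupted"

resumed, _ = diff_rows(source, target, after_pk=last_pk)  # only looks at PK 4
fresh, _ = diff_rows(source, target)                      # rescans everything

print(resumed)  # []
print(fresh)    # [2]
```

In this model only a fresh full comparison notices the divergence at PK 2, and even that result is valid only for the point in time at which it ran.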
- mattlord 2 months ago
  
  You are correct about resuming. If you do an initial VDiff and then resume that same VDiff say 1 month later it would only diff rows with a higher PK value.
  But there's also nothing stopping you from doing a new VDiff to cover all data at that later point in time.
  - freakynit 2 months ago
    
    "But there's also nothing stopping you from doing a new VDiff to cover all data at that later point in time." --- isn't this just pushing the same issue forward in time? How is data consistency maintained if a customer reverts to the original after the new one has already served a few requests?
    
    mattlord 2 months ago
    
    It's open source. If you really want to know these things, I would encourage you to look at the code and read the documentation. As noted in the blog post, reverse vreplication is set up when you switch. You can switch back and forth and nothing is lost.
    https://github.com/vitessio/vitess
    https://vitess.io/docs/reference/vreplication/
    "isn't this just pushing the same issue forward in time?" I don't understand what you are trying to say here. You can only compare the two sides / databases at the same logical point in time. While you are doing this comparison at that point in time, the timeline continues to progress. Unless you want to stop the world and prevent writes for the full duration of the diff (which can be days or even weeks).
  - willquack 2 months ago
    
    Thanks for responding!!
    I think it's still the same issue where data modified after the VDiff point in time isn't validated before SwitchTraffic. I'm mostly curious how vitess users handle this case, or if any users even care about this case in the first place?
    Is there no demand for continuous data validation similar to what TiDB offers?
    Do people who care about 100% correct data validation just accept the downtime required to run a full VDiff before SwitchTraffic?
l5870uoo9y 2 months ago

What does it cost to host a 400TB database?
- freakynit 2 months ago
  
  Enterprise-grade NVMe SSDs typically cost around $150/TB. For an RF of 3, this comes to around 400 x 3 x 150 = 180K USD. With a minimum 5-year lifecycle for these enterprise SSDs, we are looking at 36K USD/year.
  Going through their pricing (https://planetscale.com/pricing?engine=vitess&cluster=M-5120...), for just 15TB storage with RF=3, the pricing comes to around 24000 USD/MONTH, not year. Adjusted for 400TB and per year, this becomes about 7.7 million USD. Of course, you also get a lot more, but the difference is just insane.
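
The arithmetic in the comment above, written out. All inputs are the figures quoted there. The linear scaling from 15 TB to 400 TB is the comment's own simplification, not a real quote.

```python
# Figures quoted in the comment above.
DATA_TB = 400
REPLICATION_FACTOR = 3
SSD_USD_PER_TB = 150
SSD_LIFETIME_YEARS = 5

QUOTED_PLAN_TB = 15
QUOTED_PLAN_USD_PER_MONTH = 24_000

raw_ssd_total = DATA_TB * REPLICATION_FACTOR * SSD_USD_PER_TB
raw_ssd_per_year = raw_ssd_total / SSD_LIFETIME_YEARS

# Naive linear scaling of the quoted plan to 400 TB. The replication factor
# is already included in the quoted plan price, so it is not applied again.
managed_per_year = QUOTED_PLAN_USD_PER_MONTH * 12 * DATA_TB / QUOTED_PLAN_TB

print(f"raw SSDs: ${raw_ssd_total:,} total, ${raw_ssd_per_year:,.0f}/year")
print(f"scaled plan: ${managed_per_year:,.0f}/year")
print(f"ratio: {managed_per_year / raw_ssd_per_year:.0f}x")
```

This compares bare drives against a managed service that includes compute, so the ratio says more about what is being compared than about the price of storage.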
  - Dylan16807 2 months ago
    
    That comparison doesn't make any sense at all, and you can't excuse it by tossing out "Of course, you also get a lot more". This is like evaluating the price of wheels by buying entire cars. You wouldn't get dozens of these servers just for capacity, you'd get a custom quote.
    That said at $24K you could pay off an entire server like that from Dell in 4 months despite Dell charging something stupid like $2000/TB.
    
    freakynit 2 months ago
    
    Let's hear your numbers then.
    
    Dylan16807 2 months ago
    
    Your numbers are basically fine for what you're measuring, if you round up to factor in actually having servers to put the storage drives into. So 40-50k instead of 36k.
    The issue is your budget is for 400TB of data but minimal requests per second. That's a valid thing to consider, but it's extremely apples and oranges to a fleet of 75 high powered servers.
    To put it a different way, their prices are pretty high but the calculation of powerful servers costing 40x as much as raw storage isn't "insane".

WaitWaitWha 2 months ago

I split step 4 in their "high level, this is the general flow for data migrations".

4.0 Freeze old system

4.1 Cut over application traffic to the new system.

4.2 Merge any diff that accumulated between the snapshot (step 1) and the cutover (4.1)

4.3 Go live

To me, the above reduces the pressure on downtime because the merge between freeze and go-live is significantly smaller than trying to go live with the entire environment. If timed well, the diff could be minuscule.

What they are describing is basically live-mirroring the resource. Okay, that is fancy and nice. I'd love to be able to do that. Some of us have mildly chewed bubble gum, a foot of duct tape, and a shoestring.

dheera 2 months ago

Yeah it depends on what the system is.
Lots of systems can tolerate a lot more downtime than the armchair VPs want them to have.
If people don't have access to Instagram for 6 hours, the world won't end. Gmail or AWS S3 is a different story. Therefore Instagram should give their engineers a break and permit a migration with downtime. It makes the job a lot easier, requires fewer engineers and less money, and is much less likely to have bugs.

redwood 2 months ago

Worth underlining that this is data migrations from one database server or system to another rather than schema migrations

ksec 2 months ago

Missing 2024 in the Title.
