Zero Downtime Datacenter Migration


Through some clever manipulation we migrated one of our core services to a new datacenter in the middle of a weekday during peak traffic without downtime. This is the story of how we did it.

Why the move?

First, some back story on why we moved datacenters in the first place.

We’ve been longtime Slicehost customers here at Wistia. All our machines were in their original datacenter in St. Louis (STL-A, if you’re in the know).

Earlier this year, we learned that Rackspace (who now owns Slicehost) was planning to do away with that datacenter, and we’d be forced to move.

We were a bit nervous when we first heard this because our architecture was no longer just a few boxes, and we’re fortunate to now have lots of customers who depend on our software. No more secret 4am database juggling while watching the logs like hawks to make sure no one is using the service. People don’t stop using it now!

Which service to move first?

So what does our architecture look like? The three major components are the Wistia application itself, our video encoding platform (which we call the Bakery), and our video analytics platform (the Distillery).

They’re all extremely different in the types of resources they need and the components that make them up.

After some noodling, we decided the Bakery was the best candidate to move first. We were still working on a big upgrade to the Distillery that wasn’t quite ready yet, and the Wistia app is particularly tricky to move because it has hundreds of touch points with the customer and there are lots of little things that can go wrong (as opposed to a service consisting solely of a tight API).

The Bakery’s architecture

So what’s the Bakery look like? It’s pretty simple, really. There are three components: a database, Primes, and Breadroutes.

We have a single MySQL database that stores information about all the media customers upload. It’s nothing special. Our schema is very minimal, and there’s not any significant load on the database.

Then there’s what we call Primes. These are the main building block of the Bakery. Each Prime is a standalone Bakery in and of itself. It can accept media uploads, transcode video, store video, and serve up video. The actual pieces of software doing the work here are Nginx, Unicorns running a Rails app, and a custom task processing system written in Ruby called the Oven (keeping with the Bakery analogy, obviously).

Finally, there’s the Breadroute. This is a routing layer that sits in front of the Primes and balances traffic. It’s not a simple round-robin load balancer though. It has access to the database so it can make smart decisions about where to route each request. For instance, if you request a video and it’s available locally on a Prime in the cluster, it will route your request to that box. In this way, the Breadroutes allow all the Prime boxes to function together as a unit. The Breadroute is made up of four Ruby proxy servers built on top of Tom Preston-Werner’s lovely proxy_machine, all sitting behind HAProxy.
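To make those “smart decisions” concrete, here’s a minimal Ruby sketch of the routing idea. The class and field names are invented for illustration; this isn’t the actual Breadroute code, just the shape of it: prefer a healthy Prime that already holds the media, otherwise pick any healthy Prime.

```ruby
# Hypothetical sketch of a database-backed routing decision.
# `locations` stands in for what the Breadroute reads from the DB:
# a map of media id => hosts that have that media stored locally.

Prime = Struct.new(:host, :up) do
  def healthy?
    up
  end
end

class Breadroute
  def initialize(primes, locations)
    @primes = primes        # all Prime boxes in the pool
    @locations = locations  # media_id => [hosts storing it]
  end

  # Pick a backend for a request, preferring a local copy of the media.
  def route(media_id)
    local_hosts = @locations.fetch(media_id, [])
    candidates = @primes.select(&:healthy?)
    candidates.find { |p| local_hosts.include?(p.host) } || candidates.sample
  end
end
```

The fallback (`candidates.sample`) is the key property: even when the database says nothing useful about a request, every healthy Prime is a valid answer, which is what makes the pool behave like one big unit.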

[Image: diagram of the Bakery architecture]

Above is a beautiful rendition of how this all comes together. BR is for Breadroute, P is for Prime, and you can probably guess which one the database is. I just realized I forgot to draw the connections to the database. Well, everything is connected to the database! Spoiler alert: don’t read what’s in that red box! Details on that are in the next section.

The migration strategy

The best migrations are the ones where, at each step of the way, you can easily move both forward with the plan and backward. Through years of doing this, I’ve developed a healthy fear of migrations with a cliff: ones where there’s that one step that, once you take it, commits you all the way – there’s no going back.

Sometimes the cliff scenario can’t be avoided, and it’s often the most efficient path. But it sure as hell is scary, and it’s something I go out of my way to avoid.

Luckily for us, this migration was a shining example of avoiding the cliff.

The key to seamlessly migrating the Bakery lay in the Breadroute. Instead of having the Breadroute boxes route only to Prime boxes on Slicehost, we could make them route to Primes in the new datacenter as well.

Once we realized this, the rest fell into place. Here’s what we did.

Phase I: The Setup

1. Command Center in the Rocketship

Ben and I set up a command center in the downstairs conference room (dubbed the Rocketship, see photo). We made a pact not to leave the room until the migration was complete. Blast off.

[Image: the Rocketship command center]

2. Clone of Slicehost

Set up a rough clone of what we have in Slicehost at Rackspace. We need a bunch of Prime boxes, a few Breadroutes, and a database.

These steps were very straightforward thanks to some help from the guys at Rackspace. We were able to move an image of one of our Prime boxes from Slicehost to Rackspace. Once it was over there, we cloned it a bunch of times.

The Breadroute boxes were provisioned from scratch. We have an internal tool (called Doomcrank) that’s kind of like Puppet or Chef, and we used that to build these boxes.

And the database isn’t much more than an “apt-get install mysql-server”.

3. Master-Master MySQL replication

Enable master-master replication between the databases in both datacenters. By master-master, I mean that we can read from and write to either database and it will be replicated to the other.

This was my first experience with MySQL replication, and I was surprised how easy it was to set up.

Here’s my writeup of how to do master-master MySQL replication.
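In broad strokes (with example hostnames and credentials, not our actual config): each box gets a unique server id and binary logging, the auto-increment settings are interleaved so writes on both sides can’t collide on primary keys, and then each database is pointed at the other as a replication master. Here’s an illustrative Ruby helper that builds the statements for one side of the pair:

```ruby
# Illustrative only: builds the replication statements you'd run on one
# box of a master-master pair. All values here are placeholders.

def replication_sql(peer_host:, repl_user:, repl_pass:, log_file:, log_pos:)
  [
    "CHANGE MASTER TO MASTER_HOST='#{peer_host}', " \
    "MASTER_USER='#{repl_user}', MASTER_PASSWORD='#{repl_pass}', " \
    "MASTER_LOG_FILE='#{log_file}', MASTER_LOG_POS=#{log_pos};",
    "START SLAVE;"
  ]
end

# And in my.cnf on each box, something along these lines (the second
# box gets server-id = 2 and auto_increment_offset = 2):
#
#   server-id                = 1
#   log-bin                  = mysql-bin
#   auto_increment_increment = 2
#   auto_increment_offset    = 1
```

The interleaved auto-increment settings are what make simultaneous writes on both sides safe: one database hands out odd ids, the other even.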

Phase II: The Transition

This is where we started to actually shift traffic from Slicehost to Rackspace.

1. Slowly allow Slicehost Breadroutes to also route to Rackspace Primes.

Because the Breadroutes are database-backed, we have the ability to easily control where they route their traffic. Normally they’re proxying to Primes on the local private network, but they can proxy over the public internet just the same!
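One way to picture the “slowly” part: if each Prime’s pool entry carries a weight in the database, shifting traffic is just a matter of raising Rackspace weights and lowering Slicehost ones. This is a sketch of that idea with invented names, not the actual Breadroute schema:

```ruby
# Sketch of gradual, database-backed traffic shifting. Setting a new
# Prime's weight above zero gives it a slice of traffic; setting an old
# Prime's weight to zero drains it, with no code deploy either way.

PrimeEntry = Struct.new(:host, :weight)

def pick_prime(pool)
  active = pool.select { |p| p.weight > 0 }
  total = active.sum(&:weight)
  target = rand(total) # integer in 0...total
  active.each do |p|
    return p if target < p.weight
    target -= p.weight
  end
end
```

Because the weights live in the database, both reverting and advancing the migration are single-row updates, which is exactly the “easy to move forward and backward” property described above.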

2. Slowly take all Slicehost Primes out of the loop.

We verified that traffic was being served via Rackspace Primes and that things were looking good. Then we started taking Slicehost Primes out of the pool.

3. Move DNS for the service to point at the Rackspace Breadroutes.

Once all traffic was being handled by Primes in Rackspace (and all Slicehost Primes were out of the loop), we shifted prime.wistia.com to point at the Rackspace Breadroutes so they would handle all incoming traffic.

Before we did this, though, I edited my /etc/hosts file to map prime.wistia.com to the new Breadroutes to smoke test the whole thing.
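The hosts-file trick works because your machine resolves the name locally before DNS ever gets asked, so only you see the new boxes. Here’s a small Ruby illustration of that behavior using a hosts-format file (the IP is a documentation placeholder, not a real Breadroute address):

```ruby
require "resolv"
require "tempfile"

# Simulate an /etc/hosts override: map prime.wistia.com to a chosen IP
# in a hosts-format file and confirm resolution picks it up.
hosts = Tempfile.new("hosts")
hosts.write("203.0.113.10 prime.wistia.com\n")
hosts.flush

resolver = Resolv::Hosts.new(hosts.path)
resolver.getaddress("prime.wistia.com") # => "203.0.113.10"
```

In the real smoke test you’d put the line in /etc/hosts itself, hit the service in a browser, and then delete the line once you’re satisfied.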

4. Triple check everything

After everything was moved over to Rackspace, we kept a really close eye on it for an hour or so. The whole migration went so eerily well that we assumed we must have done something wrong and just hadn’t caught it.

5. Party

We finally convinced ourselves that everything was right, and went out for beers at the Burren right around 6pm. We were both pretty sure this whole thing was going to take us well past midnight, so finishing early was a welcome surprise!

Recap

The nice thing about this migration was that the steps were fluid. We could easily revert any change if the slightest thing went wrong. This allowed the whole process to operate at a methodical and comfortable pace, and in my experience, that’s always very welcome when doing something this important.

Happy migrating!