Automatic Redis Failover for Ruby

29 points by ryanlecompte 14 years ago · 11 comments · 3 min read


I just released redis_failover, a new Ruby gem that provides automatic Redis failover support for Ruby. From the project README:

Redis Failover attempts to provide a fully automatic master/slave failover solution for Ruby. Redis does not provide an automatic failover capability when configured for master/slave replication. When the master node dies, a new master must be manually brought online and assigned as the slave's new master. This manual switch-over is not desirable in high-traffic sites where Redis is a critical part of the overall architecture. The existing standard Redis client for Ruby also only supports configuration for a single Redis server. When using master/slave replication, it is desirable to have all writes go to the master, and all reads go to one of the N configured slaves.

This gem attempts to address both the server and client problems. A redis failover server runs as a background daemon and monitors all of your configured master/slave nodes. When the server starts up, it automatically discovers who is the master and who are the slaves. Watchers are set up for each of the redis nodes. As soon as a node is detected as being offline, it will be moved to an "unreachable" state. If the node that went offline was the master, then one of the slaves will be promoted as the new master. All existing slaves will be automatically reconfigured to point to the new master for replication. All nodes marked as unreachable will be periodically checked to see if they have been brought back online. If so, the newly reachable nodes will be configured as slaves and brought back into the list of live servers. Note that detection of a node going down should be nearly instantaneous, since the mechanism used to keep tabs on a node is a blocking Redis BLPOP call (no polling). This call fails nearly immediately when the node actually goes offline.
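To make the promotion step described above concrete, here is a minimal in-memory sketch. It is not the gem's actual code: the `Node` struct, the `handle_failure` method, and the "first reachable slave wins" rule are all assumptions made for illustration, and no real Redis connection is involved.

```ruby
# Hypothetical in-memory model of the failover step; not the gem's real code.
Node = Struct.new(:name, :role, :reachable, :master_name)

# Marks the failed node as unreachable and, if it was the master, promotes
# a slave. Returns the name of the master after the step (or nil if none).
def handle_failure(nodes, failed_name)
  failed = nodes.find { |n| n.name == failed_name }
  raise ArgumentError, "unknown node #{failed_name}" unless failed

  was_master = failed.role == :master
  failed.reachable = false
  failed.role = :unreachable

  if was_master
    # Assumption: any reachable slave is an acceptable promotion candidate.
    candidate = nodes.find { |n| n.reachable && n.role == :slave }
    raise "no slave available for promotion" unless candidate
    candidate.role = :master
    candidate.master_name = nil
  end

  master = nodes.find { |n| n.role == :master }
  # Repoint every remaining slave at the current master.
  nodes.each { |n| n.master_name = master.name if master && n.role == :slave }
  master && master.name
end

nodes = [
  Node.new("redis1:6379", :master, true, nil),
  Node.new("redis2:6379", :slave, true, "redis1:6379"),
  Node.new("redis3:6379", :slave, true, "redis1:6379"),
]
puts handle_failure(nodes, "redis1:6379") # prints "redis2:6379"
```

In a real deployment the repointing step would presumably issue a replication command to each slave; the sketch only updates the bookkeeping.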

This gem provides a RedisFailover::Client wrapper that is master/slave aware. The client is configured with a single host/port pair that points to the redis failover server. The client will automatically connect to the server to find out the current state of the world (i.e., who's the current master and who are the current slaves). The client also acts as a load balancer in that it will automatically dispatch Redis read operations to one of N slaves, and Redis write operations to the master. If it fails to communicate with any node, it will go back and ask the server for the current list of available servers, and then optionally retry the operation.
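The read/write split can be sketched roughly as follows. This is only an illustration of the idea, not RedisFailover::Client itself: the `RoutingClient` class, the `READ_COMMANDS` list (a small, incomplete subset), and the round-robin slave selection are assumptions, and plain strings stand in for connections.

```ruby
require "set"

# Assumption: a small illustrative subset of read-only commands.
# Anything not listed here is treated as a write and sent to the master.
READ_COMMANDS = Set[:get, :mget, :exists, :lrange, :smembers, :ttl].freeze

class RoutingClient
  def initialize(master, slaves)
    @master = master
    @slaves = slaves
    @next_slave = 0
  end

  # Returns the connection that should serve the given command.
  def connection_for(command)
    is_read = READ_COMMANDS.include?(command.to_sym)
    return @master if !is_read || @slaves.empty?

    # Round-robin across slaves (one possible balancing policy).
    slave = @slaves[@next_slave % @slaves.size]
    @next_slave += 1
    slave
  end
end

client = RoutingClient.new("master:6379", ["slave1:6379", "slave2:6379"])
puts client.connection_for(:set) # prints "master:6379"
puts client.connection_for(:get) # prints "slave1:6379"
puts client.connection_for(:get) # prints "slave2:6379"
```

Sending unknown commands to the master is the conservative choice here, since a write that lands on a slave would fail or be lost.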

antirez 14 years ago

In case you wonder, after Redis 2.6 RC1 this is my #1 commitment: to provide a standard Redis failover tool. Note that Redis Cluster, which will also be one of the big focuses after 2.6, is not the real fix for this: many users just have two instances, one master and a slave for failover, or multiple instances that are coupled this way and are conceptually single servers. Often they don't actually need Redis Cluster, or even cannot use Redis Cluster (because it does not implement the full Redis API, only a subset). So what happens? All these users have to invent an HA system for Redis again and again.

It's still a work in progress, but the idea is that the standard Redis failover will be based on a stand-alone daemon called redis-sentinel that you can place at different positions in your network. It talks with other redis-sentinels, and if the right conditions are met the failover is performed. So there is no proxy or the like, and the server itself will not be touched.

ryanlecompteOP 14 years ago

FYI, redis_failover has now been rewritten to sit on top of ZooKeeper to deal with network partitions, stability, and data consistency. From the README:

redis_failover attempts to provide a fully automatic master/slave failover solution for Ruby. Redis does not provide an automatic failover capability when configured for master/slave replication. When the master node dies, a new master must be manually brought online and assigned as the slave's new master. This manual switch-over is not desirable in high-traffic sites where Redis is a critical part of the overall architecture. The existing standard Redis client for Ruby also only supports configuration for a single Redis server. When using master/slave replication, it is desirable to have all writes go to the master, and all reads go to one of the N configured slaves.

This gem attempts to address these failover scenarios. A redis failover Node Manager daemon runs as a background process and monitors all of your configured master/slave nodes. When the daemon starts up, it automatically discovers the current master/slaves. Background watchers are set up for each of the redis nodes. As soon as a node is detected as being offline, it will be moved to an "unavailable" state. If the node that went offline was the master, then one of the slaves will be promoted as the new master. All existing slaves will be automatically reconfigured to point to the new master for replication. All nodes marked as unavailable will be periodically checked to see if they have been brought back online. If so, the newly available nodes will be configured as slaves and brought back into the list of available nodes. Note that detection of a node going down should be nearly instantaneous, since the mechanism used to keep tabs on a node is a blocking Redis BLPOP call (no polling). This call fails nearly immediately when the node actually goes offline. To avoid false positives (e.g., intermittent network interruptions), the Node Manager will only mark a node as unavailable if it fails to communicate with it 3 times (this is configurable via --max-failures, see configuration options below).
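The failure threshold could be modeled along these lines. This is a sketch, not the Node Manager's implementation: the `FailureTracker` class is hypothetical, and resetting the count after a successful check (so that only consecutive failures count) is an assumption the README text does not spell out.

```ruby
# Hypothetical helper; the names and the reset-on-success rule are assumptions.
class FailureTracker
  def initialize(max_failures = 3)
    @max_failures = max_failures
    @failures = Hash.new(0)
  end

  # Records one failed check. Returns true once the node has failed
  # max_failures times and should be marked unavailable.
  def record_failure(node)
    @failures[node] += 1
    @failures[node] >= @max_failures
  end

  # Records a successful check and clears the node's failure count.
  def record_success(node)
    @failures.delete(node)
  end
end

tracker = FailureTracker.new(3)
puts tracker.record_failure("redis1:6379") # prints "false"
puts tracker.record_failure("redis1:6379") # prints "false"
puts tracker.record_failure("redis1:6379") # prints "true"
```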

This gem provides a RedisFailover::Client wrapper that is master/slave aware. The client is configured with a list of ZooKeeper servers. The client will automatically contact the ZooKeeper cluster to find out the current state of the world (i.e., who is the current master and who are the current slaves). The client also sets up a ZooKeeper watcher for the set of redis nodes controlled by the Node Manager daemon. When the daemon promotes a new master or detects a node as going down, ZooKeeper will notify the client near-instantaneously so that it can rebuild its set of Redis connections. The client also acts as a load balancer in that it will automatically dispatch Redis read operations to one of N slaves, and Redis write operations to the master. If it fails to communicate with any node, it will go back and fetch the current list of available servers, and then optionally retry the operation.

ryanlecompteOP 14 years ago

From the README:

clofresh 14 years ago

So just to clarify, the client code still connects directly to the actual redis instance, it doesn't connect to the failover daemon as a proxy? It would be useful to describe the connection algorithm in the README.

Also, the split of the reads and writes, that's done via opening connections to both the master and slave?

ryanlecompteOP 14 years ago

That's right. The client still maintains direct connections with the actual master/slaves. It's only when it fails to connect with one of them that it goes to the failover daemon to ask for the current set of available nodes. The split of the reads/writes is handled by the client, as it knows where to dispatch commands (to master for writes, and to one of the slaves for reads). I'll make this clearer in the README.
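The "ask the daemon only on failure, then retry" flow might look roughly like this. It is a sketch under assumptions: `with_failover`, `NodeUnavailable`, and the `fetch_nodes` callable (a stand-in for asking the failover daemon) are hypothetical names, not part of the gem's API.

```ruby
# Hypothetical error raised when a node cannot be reached.
class NodeUnavailable < StandardError; end

# fetch_nodes: a callable returning the current view of the nodes
# (a stand-in for asking the failover daemon).
# Yields that view to the block; on NodeUnavailable it refreshes the
# view and retries, up to max_retries times.
def with_failover(fetch_nodes, max_retries: 1)
  nodes = fetch_nodes.call
  attempts = 0
  begin
    yield nodes
  rescue NodeUnavailable
    raise if attempts >= max_retries
    attempts += 1
    nodes = fetch_nodes.call # refresh only after a failure
    retry
  end
end

# Simulated daemon: the first view still lists the dead master.
views = [{ master: "redis1:6379" }, { master: "redis2:6379" }]
fetch = -> { views.size > 1 ? views.shift : views.first }

result = with_failover(fetch) do |nodes|
  raise NodeUnavailable, nodes[:master] if nodes[:master] == "redis1:6379"
  "wrote to #{nodes[:master]}"
end
puts result # prints "wrote to redis2:6379"
```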

cheald 14 years ago

Awesome. I was implementing my own version of this with doozer and eventmachine, but I might just use this one instead!

DanWaterworth 14 years ago

What happens in the case of a network partition?

antirez 14 years ago

Can't reply for the solution posted in this article, but I think this is one of the main design concerns. For redis-sentinel (I described it in another comment in this thread) the trick is that you place the sentinels where you want and select a minimum number of sentinels that must agree before a failover, so what happens depends on where you place the sentinels and the minimum agreement you configure. It's easy to have the desired behavior this way.
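The minimum-agreement idea can be illustrated with a tiny sketch. redis-sentinel is described here as a work in progress, so this is not its algorithm: `failover_allowed?`, the `votes` hash, and the sentinel names are hypothetical.

```ruby
# votes: hash of sentinel name => true if that sentinel sees the master as down.
# Failover is allowed only when at least min_agreement sentinels agree.
def failover_allowed?(votes, min_agreement)
  votes.values.count(true) >= min_agreement
end

# Hypothetical partition: only one of three sentinels lost the master.
votes = { "sentinel-a" => true, "sentinel-b" => false, "sentinel-c" => false }
puts failover_allowed?(votes, 2) # prints "false"

# Two of three sentinels agree the master is down.
votes["sentinel-b"] = true
puts failover_allowed?(votes, 2) # prints "true"
```

With a threshold of 2 out of 3, a single sentinel cut off from the master cannot trigger a failover on its own.
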
- salimane 14 years ago
  
  something to keep in mind is that sometimes the redis server can't accept connections anymore because of limits etc., but the server is still serving old connections. so in that case, i think you don't want to just failover... the tricky part is to know if the server is really down
  - antirez 14 years ago
    
    There is no sane condition in Redis that will make it not reply at all AFAIK; even if you set maxclients to 1, the next clients will have an error returned (and the connection closed ASAP). But yes, it is important to understand what down means. I think one of the safest things to do is "down == unreachable". So if you don't get any reply at all, consecutively, for the configured amount of time, the server is down. And of course the other redis-sentinels have to agree for the failover to start.
    
    ryanlecompteOP 14 years ago
    
    The gem has a configurable --max-failures option that can be passed to the failover daemon. The daemon will only mark a node as being unreachable if it fails to ping it that many times (default 3). This might be something that can be improved too, but it was meant to avoid false positives.
