Reliability, constant work, and a good cup of coffee

It’s hard to think of a more critical function than health checks. If an instance, server, or Availability Zone loses power or networking, health checks notice and ensure that requests and traffic are directed elsewhere. Health checks are integrated into the Amazon Route 53 DNS service, into Elastic Load Balancing load balancers, and other services. Here we cover how the Route 53 health checks work. They’re the most critical of all. If DNS isn’t sending traffic to healthy endpoints, there’s no other opportunity to recover.

From a customer’s perspective, Route 53 health checks work by associating a DNS name with two or more answers (like the IP addresses for a service’s endpoints). The answers might be weighted, or they might be in a primary and secondary configuration, where one answer takes precedence as long as it’s healthy. The health of an endpoint is determined by associating each potential answer with a health check. Health checks are created by configuring a target, usually the same IP address that’s in the answer, along with settings such as a port, a protocol, timeouts, and so on. If you use Elastic Load Balancing, Amazon Relational Database Service, or any number of other AWS services that use Route 53 for high availability and failover, those services configure all of this in Route 53 on your behalf.

Route 53 has a fleet of health checkers, broadly distributed across many AWS Regions. There’s a lot of redundancy. Every few seconds, tens of health checkers send requests to their targets and check the results. These health-check results are then sent to a smaller fleet of aggregators. It’s at this point that some smart logic about health-check sensitivity is applied. Just because one of the ten in the latest round of health checks failed doesn’t mean the target is unhealthy. Health checks can be subject to noise. The aggregators apply some conditioning. For example, we might only consider a target unhealthy if at least three individual health checks have failed. Customers can configure these options too, so the aggregators apply whatever logic a customer has configured for each of their targets.
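The conditioning step can be pictured as a simple threshold over the latest round of results. What follows is a minimal sketch, not Route 53’s actual implementation: the function name `aggregate_health` and its parameters are hypothetical, and the default threshold of three is taken from the example above.

```python
from typing import Sequence


def aggregate_health(results: Sequence[bool], failure_threshold: int = 3) -> bool:
    """Return True if the target should be considered healthy.

    `results` holds one boolean per health checker for the latest round
    (True means the check passed). The target is treated as unhealthy only
    when at least `failure_threshold` individual checks failed.
    """
    failures = sum(1 for passed in results if not passed)
    return failures < failure_threshold


# One failure out of ten is treated as noise; three failures are not.
round_with_noise = [True] * 9 + [False]
round_with_outage = [True] * 7 + [False] * 3
print(aggregate_health(round_with_noise))   # True
print(aggregate_health(round_with_outage))  # False
```

Note that the function counts every result in the round regardless of outcome, so the amount of work does not depend on how many checks failed.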

So far, everything we’ve described lends itself to constant work. It doesn’t matter whether the targets are healthy or unhealthy; the health checkers and aggregators do the same work every time. Of course, customers might configure new health checks, against new targets, and each one adds slightly to the work that the health checkers and aggregators are doing. But we don’t need to worry about that as much.

One reason why we don’t worry about these new customer configurations is that our health checkers and aggregators use a cellular design. We’ve tested how many health checks each cell can sustain, and we always know where each health checking cell is relative to that limit. If the system starts approaching those limits, we add another health checking cell or aggregator cell, whichever is needed.

The next reason not to worry might be the best trick in this whole article. Even when there are only a few health checks active, the health checkers send a set of results to the aggregators that is sized to the maximum. For example, if only 10 health checks are configured on a particular health checker, it’s still constantly sending out a set of (for example) 10,000 results, if that’s how many health checks it could ultimately support. The other 9,990 entries are dummies. However, this ensures that the network load, as well as the work the aggregators are doing, won’t increase as customers configure more health checks. That’s a significant source of variance … gone.
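To make the padding idea concrete, here is a small sketch of a checker that always emits a full-size result set. The `Result` layout, the `build_result_set` name, and the slot-based addressing are assumptions made for illustration; only the 10,000 figure comes from the example above.

```python
from dataclasses import dataclass
from typing import Dict, List

MAX_CHECKS = 10_000  # example capacity of one health checker, from the text


@dataclass(frozen=True)
class Result:
    slot: int     # position in the fixed-size set
    active: bool  # False marks a dummy entry
    passed: bool  # meaningful only when `active` is True


def build_result_set(live: Dict[int, bool], capacity: int = MAX_CHECKS) -> List[Result]:
    """Return exactly `capacity` results, padding unused slots with dummies.

    `live` maps a slot number to the outcome of the configured check there.
    """
    if any(not 0 <= slot < capacity for slot in live):
        raise ValueError("slot outside this checker's capacity")
    return [
        Result(slot=slot, active=slot in live, passed=live.get(slot, False))
        for slot in range(capacity)
    ]


results = build_result_set({0: True, 1: False})
print(len(results))                    # 10000
print(sum(r.active for r in results))  # 2
```

Whether two checks or ten thousand are configured, the set that goes over the network has the same length, which is the property the paragraph above relies on.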

What’s most important is that even if a very large number of targets start failing their health checks all at once—say, for example, as the result of an Availability Zone losing power—it won’t make any difference to the health checkers or aggregators. They do what they were already doing. In fact, the overall system might do a little less work. That’s because some of the redundant health checkers might themselves be in the impacted Availability Zone.

So far so good. Route 53 can check the health of targets and aggregate those health check results using a constant work pattern. But that’s not very useful on its own. We need to do something with those health check results. This is where things get interesting. It would be very natural to take our health check results and to turn them into DNS changes. We could compare the latest health check status to the previous one. If a status turns unhealthy, we’d create an API request to remove any associated answers from DNS. If a status turns healthy, we’d add it back. Or to avoid adding and removing records, we could support some kind of “is active” flag that could be set or unset on demand.

If you think of Route 53 as a sort of database, this appears to make sense, but that would be a mistake. First, a single health check might be associated with many DNS answers. The same IP address might appear many times for different DNS names. When a health check fails, making a change might mean updating one record, or hundreds. Next, in the unlikely event that an Availability Zone loses power, tens of thousands of health checks might start failing, all at the same time. There could be millions of DNS changes to make. That would take a while, and it’s not a good way to respond to an event like a loss of power.
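A little arithmetic shows how the work in a change-driven design scales with the size of the event. The figures below (20,000 health checks, each tied to 100 records) are illustrative assumptions chosen to match the orders of magnitude mentioned above, and `changes_required` is a hypothetical helper.

```python
from typing import Dict, Iterable


def changes_required(records_per_check: Dict[int, int], changed: Iterable[int]) -> int:
    """Count the DNS record updates a change-driven design would need."""
    return sum(records_per_check[check_id] for check_id in changed)


# Illustrative figures only: 20,000 health checks, each tied to 100 records.
records_per_check = {check_id: 100 for check_id in range(20_000)}

quiet_period = changes_required(records_per_check, changed=[])
zone_outage = changes_required(records_per_check, changed=range(20_000))
print(quiet_period)  # 0
print(zone_outage)   # 2000000
```

The workload swings from nothing to millions of updates at exactly the moment the system is under the most stress.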

The Route 53 design is different. Every few seconds, the health check aggregators send a fixed-size table of health check statuses to the Route 53 DNS servers. When the DNS servers receive it, they store the table in memory, pretty much as-is. That’s a constant work pattern. Every few seconds, receive a table, store it in memory. Why does Route 53 push the data to the DNS servers, rather than have the DNS servers pull it? That’s because there are more DNS servers than there are health check aggregators. If you want to learn more about these design choices, check out Joe Magerramov’s article on putting the smaller service in control.

Next, when a Route 53 DNS server gets a DNS query, it looks up all of the potential answers for a name. Then, at query time, it cross-references these answers with the relevant health check statuses from the in-memory table. If a potential answer’s status is healthy, that answer is eligible for selection. What’s more, even if the first answer it tried is healthy and eligible, the server checks the other potential answers anyway. This approach ensures that even if a status changes, the DNS server is still performing the same work that it was before. There’s no increase in scan or retrieval time.
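The push-and-lookup pattern described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the real server: the `DnsServer` class, the tiny `TABLE_SIZE`, the record layout, and the choice to treat every slot as unhealthy before the first table arrives are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

TABLE_SIZE = 8  # hypothetical; a real table would be far larger


class DnsServer:
    def __init__(self, records: Dict[str, List[Tuple[str, int]]]) -> None:
        # name -> list of (answer, health check slot); this layout is an assumption
        self.records = records
        # Assumption: until the first table arrives, every slot is unhealthy.
        self.health_table: Tuple[bool, ...] = (False,) * TABLE_SIZE

    def receive_table(self, table: Sequence[bool]) -> None:
        """Store the pushed table as-is; the work is the same every time."""
        if len(table) != TABLE_SIZE:
            raise ValueError("health table must have a fixed size")
        self.health_table = tuple(table)

    def eligible_answers(self, name: str) -> List[str]:
        """Check every potential answer, with no early exit on the first healthy one."""
        eligible = []
        for answer, slot in self.records.get(name, []):
            if self.health_table[slot]:
                eligible.append(answer)
        return eligible


server = DnsServer({"example.com": [("192.0.2.1", 0), ("192.0.2.2", 1)]})
server.receive_table([True, True] + [False] * 6)
print(server.eligible_answers("example.com"))  # ['192.0.2.1', '192.0.2.2']
server.receive_table([False, True] + [False] * 6)
print(server.eligible_answers("example.com"))  # ['192.0.2.2']
```

The loop visits both answers in both cases. A status change alters which answers come back, not how many table lookups the server performs.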

I like to think that the DNS servers simply don’t care how many health checks are healthy or unhealthy, or how many suddenly change status; the code performs the very same actions. There’s no new mode of operation here. We didn’t make a large set of changes, nor did we pull a lever that activated some kind of “Availability Zone unreachable” mode. The only difference is the answers that Route 53 chooses as results. The same memory is accessed and the same amount of compute time is spent. That makes the process extremely reliable.