To operate services according to our high standards for availability and latency, we as service owners need to measure how our systems behave.
To get the necessary telemetry, service owners measure operational performance from multiple places, capturing multiple perspectives on how things behave end to end. This is complicated even in a simple architecture. Consider a service that customers call through a load balancer: the service talks to a remote cache and a remote database. We want each component to emit metrics about its own behavior, and we also want metrics on how each component perceives the behavior of the components it depends on. When metrics from all of these perspectives are brought together, a service owner can quickly track down the source of a problem and dig in to find its cause.
Many AWS services automatically provide operational insights about your resources. For example, Amazon DynamoDB provides Amazon CloudWatch metrics on success and error rates and latency, as measured by the service. However, when we build systems that use these services, we need much more visibility into how our own systems are behaving. Instrumentation requires explicit code that records how long tasks take, how often certain code paths are exercised, metadata about what each task was working on, and which parts of the task succeeded or failed. If a team doesn't add that explicit instrumentation, it is forced to operate its own service as a black box.
For example, if we implemented a service API operation that retrieved product information by product ID, the code might look like the following example. This code looks up product info in a local cache, followed by a remote cache, followed by a database:
public GetProductInfoResponse getProductInfo(GetProductInfoRequest request) {

    // check our local cache
    ProductInfo info = localCache.get(request.getProductId());

    // check the remote cache if we didn't find it in the local cache
    if (info == null) {
        info = remoteCache.get(request.getProductId());
        localCache.put(info);
    }

    // finally check the database if we didn't have it in either cache
    if (info == null) {
        info = db.query(request.getProductId());
        localCache.put(info);
        remoteCache.put(info);
    }

    // wrap the product info in the response object
    return new GetProductInfoResponse(info);
}
If I were operating this service, I'd need a lot of instrumentation in this code to understand its behavior in production. I'd need the ability to troubleshoot failed or slow requests, and to monitor for trends and signs that dependencies are underscaled or misbehaving. Here's that same code, annotated with a few of the questions I'd need to be able to answer about the production system as a whole or about a particular request:
public GetProductInfoResponse getProductInfo(GetProductInfoRequest request) {

    // Which product are we looking up?
    // Who called the API? What product category is this in?

    // Did we find the item in the local cache?
    ProductInfo info = localCache.get(request.getProductId());

    if (info == null) {
        // Was the item in the remote cache?
        // How long did it take to read from the remote cache?
        // How long did it take to deserialize the object from the cache?
        info = remoteCache.get(request.getProductId());

        // How full is the local cache?
        localCache.put(info);
    }

    // finally check the database if we didn't have it in either cache
    if (info == null) {
        // How long did the database query take?
        // Did the query succeed?
        // If it failed, is it because it timed out? Or was it an invalid query? Did we lose our database connection?
        // If it timed out, was our connection pool full? Did we fail to connect to the database? Or was it just slow to respond?
        info = db.query(request.getProductId());

        // How long did populating the caches take?
        // Were they full and did they evict other items?
        localCache.put(info);
        remoteCache.put(info);
    }

    // How big was this product info object?
    return new GetProductInfoResponse(info);
}
The code for answering all of those questions (and more) is quite a bit longer than the actual business logic. Some libraries can help reduce the amount of instrumentation code, but the developer still has to decide what visibility is needed, and then be intentional about wiring the instrumentation in.
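For instance, the timing and error counting that every dependency call needs can be factored into a single helper, so that each call site only names the operation it is measuring. The following is only a sketch of that idea; the Metrics interface and the metric names are hypothetical placeholders, not any particular library's API.

import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch of factoring repeated instrumentation into one helper.
// The Metrics interface and metric names are placeholders, not a specific
// metrics library's API.
final class Instrumentation {

    interface Metrics {
        void addTime(String name, long amount, TimeUnit unit);
        void addCount(String name, double value);
    }

    private final Metrics metrics;

    Instrumentation(Metrics metrics) {
        this.metrics = metrics;
    }

    // Wrap any dependency call with latency and error metrics so that call
    // sites only have to name the operation being measured.
    <T> T timed(String operationName, Supplier<T> call) {
        long startNanos = System.nanoTime();
        boolean failed = false;
        try {
            return call.get();
        } catch (RuntimeException e) {
            failed = true;
            throw e;
        } finally {
            metrics.addTime(operationName + ".Time", System.nanoTime() - startNanos, TimeUnit.NANOSECONDS);
            metrics.addCount(operationName + ".Error", failed ? 1 : 0);
        }
    }
}

With a helper like this, the remote cache read in the example above could become instrumentation.timed("RemoteCache.Get", () -> remoteCache.get(request.getProductId())), but the per-request metadata questions (which product, who called, how big the result was) still need explicit recording at the call site.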
When you troubleshoot a request that flows through a distributed system, it can be difficult to understand what happened if you look at the request through only one of those interactions. To piece the puzzle together, we find it helpful to pull all of the measurements about all of these systems together in one place. Before we can do that, each service must be instrumented to record a trace ID for each task and to propagate that trace ID to every other service that collaborates on that task. Collecting the instrumentation across systems for a given trace ID can be done either after the fact, as needed, or in near real time using a service like AWS X-Ray.
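The propagation itself can be as simple as accepting a trace ID from the caller (or generating one at the entry point), attaching it to every measurement and log line, and forwarding it on every outbound call. Here is a rough sketch of that idea; the X-Trace-Id header name and the TraceContext class are illustrative assumptions, not a specific tracing library's API.

import java.util.Optional;
import java.util.UUID;

// Illustrative only: a minimal trace context. The header name and class are
// assumptions for this sketch, not a particular tracing library's API.
final class TraceContext {

    static final String TRACE_HEADER = "X-Trace-Id";

    private final String traceId;

    private TraceContext(String traceId) {
        this.traceId = traceId;
    }

    // Reuse the caller's trace ID when it sent one; otherwise this service is
    // the start of the trace and generates a new ID.
    static TraceContext fromIncomingHeader(Optional<String> incomingTraceId) {
        return new TraceContext(incomingTraceId.orElseGet(() -> UUID.randomUUID().toString()));
    }

    // Record this ID with every measurement and log line, and set the same
    // header on every outbound request to collaborating services.
    String traceId() {
        return traceId;
    }
}

A tracing service such as AWS X-Ray takes care of this propagation and of collecting the per-service measurements for each trace ID, so teams don't have to build that plumbing themselves.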