Measuring the time spent between two points in a request’s lifecycle in a distributed system sounds simple enough, right? You just log the start and end times, look at a chart, and voilà, problem solved. But when a customer raises an issue about a specific delay, and you dive into your logs, charts, and metrics to diagnose it, you quickly realize that connecting the dots across disparate systems is anything but easy. Despite advanced tooling, observability graphs, dashboards, and well-structured logging, achieving a full picture of system behavior can still be challenging.
This problem recently became apparent for my team when a high-MRR customer reported delays in receiving specific SMS messages on their test numbers. Investigating these specific messages meant diving into our various observability tools to see if we could pinpoint the timing.
At Klaviyo, we rely on a range of tools to provide visibility into our system, including:
- Splunk and Logstash for transforming and querying structured log data
- Grafana, Graphite, and StatsD for real-time metric aggregation and visualization
- Redshift for OLAP queries related to log events
- Jaeger and ClickHouse for tracing
- Sentry for exception tracking
Each of these tools is useful, but all have inherent limitations when it comes to providing a coherent picture of end-to-end execution.
Splunk and Logstash
Pro:
- Provide formatted content of structured log events with related unstructured data.
Con:
- A distinct ordering of logs is crucial for understanding how events relate to each other within the control flow.
- There needs to be a specific identifier to associate these logs across domain boundaries.
- Detailed time-related fields, including start, end, and duration timestamps, are needed to accurately capture the lifecycle of an event.
Grafana, Graphite, StatsD
Pro:
- These tools provide an easy-to-visualize view of real-time events, making it straightforward to monitor system performance.
Con:
- Cardinality becomes a concern with Graphite’s storage, particularly when it comes to rolling up aggregations. This limitation makes it impractical to create stat buckets based on unique identifiers, like request ids. Consequently, we often need to limit our stats to a higher level of granularity to maintain performance and manage storage efficiently.
Redshift
Pro:
- Using SQL as the primary query language simplifies the process of performing joins and aggregations across various unstructured log sources and enables more versatile data analysis.
Con:
- Executing queries can be slow, especially when dealing with a large volume of ingested log events.
- Redshift guarantees only one timestamp field, which records when a log was ingested; granular timings such as start, end, and duration must be added explicitly.
Sentry
Pro:
- Makes it easier to identify and troubleshoot code-level issues, bugs, and runtime errors in real time.
- Captures detailed stack traces and associated data, which helps provide context around errors.
- Tracks user sessions and shows how many users are affected by each error, which helps determine the scope of an issue.
Con:
- While effective for tracking exceptions, it does not inherently associate events across domain boundaries, making it challenging to connect events in a control flow.
- Costs can rise quickly with high error volumes.
- Sentry focuses on exceptions rather than capturing start, end, or duration timings; it is not intended for tracking the lifecycle of non-exceptional events.
We quickly realized that our existing tooling fell short in addressing what seemed like a straightforward question due to several limitations:
- Association of events: Understanding how one event associates with another and how each fits within the overall control flow.
- Granularity of timings: Obtaining distinct timings for each event, including start time, end time, and duration (calculated as end time − start time).
- Sampling strategies: Acknowledging that not all events hold the same significance. We needed to prioritize “high-priority” events while potentially reducing instrumentation for less critical ones.
For instance, when associating events, we ran into logs and events recorded in one system under identifiers specific to that domain; as those events crossed domain boundaries, the identifiers were not consistently propagated. For example, a batch_id identifies a particular batch of SMS messages in one system, while a different system logs a transmission_id for the same batch without recording the corresponding batch_id.
This discrepancy made it difficult to join events across systems, since they lacked a common attribute for association. Even when we managed to join events, our logs often failed to capture essential timing information, such as start time, end time, and duration, for events outside our domain, as well as the ordering needed to accurately reconstruct the control flow.
Jaeger for Distributed Tracing
Recently, we introduced Jaeger into our observability stack. What is Jaeger? At a very high level, it's a tool that maps the flow of requests and data as they traverse a distributed system.
Jaeger gives you the following in a trace:
- A unique trace id.
- A distinct start and stop time for the event.
- A sequential ordering of events.
- A visual breakdown of per-method execution.
- The ability to capture the input and output of a function.
- Context propagation, meaning that if other teams implement tracing in their own domains, you get insight into the execution of those systems for free.
Simply put, in Python, we add a decorator at the start of a workload and Jaeger automatically traces the entire execution flow.
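The decorator itself is internal, but a minimal sketch of the pattern, assuming OpenTelemetry's Python SDK (which can export spans to Jaeger) and an illustrative workload name, looks roughly like this:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

# Exporter and sampler wiring (e.g. a Jaeger exporter) is assumed to happen elsewhere.
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Decorating the entry point of a workload opens a root span; any traced
# function it calls shows up as a child span with its own start/end times.
@tracer.start_as_current_span("send_sms_batch")
def send_sms_batch(batch_id: str) -> None:
    ...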
Unfortunately, utilizing Jaeger does not come without its challenges. The spans generated by Jaeger can become quite large, making it impractical to trace every request.
As a result, our sample rate for traces is set to a default of 0.01% of all traffic. While this helps manage performance and storage, it comes with the downside of providing only a high-level understanding of overall system performance. Consequently, we are often unable to track specific events, such as those tied to a particular customer.
Enhancing Observability with Associated Logs and Traces
Our team recently overcame several of these limitations by implementing a few targeted changes.
First, we integrated Jaeger trace ids into our Splunk logs, allowing us to easily identify Jaeger events from any Splunk log entry. By associating these logs with Jaeger spans, we are able to efficiently look up the corresponding trace in Jaeger:
Incorporating the trace id into our Splunk logs lets us retrieve a trace directly by entering its id into the Jaeger UI. This integration turns isolated log entries into components of a broader control flow: we gain valuable context about execution paths and can inspect event data at specific moments. Debugging becomes faster because we can now see the sequence of events leading up to a given point, along with the associated state.
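As a rough illustration of how a trace id can be injected into every log line, here is a sketch using Python's standard logging module and OpenTelemetry's API (our actual integration may differ; the logger name and log format are made up):

import logging

from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    # Adds the current trace id to every record so Splunk-indexed logs can be
    # joined back to the corresponding Jaeger trace.
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s trace_id=%(trace_id)s %(message)s"))
logger = logging.getLogger("sms.delivery")
logger.addFilter(TraceIdFilter())
logger.addHandler(handler)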
Additionally, we enhanced integration between our logging and tracing tools by including log data as part of the Jaeger events. This change minimizes the need to context switch between the Jaeger UI and Splunk UI, as we can now capture any relevant information with the event in one cohesive view.
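A minimal sketch of attaching log data to the active span, again assuming OpenTelemetry's API, with illustrative attribute and event names:

from opentelemetry import trace

def transmit_batch(batch_id: str, transmission_id: str) -> None:
    span = trace.get_current_span()
    # Record identifiers from both domains, plus any useful state, on the span
    # so the Jaeger UI shows them without a hop back to Splunk.
    span.set_attribute("batch_id", batch_id)
    span.set_attribute("transmission_id", transmission_id)
    span.add_event("batch_handed_to_carrier", attributes={"transmission_id": transmission_id})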
We also took advantage of Jaeger’s capability to attach exception tracebacks, allowing us to capture detailed stack information for any encountered exceptions.
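For example, a traced function might record a failure roughly like this (a sketch; the carrier-send helper is hypothetical):

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

def deliver_message(message) -> None:
    span = trace.get_current_span()
    try:
        _send_to_carrier(message)  # hypothetical send helper
    except Exception as exc:
        # record_exception attaches the exception type, message, and traceback
        # to the span so they appear alongside the timing data in Jaeger.
        span.record_exception(exc)
        span.set_status(Status(StatusCode.ERROR))
        raise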
These enhancements provided several immediate benefits for any method annotated with tracing:
- When looking at logs in Splunk, we could jump straight to all duration metrics for the method.
- We could visualize execution flow across domain boundaries, including exception tracebacks.
- Having more state information logged with each trace made it easier to reproduce issues.
With these improvements, we can now pull up the Jaeger trace corresponding to the sending of a specific SMS in under a minute. We can quickly determine the total execution time and confidently tell our internal or external customers whether a delay originated in our systems or elsewhere. This contrasts sharply with the hours it took to answer the same question before we enhanced our tooling!
Customizing Sampling Strategies
However, with a uniform random sample of only 0.01% of requests, the likelihood of capturing the trace for any specific event was quite low. To address this limitation, we needed to enhance our sampling strategy. Certain events are inherently more critical, especially those involving high-volume customers or customers with a history of issues. Prioritizing these executions is essential for debugging potential regressions in our service level objectives (SLOs) and service level agreements (SLAs). This insight led us to focus on capturing more events from specific companies.
We decided to maintain the default sampling rate at 0.01% while allowing for customizable sampling rates based on business rules. For instance, we could implement a 50% sampling rate for company ABC.
To achieve this, we leveraged our internal hotsettings system, which enables us to dynamically configure settings stored in a Redis cluster and retrieved in real time as a hash set. This flexibility allowed us to define customized business rules through a JSON schema for any attribute we wished to trace, complete with its sampling rate. For example, if we wanted to log 50% of all events for the customer with id ABC, our schema would look like this:
{
  "company_id": {
    "ABC": 50
  }
}

With this setup, we could dynamically adjust, in real time, how we capture events based on any specified attribute in our JSON schema, allowing us to increase or decrease the volume of information collected as needed.
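Here is a simplified sketch of the sampling decision itself; the real implementation reads the schema from the hotsettings Redis store and hooks into the tracer's sampler, and the function name below is hypothetical:

import json
import random

DEFAULT_RATE = 0.0001  # the 0.01% baseline sample rate

def should_trace(attributes: dict, rules_json: str) -> bool:
    # rules_json mirrors the schema above, e.g. {"company_id": {"ABC": 50}},
    # where each value is a sampling percentage for a matching attribute value.
    rules = json.loads(rules_json)
    rate = DEFAULT_RATE
    for attribute, overrides in rules.items():
        value = attributes.get(attribute)
        if value in overrides:
            # The highest matching rule wins, so high-priority events are kept.
            rate = max(rate, overrides[value] / 100.0)
    return random.random() < rate

# Example: a request tagged with company ABC is traced roughly 50% of the time.
keep = should_trace({"company_id": "ABC"}, '{"company_id": {"ABC": 50}}')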
The Outcome: Efficient, Focused Observability
The end result:
- Configurable sample rates let us limit excessive data capture while dynamically capturing the most relevant events.
- Merging logs with traces provides a complete view of execution flow, covering start and end times, duration, input/output data, and cross-domain context propagation.
- We can now quickly locate relevant Jaeger traces based on any attribute by first querying logs in Splunk.
The combination of these benefits has cut the time it takes to answer questions about specific latency from hours of digging through logs to a few seconds!