Tracing Sucks


Distributed traces are such a compelling idea for debugging systems. What if you could fully understand the lifecycle of an operation, end to end? It’s lower fidelity than runtime profiling, but enough that every major operation is mapped out and you can understand callers. Sounds great!

In practice, it’s an awful experience: you can never get the instrumentation just right, and the cost quickly becomes a burden.

#What is Tracing?

If you’ve not implemented tracing before, let’s talk about a few terms before we dive into why you should avoid it.

  • Trace ID is the shared identifier that gets passed between services. It is a GUID.
  • Span is a structured event within a trace (a single operation). It’s unique on (trace_id, span_id). When folks say tracing they often mean recording spans.
  • Context is a set of general structured attributes.

The most important part of tracing is the propagation of the Trace ID between services. In practice though this is also the most difficult problem, and there are a few reasons for that.

First, you often don’t control the abstractions, so implementing this propagation is hard. For example, you might be using a platform that hides some of these details from you, and finding a way to inject the outbound trace ID (instrumenting the spot where it calls the network service) is already difficult. You then also have to instrument the inbound receiver to make sure it follows the continuation correctly. On top of that, you’re left with the question of when to propagate: if you have a worker system that fans out tasks, should those tasks carry the trace ID or not? It’s entirely subjective, and it depends on the way you reason about your system, which leads us to our next concern.
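To make the mechanics concrete, here’s a minimal sketch of what propagation looks like at the wire level, using the W3C Trace Context `traceparent` header format. The `inject`/`extract` function names are illustrative, not a real library API:

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex chars
  spanId: string;  // 16 lowercase hex chars
}

// Outbound: inject the current trace context into the request headers
// before calling the downstream service.
function inject(ctx: TraceContext, headers: Record<string, string>): void {
  headers["traceparent"] = `00-${ctx.traceId}-${ctx.spanId}-01`;
}

// Inbound: parse the header on the receiving service so the trace continues
// under the same trace ID. Returns null if the header is missing or invalid.
function extract(headers: Record<string, string>): TraceContext | null {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$/.exec(
    headers["traceparent"] ?? "",
  );
  return match ? { traceId: match[1], spanId: match[2] } : null;
}
```

The hard part the paragraph describes isn’t this format; it’s finding the one place in your platform where these two calls can actually be made.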

OpenTelemetry attempts auto instrumentation for you, but that auto instrumentation is often unreliable or doesn’t map to how you reason about your systems. It’s so painfully bad in the JavaScript ecosystem that I now opt out of almost all auto instrumentation and instead do it myself. The problem is that it’s very hard for vendors like us (Sentry) to ensure we have great instrumentation for every single library in the world, and OpenTelemetry’s attempt to standardize it has scarred the industry with bloated interfaces and packages. Suffice it to say it’s an ugly mess, and manual instrumentation is the only real option you have.

Bringing us to another major point, instrumentation is just plain difficult. It requires you to have trace context everywhere with something like a thread local. You have to always have the parent span ID when you capture a new span (as well as the trace ID) to ensure things are accurately represented. In the abstract this doesn’t sound too bad, and it’s probably the least broken part of any tracing abstraction, but it’s still a lot of complexity, especially when you go back to the challenge of getting continuation right in frameworks.
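In Node, the “thread local” in question is usually `AsyncLocalStorage`. A minimal sketch of what carrying trace context everywhere actually entails (the `withTrace`/`currentTrace` helpers are illustrative names, not a real API):

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

interface TraceContext {
  traceId: string;
  parentSpanId?: string;
}

const traceStorage = new AsyncLocalStorage<TraceContext>();

// Run a unit of work (e.g. a request handler) with a trace context bound
// to it; everything awaited inside fn sees the same context.
function withTrace<T>(ctx: TraceContext, fn: () => T): T {
  return traceStorage.run(ctx, fn);
}

// Anywhere inside that work, recover the context to tag a new span or log.
// Returns undefined outside of withTrace -- the failure mode the paragraph
// warns about when a framework breaks the continuation.
function currentTrace(): TraceContext | undefined {
  return traceStorage.getStore();
}
```

Every span you record then reads `currentTrace()` to pick up the trace ID and parent span ID, and must push a new context when it becomes the parent, which is exactly where the bookkeeping gets tedious.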

Lastly, the volume and cost of data are obscene, and the value you get out of it is hard to justify. The solution to all telemetry problems is sampling, but how do you sample a trace? Ask yourself that question. I don’t have the answer, because it’s also subjective and some variations are technically not feasible. “Sample on the trace ID,” you might think. Sounds great! What if the trace is active for weeks? How do you do that reliably? What if you’re not literally just randomly sampling on trace IDs but instead keying off some attribute (“enterprise customers”) and need the complete trace from those customers?
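For reference, here is what “sample on the trace ID” usually means in practice: a deterministic hash of the ID, so every service makes the same keep/drop decision independently. This is a sketch, and it illustrates the limitation above, because it has no way to express “always keep enterprise customers” without bolting on extra, non-random rules:

```typescript
// Deterministic head sampling: hash the trace ID to a number in [0, 1)
// and keep the trace if it falls under the sample rate. Same ID, same
// decision, everywhere -- so a trace is kept whole or dropped whole.
function shouldSample(traceId: string, sampleRate: number): boolean {
  // FNV-1a hash over the trace ID string.
  let hash = 0x811c9dc5;
  for (let i = 0; i < traceId.length; i++) {
    hash ^= traceId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  // Map the unsigned 32-bit hash to [0, 1) and compare to the rate.
  return (hash >>> 0) / 0x100000000 < sampleRate;
}
```

Note it also does nothing for the weeks-long trace problem: the decision is made once at the head, long before you know whether the trace turns out to be interesting.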

There’s a variety of other nitpicks and growing complexity when you try to adopt tracing, but if you’re not already deep into it, it’s just going to overload your brain (search Span Events, Span Links, or just look at the OpenTelemetry docs to get an idea). It’s exhausting.

#What do we do?

I think the best answer for most folks is to simply avoid using it, at least in the traditional sense.

Think about what you want out of traces. You’re mostly using them as structured logs. You’re mostly debugging problems. There are a few things they push that are valuable, and you can gain the benefit of those concerns while avoiding a lot of the complexity:

  1. You want semantic conventions. These add a lot of value just for consistency in your systems, and for vendors to translate meaning out of things. This is (IMO) the single most valuable thing OpenTelemetry has delivered.

  2. Trace propagation is extremely valuable, but you need to ensure your system is still usable without it. That means making an effort to map requests across systems while constraining your design so propagation is never required (this will also let you approach sampling more cheaply).

  3. Structured logs - or events - are really the primary goal here. There’s no reason you shouldn’t already be using them, and then tracing becomes as simple as adding a trace_id attribute to every log entry (and a span_id if you so desire). This isn’t a new idea, we’ve had request_id patterns for decades!
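To make point 1 concrete: semantic conventions are just an agreement on attribute names. A small sketch using a few of OpenTelemetry’s HTTP attribute keys (the key names follow the real conventions; the log shape itself is illustrative):

```typescript
// The payoff of semantic conventions: every service emits the same keys,
// so any backend can understand "this was a GET that returned 200"
// without per-team mapping tables.
const entry = {
  "http.request.method": "GET",
  "http.response.status_code": 200,
  "url.path": "/api/projects",
  message: "request finished",
};
```

You get this benefit whether or not you ever record a single span.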

Sentry’s approach actually hedges on all of this. We decided to apply trace continuation (as best we can; it’s still far from perfect) everywhere. Every dataset we curate has trace properties attached to it, which means any event (spans, logs, metrics, errors) can be “traced,” if you will, even if you don’t jump through hoops to collect spans.

That means you can completely opt out, and my advice to you is to do that, and use logs instead.

Flatten the logs, use semantic conventions, and rely on something like our trace propagation or roll your own. You do not need granular caller accuracy in 99% of the scenarios you will face in the real world, and for the remaining ones you can put an engineer (or even an LLM) on the problem and they’ll be able to sort it out.

// Convert an object with nested properties to foo.bar dot notation for logging.
function flatten(obj: object, prefix = ""): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(obj)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (value && typeof value === "object" && !Array.isArray(value)) {
      Object.assign(out, flatten(value as object, path));
    } else {
      out[path] = value;
    }
  }
  return out;
}

// However you propagate context (a thread local, AsyncLocalStorage, etc.),
// this is assumed to resolve the current trace identifiers for the entry.
declare function getTraceContext(): { trace_id?: string; span_id?: string };

/**
 * Structured log with flattened context.
 *
 * log('request finished', { user: { id: 1 } });
 */
export function log(
  message: string,
  context: Record<string, unknown>,
  level = "info",
) {
  console.log({
    ...flatten(context),
    level,
    message,
    ...getTraceContext(),
  });
}

I now follow this practice in almost all projects I spin up and it’s taken the chore out of instrumentation and made it far more usable again. The products around logs are simply better (ever tried streaming traces to your console?), it’s simpler for you to implement, and people understand how it works.

Anyways, this is my opinion. It is shaped by the numerous interactions I have with Sentry customers and my own experience. Traces (as in spans) are great if you can tolerate the challenges, but most of you don’t need them. I certainly don’t.