OTelBench - OpenTelemetry AI Benchmark


Distributed tracing requires stitching together user journeys across complex microservices rather than just writing isolated functions. We tested whether top models can correctly instrument applications with OpenTelemetry, as a check on whether they are actually ready for real-world Site Reliability Engineering tasks.
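To make the task concrete, here is roughly what this kind of instrumentation looks like: a minimal sketch using the OpenTelemetry Python SDK. The service name, span name, attribute, and `check_credentials` helper are illustrative and not taken from any benchmark task.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
from opentelemetry.trace import StatusCode

# Wire up a tracer provider that prints finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("auth-service")  # illustrative instrumentation scope name


def check_credentials(username: str) -> bool:
    # Stand-in for a real credential check (hypothetical helper).
    return username == "alice"


def login(username: str) -> bool:
    # Wrapping the operation in a span ties it into the wider request trace.
    with tracer.start_as_current_span("user.login") as span:
        span.set_attribute("user.name", username)
        ok = check_credentials(username)
        if not ok:
            span.set_status(StatusCode.ERROR, "invalid credentials")
        return ok


if __name__ == "__main__":
    login("mallory")  # emits a span with ERROR status to the console
```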

Read our blog post introducing OTelBench: Benchmarking OpenTelemetry: Can AI trace your failed login?

Model ranking #

Models ranked by their success rate in modifying code to correctly emit telemetry data. The table includes total cost and time for the full benchmark run to help contextualize performance. See our full methodology for validation details.


Instrumentation tasks #

The benchmark covers a diverse set of coding challenges across languages and runtimes including .NET, C++, Erlang, Go, Java, JavaScript, PHP, Python, Ruby, Rust, and Swift. Tasks are sorted by difficulty; those with a 0% pass rate represent currently unsolved problems in automated instrumentation.


Performance by language #


Average pass rate across all models for each programming language. Languages with more training data and mature OpenTelemetry libraries tend to be easier for AI models to instrument correctly.
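As a rough sketch of how such a per-language average can be derived, assuming a flat list of run results with `language` and `passed` fields (a hypothetical schema, not the benchmark's actual data format):

```python
from collections import defaultdict


def pass_rate_by_language(results):
    """Average pass rate per language, pooled across all models and tasks.

    `results` is a list of dicts such as
    {"model": "model-a", "language": "go", "passed": True}  (hypothetical schema).
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r["language"]] += 1
        passed[r["language"]] += int(r["passed"])
    return {lang: passed[lang] / total[lang] for lang in total}
```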

Model-task matrix #

A detailed view of which tasks each model solved or failed. This helps identify models that handle specific instrumentation patterns well, even if their overall score is lower.

Cost efficiency #

We map total API cost against success rate. The Pareto frontier (blue line) highlights the most cost-efficient models for a given performance level.
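For intuition, such a frontier can be computed with a single scan from cheapest to most expensive model, keeping only the points that improve on the best success rate seen so far. The sketch below assumes each model is a `(name, cost, success_rate)` tuple; it is not the benchmark's actual plotting code.

```python
def pareto_frontier(models):
    """Keep only models that are not beaten on both cost and success rate.

    `models`: list of (name, total_cost_usd, success_rate) tuples (hypothetical).
    Scans from cheapest to most expensive and keeps a point only if its success
    rate exceeds that of every cheaper point already kept.
    """
    ordered = sorted(models, key=lambda m: (m[1], -m[2]))  # cost asc, success desc
    frontier, best = [], float("-inf")
    for name, cost, success in ordered:
        if success > best:
            frontier.append((name, cost, success))
            best = success
    return frontier


# Example: "c" costs more than "b" yet solves fewer tasks, so it drops off.
print(pareto_frontier([("a", 2.0, 0.40), ("b", 5.0, 0.55), ("c", 6.0, 0.50)]))
# -> [('a', 2.0, 0.4), ('b', 5.0, 0.55)]
```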

Speed vs quality #

This chart compares accuracy against average generation time, helping identify models that balance solution quality with response latency.

Performance over time #

We plot model pass rates against their release dates to track performance changes over time. This timeline shows how capability on observability tasks compares across model generations.

Run it yourself #

For reproducibility, we have open-sourced the full benchmark at QuesmaOrg/otel-bench. It is built on the Harbor framework, so you can verify our findings and test new models and agents; for background, see Migrating CompileBench to Harbor: standardizing AI agent evals.

We welcome contributions of new tasks. See the repository for details.


All product names, logos, and brands (™/®) are the property of their respective owners; they're used here solely for identification and comparison, and their use does not imply affiliation, endorsement, or sponsorship.