The problem with OpenTelemetry
I understand what the author is saying, but vendor lock-in with closed-source observability platforms is a significant challenge, especially for large organizations. When you instrument hundreds or thousands of applications with a specific tool, like the Datadog Agent, disentangling from that tool becomes nearly impossible without a massive investment of engineering time. In the Platform Engineering professional services space, we see this problem frequently. Enterprises are growing tired of big observability platform lock-in, especially given how opaque Datadog makes your spend on their products, for example.
One of the promises of OTEL is that it allows organizations to replace vendor-specific agents with OTEL collectors, giving them flexibility in the choice of end observability platform. When used with an observability pipeline (such as EdgeDelta or Cribl), you can re-process collected telemetry data and send it to another platform, like Splunk, if needed. Consequently, switching from one observability platform to another becomes a bit less of a headache. Ironically, even Splunk recognizes this and has put substantial support behind the OTEL standard.
OTEL is far from perfect, and maybe some of these goals are a bit lofty, but I can say that many large organizations are adopting OTEL for these reasons.
I totally agree, I just wish we could do it in a way that doesn't try to lump every problem into the same bucket. I don't see what that achieves, personally, and I think it limits how successful the original goals of the project can be.
I'm not sure I get what the problem with OpenTelemetry as it stands is, then. I'm not familiar with the JavaScript implementation, but it seems to be modular. You can just import @opentelemetry/api and @opentelemetry/sdk-trace-web, and as far as I understand you'll get the API (annotations) and the tracing implementation, but without the exporter (OTLP). You can plug in your own exporter or even just use the API - am I missing something?
I think the only issue is that the OpenTelemetry API also includes Metrics and Logs. I just tend to ignore these parts when using OpenTelemetry.
No you're not missing anything, it is that simple. With @opentelemetry/auto-instrumentations-node you can actually get the OTLP exporters "for free" as well if you require the register script.
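For reference, a minimal sketch of that wiring (the package names are real, but the exact API shifts a bit between SDK versions, and names like 'my-app' are just placeholders):

    import { trace } from '@opentelemetry/api';
    import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
    import { ConsoleSpanExporter, SimpleSpanProcessor } from '@opentelemetry/sdk-trace-base';

    // Wire the tracing SDK to whatever exporter you like (no OTLP involved).
    const provider = new WebTracerProvider();
    provider.addSpanProcessor(new SimpleSpanProcessor(new ConsoleSpanExporter()));
    provider.register(); // becomes the global provider behind @opentelemetry/api

    // Application code only ever touches the vendor-neutral API.
    const tracer = trace.getTracer('my-app');
    const span = tracer.startSpan('do-the-thing');
    // ... work ...
    span.end();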
Logs in JS are really only significant as a wire format plus some transport glue code for popular loggers, so they can be more or less completely ignored. I do find the Otel metrics model overcomplicated compared to OpenMetrics, though that's not to say OpenMetrics doesn't have its own set of tradeoffs.
With metrics and logs, I do see the advantage of a single project driving unification of signals as opposed to treating them as silos. With one set of semantic conventions and SDK available, it'll be a lot easier to tie everything together, rather than relying on engineers to do that themselves via timestamps and glancing across 3 different tabs.
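As a small illustration of what that tie-in buys you: because the SDK knows the active trace context, joining a log line to its trace is one lookup instead of timestamp archaeology. A sketch with the JS API (`logger` stands in for whatever logging library you use; OTel's own log appenders can attach this automatically):

    import { trace } from '@opentelemetry/api';

    // Grab the current trace context and attach it to the log record, so the
    // backend can join logs and traces instead of matching by timestamp.
    const ctx = trace.getActiveSpan()?.spanContext();
    logger.info('payment failed', {
      trace_id: ctx?.traceId,
      span_id: ctx?.spanId,
    });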
I think the OP author bias here is that Sentry only really cares about the Otel instrumentation libs for tracing to power their APM product, and everything else is superfluous (hence the blog). I help build HyperDX, where we care about unifying metrics/logs along with traces, and having it all in one ecosystem is quite important for making it work well - and we do not find the rest of it superfluous! (disclaimer: obviously biased in my own ways)
Well, telemetry is defined as logs, metrics, traces... So it kinda makes sense that OTEL supports the major aspects of telemetry.
Says who? Sentry has many other types of telemetry and we’ve existed long before OTel. Who are these all knowing humans who say this is what telemetry is? Are they also going to build every collector for every kind of past and current telemetry?
The whole idea that some marketing bs has translated to technology fact is why we’re in this mess.
> Says who?
For starters, it says so right on the tin.
> Sentry has many other types of telemetry and we’ve existed long before OTel.
That's fine. Just because a vendor is missing features it doesn't mean it's bad. It just scratches a specific itch a specific way. That's ok.
> Who are these all knowing humans who say this is what telemetry is?
For starters, anyone who has any first-hand experience with telemetry, either as users or as implementers.
I seriously don't know what point you are trying to make.
Guess I am unqualified to say that folks who say metrics, logs, and traces are the definition of telemetry are mistaken.
> Guess I am unqualified to say that folks who say metrics, logs, and traces are the definition of telemetry are mistaken.
I think you're confused. A set of specialized telemetry types is telemetry, and the definition of telemetry is not limited to a subset of specialized telemetry types. You know, a cat is an animal, but the definition of what's an animal is not "a cat".
I recommend you read up on the topic. As you seem to complain about anything related to OpenTelemetry, perhaps reading the same answer from other vendors of telemetry systems might spark an epiphany.
https://www.splunk.com/en_us/blog/learn/what-is-telemetry.ht...
Are you having a laugh? Sentry doesn't set the industry standards.
I'm curious as to what you mean by "lump every problem into the same bucket"?
As a backender and half platform engineer I appreciate OTel a lot: it allows me to install the OTel instrumentation code, and the data then gets sent to wherever our platform guys and girls think is best. It allows me to only think about it once and leave the details to the people who have to maintain the infra.
I mean, sure, parts (or maybe all?) of the problems in this area have other solutions. We don't use OTel for logging, for example, because we already have Grafana + Loki, and basically everything every app outputs on stdout / stderr gets captured and can be queried. But I like the flexibility for us to fully migrate to all aspects of OTel one day if the scales tilt in another direction.
So what's your beef with all this?
(For the record, I used Sentry many times in the past and I loved it, it's a very no-BS product that I appreciated a lot -- and it adding OTel ingester / collector I viewed as something very positive.)
Yeah, it's the primary reason we used it. If OpenTelemetry's raison d'être was simply to give Datadog a reason to not bullshit their customers on pricing, it would fulfill a major need in platform services.
I don’t know what the Sentry guy is really saying - I mean you can write whatever code you want, go for it man.
But I do have to “pip uninstall sentry-sdk” in my Dockerfile because it clashes with something I didn’t author. And anyway, because it is completely open source, the flaws in OpenTelemetry for my particular use case took an hour to surmount, and vitally, I didn’t have to pay the brain damage cost most developers hate: relationships with yet another vendor.
That said I appreciate all the innovation in this space, from both Sentry and OpenTelemetry. The metrics will become the standard, and that’s great.
The problem with Not OpenTelemetry: eventually everyone is going to learn how to use Kubernetes, and the USP of many startup offerings will vanish. OpenTelemetry and its feature scope creep make perfect sense for people who know Kubernetes. Then it makes sense why you have a wire protocol, why abstraction for vendors is redundant or meaningless toil, and why PostHog and others stop supporting Kubernetes: it competes with their paid offering.
> eventually everyone is going to learn how to use Kubernetes
That seems obviously true... yet, there are so many people out there that seem unable to learn it that I don't think it's a reliable prediction.
Why necessarily Kubernetes?
For many applications, it's enough to spin up a VPS/plain Docker container, and it will run fine for many, many years, without adding the Kubernetes complexity on top.
If the application is easy to install and autoconfigures itself, it's even better than having to configure everything yourself or create multi-server Kubernetes clusters.
Why should you use Kubernetes? You most likely shouldn't.
But why should you learn Kubernetes? Because everybody else is using it, whether they should or not. Very few people in IT manage to work alone.
But anyway, every single potential OTel user needs something like Kubernetes; it comes first in the scaling requirements.
Learning k8s carries a significant cognitive cost. I've learned hundreds of technologies in my 22+ years of career and I am not about to proactively learn something that people started making certification courses for (and some are quite long, we're talking weeks of training daily). And I am definitely not proactively learning it if it's only 1% likely I'll need it. (I am now working at the second company that actually truly benefits from k8s. Out of 15+.)
I have, however, proactively learned what k8s can do and why it is useful. That's context that's very much needed, because k8s is also a huge investment and you should really know the pros and cons before jumping in. That I did.
> unable
I wouldn't equate unwillingness or not needing it to inability to learn
Why have I heard only bad things about k8s? To the point where even understanding k8s is a meme…
I think it gets used as part of 'best practices' a lot where it is not actually helpful. But it hasn't been my experience that knowing it is a meme, or even that it's a bad thing. It just gets a little silly when you see stuff like a service managed by k8s that probably made more sense as a library.
Personally I could never get into k8s. For most of my use-cases, simply getting a VPS, or having a Dockerfile is enough.
Most apps can run fine for millions or hundreds of thousands of user sessions on a $5-$50 VPS. People prematurely optimize for scale, adding a lot of complexity that only makes development slower, and by having more moving parts, there are more things that can break. Start simple. Scaling is mostly a solved problem nowadays; if you need to scale quickly, there are always solutions to do so. In the worst case, you have to scale horizontally, and if you reach the limit of horizontal scaling, either your app is inefficient or your business is already successful, so you are no longer in the "start" phase.
Yep, strong agree here, and indeed modern hardware can achieve mind-blowing stuff. I have used a $250 laptop (i5 CPU that's now 8-9 years old, 16GB RAM, SATA SSD) to stress test some of my Rust apps and I could achieve ~250K requests per second.
Sure that doesn't apply to all tech (f.ex. my favorite Elixir is much slower than this) but for most intents and purposes a plain old VPS with good backups is quite enough and always will be. Or if you are particularly paranoid like myself: have a load balancer and have 2-3 copies of your app servers, 1 DB server, and overzealous backup scripts.
A bit offtopic, but do you recommend any load balancer?
I could have another VPS as a load balancer, but what if it goes down? The load balancing services seem quite expensive, as they charge per request.
Ah, I meant use your hosting provider's load balancer + use their VPS-es. I think Digital Ocean allows this, AWS and GCP as well.
That way you're outsourcing the problem of "but what if the load balancer itself falls down?".
I usually want to load-balance between different hosting providers (so if AWS goes down, I still have DO running). From what I've seen, most of them only provide load-balancing within their local networks.
Hmmmm... Cloudflare then?
That's a good idea, I was considering it and looked at their pricing, but it was too expensive.
Also, I'm starting to dislike having everything go through Cloudflare (all my DNS). I don't dislike the company, I just don't like one entity controlling the entire internet. I don't want even more vendor-lock-in tbh.
I'm gravitating towards the failover solution, to have a VPS do health-checks on my services, if one goes down, use the Cloudflare API to update the DNS entries to the backup servers.
Yeah, in that case I'd say you're best served by having a mini PC at home, connected to a UPS (or just an old laptop, their UPS is their battery) and have your control plane be on it.
I sympathize with your reservations, it's just that at this stage of my life I am OK with having some loose ends in my infrastructure. To me CF is a good tradeoff (also I am pretty sure their free tier is quite generous? but maybe I am thinking of different kinds of services) for the moment.
But in light of the recent French court order requiring the main providers (CF included) to poison DNS in order to fight users circumventing anti-piracy measures... maybe we should just move the entire control plane to Tor, or use the regular internet but doubly/triply encrypted, or use stuff like Freenet / NNCP etc.
I'll arrive at your conundrum soon enough though. I too want complete independence, but I am slightly scared because the current internet infrastructure is constantly under attack to make it more malleable to censorship. Whatever people might think of piracy or any other scapegoat ("think of the children" seems to always be the adage), censorship is still not okay.
One solution I was thinking of is to load balance using DNS, or at least to have failure recovery using DNS (if one server goes down, update the DNS to point to the other server). I am fine with a small amount of downtime.
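A rough sketch of that failover idea (hypothetical names and record IDs; double-check the Cloudflare API request shape against their docs before relying on it):

    // Health-check the primary; if it's down, repoint the A record at the backup.
    const { ZONE_ID, RECORD_ID, CF_API_TOKEN, BACKUP_IP } = process.env as Record<string, string>;

    const healthy = await fetch('https://primary.example.com/healthz')
      .then(r => r.ok)
      .catch(() => false);

    if (!healthy) {
      await fetch(`https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/dns_records/${RECORD_ID}`, {
        method: 'PUT',
        headers: {
          'Authorization': `Bearer ${CF_API_TOKEN}`,
          'Content-Type': 'application/json',
        },
        // Short TTL so clients pick up the switch quickly.
        body: JSON.stringify({ type: 'A', name: 'app.example.com', content: BACKUP_IP, ttl: 60 }),
      });
    }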
The Sentry SDK is open source and easy to contribute to in my experience.
Yeah but who wants to contribute to an SDK for a service that you need to pay for? That would be like if Oracle DB were open to contributions.
Sentry is self hostable. https://develop.sentry.dev/self-hosted/
Sentry is technically self-hostable, but they provide no deployment guidance beyond running the giant blob of services/microservices (including instances of postgres, redis, memcache, clickhouse, and kafka) as a single docker-compose thing. I get why they do this and think it's totally reasonable of them, but Sentry is a very complicated piece of software and takes substantially more work IME to both get up and running and maintain compared to other open-source self-hosted observability/monitoring/telemetry software I've had the pleasure of working with.
Our Linux devops engineer, who had not used Sentry before, set up a self-hosted Sentry in a day.
> Our Linux devops engineer, who had not used Sentry before, set up a self-hosted Sentry in a day.
I've also spent 1 hour setting up a Kubernetes cluster on a set of desktops I had lying around. This does not mean Kubernetes is simple or easy.
It's easy to set up but painful to keep running, and it's hard to back up.
Yeah, it works for a time, but they don't support on-premise versions and they don't offer a Helm chart install; it's all community-based.
I tried it for well over a year, and there are so many moving parts and so many "best guesses" from the community that we had to rip it out. There are a lot of components: sentry, sentry-relay, snuba, celery, redis, clickhouse, zookeeper (for clickhouse), kafka, zookeeper (for kafka), maybe even elasticsearch for good measure. It did work for a time, but there are so many moving parts requiring care and feeding that it would inevitably break down at some point.
Problem is I can't ship data to their SaaS version because we have PHI and our contracts forbid it, even if scrubbed, so I had to settle on OTEL.
Day 1 vs day 2. That’s why the SaaS version exists.
PostHog and Rudderstack say the same things. They're not really self hosted. But that's the rub, if someone authored a good operator for Sentry, Sentry as a commercial service would cease to exist. That's not good in my opinion, they do real innovative stuff.
It's tough. We should have never done "give away the software, charge for hosting." The market, in every sense, has been telling you that you're really building value for AWS, for years, by doing that.
But it's not FOSS. It's using the BSL or FSL or whatever.
Although I do not like those licences, I would not care so much about the two years until it goes FOSS. Before all this rush of development, RRDTool and OpenTSDB were so slow; this whole thing seems more ideological than substantive criticism. Going down the licence rabbit hole to criticise the original argument seems like a classic strawman.
Just want to say I appreciate your stance.
(also no one should feel like they have to contribute to our SDKs, but please file a ticket if something's fucked up and we'll deal w/ it)
I was supporting a variation in my head of the "Yeah but who wants to contribute to an SDK for a service that you need to pay for?" claim.
You can self-host for free, so maybe @hahn-kev doesn't mind contributing to the SDK now.
For me, I refuse to contribute to an open-source SDK for a non-foss product. And I refuse to self-host a non-foss product.
Personally, I don't care if non-FOSS licenses speed up development. So yeah, in my case it's ideological.
https://glitchtip.com/ is an Open Source form of Sentry created after they went closed source, if you are interested in something like that.
If I'm using something for employment, and the employer would get value out of it getting a bug fixed, why would I not fix it?
Sentry provides a great hosted service. You can self host if you like, but it’s nicer to let them do it
I've been using GlitchTip https://glitchtip.com with the Sentry SDKs and I couldn't be happier. Completely self-hosted, literally just the container and a db, requires zero attention.
I think all of us agree that OpenTelemetry's end goal of making observability vendor-neutral is the future, and inevitable. We can complain about it being hard to get started with, bloated, etc., but the value it provides is clear, especially when you are paying $$$ to a vendor and are stuck with it.
Open standards also open up a lot of use cases and startups. SigNoz, TraceTest, TraceLoop, Signadot: all are very interesting projects which OpenTelemetry enabled.
The majority of the problem seems to be that Sentry is not able to provide its Sentry-like features by adopting OTel. Getting involved at the design phase could have helped shape the project so that it considered your use cases. The maintainers have never been opposed to such contributions, AFAIK.
Regarding limiting OTel to just tracing: that would not be sufficient today, as teams want a single platform for all observability rather than different tools for different signals.
I have seen hundreds of companies switch to OpenTelemetry and save costs by being able to choose the best vendor supporting their use cases.
Lack of docs, the learning curve, etc. are just temporary things that can happen with any big project and should be fixed. Also, OTel maintainers and teams have always been seeking help in improving docs, showcasing use cases, etc. If everyone cares enough for the bigger picture, the community and existing vendors should get more involved in improving things rather than just complaining.
> If everyone cares enough for the bigger picture, the community and existing vendors should get more involved in improving things rather than just complaining.
Speaking as one of these maintainers, I would absolutely love it if even half of the vendors who depend heavily on OTel contributed back to the project that enables their business.
My own employer has done this for years now (including hiring people specifically so they can continue to contribute), and we're only at about 200 employees total. I like to imagine how complete the project would feel if Google or AWS contributed to the same degree relative to the size of their business units that depend on OTel.
IMO this boils down to how one gets paid to understand or misunderstand something. A telemetry provider/founder is being commoditized by an open specification whose development they do not participate in (implied by the post saying the author doesn't know anyone on the spec committees). No surprise here.
Of course implementing a spec from the provider point of view can be difficult. And also take a look at all the names of the OTEL community and notice that Sentry is not there: https://github.com/open-telemetry/community/blob/86941073816.... This really isn't news. I'd guess that a Sentry customer should just be able to use the OTEL API and could just configure a proprietary Sentry exporter, for all their compute nodes, if Sentry has some superior way of collecting and managing telemetry.
IMO most library authors do not have to worry about annotation naming or anything like that mentioned in the post. Just use the OTEL API for logs, or use a logging API where there is an OTEL exporter, and whoever is integrating your code will take care of annotating spans. Propagating span IDs is the job of "RPC" libraries, not general code authors. Your URL fetch library should know how to propagate the span ID provided that it also uses the OTEL API.
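For instance, an HTTP client library that cooperates with OTel only needs something along these lines (a sketch using the standard propagation API; the downstream URL is made up):

    import { context, propagation } from '@opentelemetry/api';

    // Inject the current trace context (the W3C traceparent header) into the
    // outgoing request; the server on the other side extracts it the same way.
    const headers: Record<string, string> = {};
    propagation.inject(context.active(), headers);
    await fetch('https://downstream.example.com/api', { headers });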
It is the same as using something like Docker containers on a serverless platform. You really don't need to know that your code is actually being deployed on Kubernetes. Using the common Docker interface is what matters.
> IMO this boils down to how one gets paid to understand or misunderstand something.
I completely agree. The most charitable interpretation of this blog post is that the blogger genuinely fails to understand the basics of the problem domain, or, worst-case scenario, they are trying to shitpost away the need for features that are well supported by a community-driven standard like OpenTelemetry.
Author here.
Y'all realize we'd just make more money if everyone had better instrumentation and we could spend less time on it, and more time on the product, right?
There is no conspiracy. It’s simple math and reasoning. We don’t compete with most otel consumers.
I don’t know how you could read what I posted and think sentry believes otel is a threat, let alone from the fact that we just migrated our JS SDK to run off it.
I use self-hosted Sentry and started with OTel under GCloud and Grafana, and everything worked. Perfectly even, with low effort, in .NET.
However, with Sentry it's still a pain, and the visualization in Sentry is kinda weird, since it goes beyond tracing.
And since Sentry itself has no OTel endpoint, it is also really hard to do things like tail sampling.
Can you give more context on what you mean around sampling? What does an OTLP endpoint have to do with it?
Multiple services that should be sampled based on latency. This only works if you have something like https://github.com/open-telemetry/opentelemetry-collector-co... or https://grafana.com/docs/tempo/latest/configuration/grafana-... in between.
We do use it quite extensively despite the overhead, because we get lots of insights when we have a slow-running endpoint.
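For reference, the latency policy in the collector's tail_sampling processor looks roughly like this (from memory; check the contrib docs for the exact fields):

    processors:
      tail_sampling:
        decision_wait: 10s          # buffer spans until the trace is likely complete
        policies:
          - name: keep-slow-traces
            type: latency
            latency:
              threshold_ms: 500     # keep any trace slower than 500 ms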
I think that a number of Observability providers are looking at how they can add features and value to parts of monitoring that OTel effectively commoditizes. Thinking of the tail-based sampling implemented at Honeycomb for APM, or synthetic monitoring by my own team at Checkly.
"In 2015 Armin and I built a spec for Distributed Tracing. Its not a hard problem, it just requires an immense amount of coordination and effort." This to me feels like a nice glass of orange juice after brushing my teeth. The spec on DT is very easy, but the implementation is very very hard. The fact that OTel has nurtured a vast array of libraries to aid in context propagation is a huge acheivement, and saying 'This would all work fine if everyone everywhere adopted Sentry' is... laughable.
Totally outside the O11y space, OTel context propagation is an intensely useful feature because of how widespread it is. See Signadot implementing their smart test routing with OpenTelemetry: https://www.signadot.com/blog/scaling-environments-with-open...
An argument that OpenTelemetry is somehow 'too big' is an example of motivated reasoning. I can understand that A Guy Who Makes Money If You Use Sentry dislikes that people are using OTel libraries to solve similar problems.
Context propagation and distributed tracing are cool OTel features! But they are not the only thing OTel should be doing. OpenTelemetry instrumentation libraries can do a lot on their own, a friend of mine made massive savings in compute efficiency with the NodeJS OTel library: https://www.checklyhq.com/blog/coralogix-and-opentelemetry-o...
Author here.
OpenTelemetry is not competitive to us (it doesn’t do what we do in plurality), and we specifically want to see the open tracing goals succeed.
I was pretty clear about that in the post though.
I think that it's disingenuous to say OpenTelemetry and Sentry aren't in competition. I think it would be good news for Sentry if DT were split from the project, and instrumentation and performance monitoring weren't commoditized by broad adoption of those parts of the OpenTelemetry project.
I think you, the author, stand to benefit directly from a breakup of OpenTelemetry, and a refusal to acknowledge your own bias is problematic when your piece starts with a request to 'look objectively.'
We just rewrote our most heavily used SDK to run on top of OTel. What do we gain from it failing?
We also make most of our revenue from errors which don’t have an open protocol implementation outside of our own.
Your error stuff is pretty damn cool btw.
Personally, I like OpenTelemetry: a nice standardised approach. I just wish the vendors had better support for the semantic conventions defined for a wide variety of traces.
I quite like the idea of only needing to change one small piece of code to switch OTel exporters instead of swapping out a vendor trace SDK.
My main gripe with OpenTelemetry is that I don't fully understand what the exact difference is between (trace) events and log records.
> My main gripe with OpenTelemetry is that I don't fully understand what the exact difference is between (trace) events and log records.
This is my main gripe too. I don't understand why {traces, logs, metrics} are not just different abstractions built on top of "events" (blobs of data your application ships off to some set of central locations). I don't understand why the opentelemetry collector forces me to re-implement the same settings for all of them and import separate libraries that all seem to do the same thing by default. Besides sdks and processors, I don't understand the need for these abstractions to persist throughout the pipeline. I'm running one collector, so why do I need to specify where my collector endpoint is 3 different times? Why do I need to specify that I want my blobs batched 3 different times? What's the point of having opentelemetry be one project at all?
My guess is this is just because opentelemetry started as a tracing project, and then became a logs and metrics project later. If it had started as a logging project, things would probably make more sense.
> This is my main gripe too. I don't understand why {traces, logs, metrics} are not just different abstractions built on top of "events" (blobs of data your application ships off to some set of central locations).
By design, they cannot be abstractions of the single concept. For example, logs have a hard requirement on preserving sequential order and session and emitting strings, whereas metrics are aggregated and sampled and dropped arbitrarily and consist of single discrete values. Logs can store open-ended data, and thus need to comply with tighter data protection regulations. Traces often track a very specific set of generic events, whereas there are whole classes of metrics that serve entirely different purposes.
Just because you can squint hard enough to only see events being emitted, that does not mean all event types can or should be treated the same.
> Just because you can squint hard enough to only see events being emitted
If you squint hard enough you can fool yourself into thinking all metrics have the same availability requirements. It’s not the case. There are plenty of time series data metrics where arbitrarily dropping them or aggregating them would throw off your alerting entirely.
Indeed one would have to squint to the point of blindness.
Logs are single point in time, flat, linear sequence, never dropped (at best you'd collapse sequences of identical, repeated logs). Think dmesg, syslog, systemd journald/journalctl.
Metrics are statistical numeric data, which can be series, average, histogram, bucket... aggregation/reduction can be done on the fly/before leaving the observed thing. Some can be dropped, but it is important that dropping anything stays statistically meaningful.
Spans are a duration in time representing some operation, with metadata (numeric, stringy, even structured) attached pertaining to that operation. Spans have a parent, forming a tree, which forms a trace. Spans can be deduped and/or sampled, with specific occurrences forcefully kept (e.g. a 500 error) or dropped (e.g. healthchecks).
They are fundamentally different (technical) primitives a.k.a (functional) tools to observe different things and serve different goals.
Right, the point I’m making is logs, metrics, traces, these concepts are views of data, with a pretty hazy relationship to the shape of the data itself or the handling requirements. Any assumption you make about them as a category (logs are unstructured, traces are sampled, metrics can be aggregated) is wrong nearly as much as it’s right.
> Right, the point I’m making is logs, metrics, traces, these concepts are views of data (...)
Not really. Logs are fundamentally different than operational metrics, which are fundamentally different than business/behavioral metrics, which are fundamentally different than traces, etc etc etc.
This is not a matter of "view". It's the result of completely different system requirements. They are emitted differently, they are processed/aggregated differently, they are stored differently, they are consumed differently.
Even within business metrics types, which is already a specialized type of metrics, you have fundamentally different system requirements. Click stream metrics mix traits of tracing with logging and metrics, and have very specific requirements regarding data protection.
They are all distinct observability features. They are not the same. At all. This is not up for debate.
> Click stream metrics mix traits of tracing with logging and metrics
This sounds like you are admitting my point? My point is not “there is no difference between anything” my point is that the 3 buckets of “metrics, logs, traces” are neither all-encompassing in terms of types of telemetry one might feasibly want to emit, nor are they mutually exclusive. Here is perhaps a better writeup of what I mean
https://open.substack.com/pub/isburmistrov/p/all-you-need-is...
> If you squint hard enough you can fool yourself into thinking all metrics have the same availability requirements.
I'm sorry, I have no idea what point you tried to make.
Something I mention any time I'm introducing OpenTelemetry is that it's an unfinished project, a huge missing piece being the unifying abstractions between those signals.
In part this is a very practical decision: most people already have pretty good tools for their logs, and have struggled to get tracing working. So it's better to work on tools for measuring and sending traces, and just let people export their current log stream via the OpenTelemetry collector.
Notably the OTel docs acknowledge this mismatch between current implementation and design goals: https://opentelemetry.io/docs/specs/otel/logs/#limitations-o...
If you're using OTLP, the SDKs only require you to specify the endpoint once; the signal-specific settings are for when you want to send signals to different places.
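Concretely, with the standard OTLP environment variables (as I read the spec; the hostnames are placeholders):

    # one endpoint for every signal; the SDK appends /v1/traces, /v1/metrics, /v1/logs
    export OTEL_EXPORTER_OTLP_ENDPOINT=http://my-collector:4318

    # only needed if one signal should go somewhere else (full path required here)
    export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://other-collector:4318/v1/traces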
The way you process/modify metrics vs logs vs traces are usually sufficiently different that there's not much point in having a unified event model if you're going to need a bunch of conditions to separate and process them differently. Of course, you can still use only one source (logs or events) and derive the other 2 from that, though that rarely scales well.
Plus, the backends that you can use to store/visualize the data usually are optimized for specific signals anyways.
Well, only when you use the OTLP protocol and the otel-collector. In other cases you would need a (span) exporter that sends to multiple targets at the same time. But yeah, the otel-collector would be the best approach to achieve this.
It's a bit confusing but here's my best attempt to explain it:
- Trace events (span events) are intended to be structured events and can have semantic attributes behind them, similar to how spans have semantic attributes. They're great if your team is all bought in on tracing as an organization. They colocate your span events with the parent span. In practice they have poor searchability/indexing in many tools, so they should only be used when you expect to discover the span first and drill into its events from there (e.g. debug info that is only useful to figure out why a span was very slow, where you're okay with it not being easily searchable).
- Log records are plain old logs, they should be structured, but don't have to be, and there isn't a high expectation of structured data, much less semantic attributes. Logs can be easily adopted without buying into tracing.
- Events API, this is an experimental part of Otel, but is intended to be an API that emits logs with the expectation of semantic conventions (and therefore is also structured). Afaik end users are not the intended audience of this API.
Many teams fall along the spectrum of logs vs tracing which is why there's options to do things multiple ways. My personal take is that log records are going to continue to be more flexible than span events as an end-user given the state of current tools.
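To make the first two concrete, roughly (JS APIs from memory, and the logs API is still marked experimental; names like 'checkout' are placeholders):

    import { trace } from '@opentelemetry/api';
    import { logs, SeverityNumber } from '@opentelemetry/api-logs';

    // A span event: hangs off the active span, so you find the span first and
    // then drill into its events.
    trace.getActiveSpan()?.addEvent('cache.miss', { 'cache.key': 'user:42' });

    // A log record: a plain (ideally structured) log, correlated to the active
    // trace by the SDK but searchable on its own.
    logs.getLogger('checkout').emit({
      severityNumber: SeverityNumber.WARN,
      severityText: 'WARN',
      body: 'cache miss while loading user',
      attributes: { 'cache.key': 'user:42' },
    });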
Disclaimer: I help build hyperdx, we're oss, otel-based observability and we've made product decisions based on the above opinions.
Can you give an example of the missing semantic conventions?
I have been trying to find an equivalent for `tracing` first in Python and this week in TypeScript/JavaScript. At my work I created an internal post called "Better Python Logging? Tracing for Python?" that basically asks this question. OpenTelemetry was also what I looked at and since I have looked at other tooling.
It is hard to explain how convenient `tracing` is in Rust and why I sorely miss it elsewhere. The simple part of adding context to logs can be solved in a myriad of ways, yet all boil down to a similar "span-like" approach. I'm very interested in helping bring what `tracing` offers to other programming communities.
It very likely is worth having some people from the space involved, possibly from the tracing crate itself.
We’ll fund solving this as long as the committees agree with the goal. We just want standard tracing implementations.
(Speaking on behalf of Sentry)
I have surveyed this landscape for a number of years, though I'm not involved enough to have strong opinions. We're running a lot of the Prometheus ecosystem and even some OpenTelemetry stacks across customers. OpenTelemetry does seem like one of these projects with an ever-expanding scope. That makes it hard to integrate the parts you like and keep things lightweight, both computationally and mentally, without having to go all-in.
It's no longer a matter of "hey, we'll include this little library or protocol instead of rolling our own, so we can hope to be compatible with a bunch of other industry-standard software." It's a large stack with an ever-evolving spec. You have to develop your applications and infrastructure around it. It's very seductive to roll your own simpler solution.
I appreciate it's not easy to build industry-wide consensus across vendors, platforms and programming languages. But be careful with projects that fail to capture developer mindshare.
Could you clarify your reservations further, please? As a programmer, I appreciate being able to just include a library in my project, give it a set of OTLP settings (host, port, URI) and move on.
What difficulties did opting into OTel give you?
The main interest I've seen in OTel from Android engineers has been driven by concerns around vendor lock-in. Backend/devops in their organisations are typically using OTel tooling already & want to see all telemetry in one place.
From this perspective it doesn't matter if the OTel SDK comes bundled with a bunch of unnecessary code or version conflicts as is suggested in the article. The whole point is to regain control over telemetry & avoid paying $$$ to an ambivalent vendor.
FWIW, I don't think the OTel implementation for mobile is perfect - a lot of the code was originally written with backend JVM apps in mind & that can cause friction. However, I'm fairly optimistic those pain points will get fixed as more folks converge on this standard.
Disclaimer: I work at a Sentry competitor
At the risk of hijacking the comments, I've been trying to use OTel recently to debug performance of a complex webpage with lots of async sibling spans, and finding it very very difficult to identify the critical path / bottlenecks.
There are no causal relationships between sibling spans. I think in theory "span links" solves this, but afaict this is not a widely used feature in SDKs or UI viewers.
(I wrote about this here https://github.com/open-telemetry/opentelemetry-specificatio...)
I don't believe this is a solved problem, and it's been around since OpenTracing days[0]. I do not think that the Span links, as they are currently defined, would be the best place to do this, but maybe Span links are extended to support this in the future. Right now Span links are mostly used to correlate spans causally _across different traces_ whereas as you point out there are cases where you want correlation _within a trace_.
[0]: https://github.com/opentracing/specification/issues/142
I was underwhelmed by the max size for spans before they get rejected. Our app was about an order of magnitude too complex for OTEL to handle.
Reworking our code to support spans made our stack traces harder to read and in the end we turned the whole thing off anyway. Worse than doing nothing.
As per the spec there are no formal limits on size, although in practice there can be at several levels:
- Your SDK's exporter
- Collector processors and general memory limitations based on deployment
- Telemetry backend (this is usually the one that hits people)
Do you know where the source of this rejection happened? My guess would be backend, since some will (surprisingly) have rather small limits on spans and span attributes.
Sounds like a knob you can turn, from my practice at least.
A recent example of OTel confusion.
I could not for the life of me get the Python integration to send traces to a collector. Same URL, same setup, same API key as for Node.js and Go.
Turns out the Python SDK expects a URL-encoded header value, e.g. “Bearer%20somekey”, whereas all the other SDKs just accept a string with a whitespace.
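Concretely, what tripped me up looked something like this (paraphrasing from memory):

    # accepted by the Node.js and Go SDKs
    export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer somekey"

    # what the Python SDK wanted: a percent-encoded value
    export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer%20somekey"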
The whole split between HTTP, protobuf over HTTP and GRPC is also massively confusing.
The silent failure policy of OTEL makes flames shoot out of the top of my head.
We had to use Wireshark to identify a super nasty bug in the “JavaScript” (but actually TypeScript, despite being called opentelemetry-js) implementation.
And OTEL is largely unsuitable for short lived processes like CLIs, CI/CD. And I would wager the same holds for FaaS (Lambda).
In the end I prefer the network topology of StatsD, which is what we were migrating from. Let the collector do ALL of the bookkeeping instead of faffing about. OTEL is actively hostile to process-per-thread programming languages. If I had it to do over again I’d look at the StatsD->Prometheus integrations, and the StatsD extensions that support tagging.
> And OTEL is largely unsuitable for short lived processes like CLIs, CI/CD. And I would wager the same holds for FaaS (Lambda).
Not necessarily true; f.ex. in one of my hobby Golang projects I found out that you can cleanly shut down the OTel collector so it flushes its backlog of traces / metrics / logs, so I was able to get telemetry readings even for CLI tool invocations that lasted 5-10 secs (connect to servers, get data, operate on it, put it someplace else, quit).
But now that you mention it, it would be nasty if that's not the default behavior indeed.
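The same pattern in Node, roughly (a sketch assuming the stock NodeSDK wrapper; runCliCommand is a stand-in for the actual work):

    import { NodeSDK } from '@opentelemetry/sdk-node';

    const sdk = new NodeSDK({ /* exporters, instrumentations, ... */ });
    sdk.start();

    try {
      await runCliCommand();   // the short-lived work: connect, fetch, transform, write
    } finally {
      await sdk.shutdown();    // flushes buffered traces/metrics/logs before the process exits
    }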
> OTEL is actively hostile to process-per-thread programming languages
Can you explain why, please?
Yeah. And OTel actually has pretty nice debugging. You just need to set the right environment variable. But in prod it will blow up your logs.
Sounds like a problem with the Python sdk
Well actually. They (python SDK maintainers) argue their implementation is the correct one according to the spec. See this issue thread for example.
https://github.com/open-telemetry/opentelemetry-specificatio...
There are more. This is a symptom of a how hard it is to dive into Otel due to its surface area being so big.
> Well actually. They (python SDK maintainers) argue their implementation is the correct one according to the spec. See this issue thread for example.
The comment section of that issue gives off contrarian vibes. Apparently the problem is that the Python SDK maintainers refuse to support a use case that virtually all other SDKs support. There are some weasel words that try to convey the idea that half the SDKs side with Python, while in reality the ones that made the same choice as the Python SDK actually support all scenarios.
From the looks of it, the Python SDK maintainers are purposely making a mountain out of a molehill that could be levelled with a single commit with a single line of code.
I guess you word it better than I did.
As a user it feels very weird to wade into threads like this to find a solution to your problem.
The power of Otel is it being an open standard. But the practice shows the implementation of that standard / spec leads to all kinds of issues and fiefdoms
It resonates. As an intern I had to add OTEL to a Python project and I had to spend a lot of time in the docs to understand the concepts and implementation. Also, the Python impl has a lot of global state that makes it hard to use properly imo.
> As an intern I had to ${DO_SOME_PROJECT} and I had to spend a lot of time in the docs to understand the concepts and implementation
That sounds like every single run-of-the-mill internship.
That's fair, but I'll say that the time and the number of concepts you have to deal with before getting into the code, per the docs, is quite big, and I think the criticism in the article is warranted.
> That's fair, but I'll say that the time and the number of concepts you have to deal with before getting into the code, per the docs, is quite big (...)
That's ok. It's the same for any domain you're just starting to learn. If anyone wishes to, say, onboard onto Cloudwatch or Azure Application Insights they will say exactly the same thing.
Sounds like you have been overzealous about it. I just read an intro article or two -- maximum 15 minutes in total -- and I had an idea about trace spans vs. logging vs. metrics and I was on my merry way 1-2h later. We're talking the first time I needed OTel.
I am not humble-bragging at all, in fact I was quite dim back then (health problems reduced my focus and clarity of thought), I am saying that over-preparing can make everything seem untenable. So just don't.
Tracing requires keeping mappings of tracing identifiers per request. I don't know how you do that without global state, unless you want the tracing identifiers to pollute your own internal APIs everywhere.
Many frameworks have the idea of a "context" for this, which holds per-request state and follows your request through the system. Functions that don't care about the context just pass it on to whatever they call.
I think Go was smart to make this concept part of the standard library, as it encouraged frameworks to adopt it as well.
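In the OTel JS API the same idea shows up as the active context; roughly (a sketch, with doBusinessLogic standing in for your own code):

    import { trace } from '@opentelemetry/api';

    const tracer = trace.getTracer('http-server');

    async function handleRequest(req: unknown) {
      // The span is "active" for everything called inside the callback, so
      // nothing downstream needs a trace ID threaded through its signature.
      return tracer.startActiveSpan('handle-request', async (span) => {
        try {
          return await doBusinessLogic(req); // can call trace.getActiveSpan() if it cares
        } finally {
          span.end();
        }
      });
    }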
Big agree on Golang. I had colleagues working with 5-6 JS microservices and they really wanted to keep span IDs so they could have sub-spans and unify semantically related requests (they also included background job processing), and they had to bring up a Redis container just to keep mappings between pieces of data every microservice has access to (e.g. order_id) and the actual span ID. It worked well in the end, but it all had to be taken care of manually.
I understand that but if you look at the Python implementation (or at least as it was 1-2 years ago), you have a lot of god objects that hack __new__ which leads to hidden flows when you create new instances of tracers for example. I'm not saying I have a better idea but when you put that together with the docs and the (at the time) very bare examples, it's just annoying.
100% agree.
Every time I tried to use OT I was reading the doc and whispering "but, why? I only need...".
Yeah I was going down this path for a side project I was getting going and spent a couple days of after-work time exploring how to get just some basic traces in OT and realized it was much more than I needed or cared about.
There is a huge hole in using spans as they are specified. Without separating the start of a span from the end of a span, you can never see things that never complete, fail hard enough to not close the span, or travel through queues. This is a compromise they made because typical storage systems for tracing aren't really good enough to stitch them all back together quickly. Everyone should be sending events and stitching it all together to create the view. But instead we get a least-common-denominator solution.
As a contributor to (and consumer of) OpenTelemetry, I think critique and feedback is most welcome - and sorely needed.
But this ain’t it. In the opening paragraphs the author dismisses the hardest parts of the problem (presumably because they are human problems, which engineers tend to ignore), and betrays a complete lack of interest in understanding why things ended up this way. It also seems they’ve completely misunderstood the API/SDK split in its entirety - because they argue for having such a split. It’s there - that’s exactly what exists!
And it goes on and on. I think it’s fair to critique OpenTelemetry; it can be really confusing. The blog post is evidence of that, certainly. But really it just reads like someone who got frustrated that they didn’t understand how something worked - and so instead of figuring it out, they’ve decided that it’s just hot garbage. I wish I could say this was unusual amongst engineers, but it isn’t.
Author here.
That’s kind of making my point for me fwiw. It’s too complicated. I consider myself a product person so this is my version of that lens on the problem.
I’m not dismissing the people problem at all - I actually am trying to suggest the technology problem is the easier part (eg a basic spec). Getting it implemented, making it easy to understand, etc is where I see it struggling right now.
Aside this is not just my feedback, it’s a synthesis of what I’m hearing (but also what I believe).
No dog in the fight here, but… if you're saying that one of the top guys at a major observability shop didn't understand OpenTelemetry, then that says much more about OTel than it does about his skills or effort to understand. After all, his main point is that it's complex and overengineered, which is the key takeaway for curious bystanders like me, whether every detail is technically correct or not.
> it just reads like someone who […] didn’t understand how something worked - and so instead of figuring it out, they’ve decided that it’s just hot garbage.
And what about average developers asked to “add telemetry” to their apps and libraries? Their patience will be much lower than that.
Not necessarily defending the content (frankly it should have had more examples), but I relate to the sentiment. As a developer, I need framework providers to make sane design decisions with minimal api surface, otherwise I’d rather build something bespoke or just not care.
OTel is very easy to add. I've added it to several Go projects. For some frameworks like .NET you can do it automatically. The harder/more annoying part is setting up a viewer/collector like Jaeger. I've done that too, but just in memory, and it fills up quickly.
For my small-scale projects, Openobserve.ai has been super helpful. It ships as a single binary and (in a non-HA setup) saves traces/logs/metrics to disk. I just set it up as a systemd service and start sending telemetry via localhost. Code at https://github.com/bbkane/shovel_ansible/
Same, I love OpenObserve and shill for it a lot. Maybe more advanced users have more use cases and I am blind to them, but it serves all three main pillars of OTel (tracing spans, metrics, logs) very well for my purposes.
Thanks for this. I hadn't heard of openobserve before. It looks awesome.
> After all, his main point is that it’s complex and overengineered, which is the key takeaway for curious bystanders like me, whether every detail is technically correct or not.
This is a gross over-simplification that will leave you with a very skewed view of reality. As a programmer I only ever had to add a library, configure the OTLP endpoint details (host, port, URI, sometimes query parameters as well) and it was done.
It might be "complex and overengineered" if you want to contribute to the OTel libraries but as a programmer-user you are seeing practically none of it. And I would also challenge the "complex and overengineered" part but for now I am not informed enough to do it.
indeed, it just sounds like they're complaining they don't have a seat at the table...
Otel is indeed quite complex. And the docs are not meant for quick wins...
Otelbin [0] has helped me quite a bit in configuring and making sense of it, and getting stuff done.
That looks pretty cool! OpenTelemetry Collector configuration files are pretty confusing. I do like the collector, though. Makes it easy to send a subset of your telemetry to trusted partners.
Anyone else finding this very difficult to read? I’d really recommend feeding this through a grammar checker, because poor grammar betrays unclear thinking.
So you’re saying it makes my thinking more clear? :)
This is what happens when you use a tool designed for authoring code to also author content.
"betrays" means to expose, to be evidence of, particularly unintentionally.
i.e. "poor grammar unintentionally exposed unclear thinking"
I think there are two separate perspectives. For developers, OpenTelemetry is a clear win: high-quality, vendor-agnostic instrumentation backed by reputable orgs. I instrumented many business-critical repos at my company (a major customer support SaaS) with OTEL traces in Ruby, Python, and JS. Not once was I confused/blocked/distracted by the presence of logs/metrics in the spec. However, I can't say much from the perspective of an observability vendor trying to be fully compatible with the OTEL spec, including metrics/logs.

The article mentions customers having issues with using tracing instrumentation; it would've been great to back this up with corresponding GitHub issues explaining the problems. Based on the presented JS snippet (just my guess), maybe the issue is with async code, where the "span.operation" span gets immediately closed w/o waiting for the doTheThing()? Yeah, that's tricky in JS given its async primitives. We ended up just maintaining a global reference to the currently active span and patching some OTEL packages to respect that.

FWIW Sentry JS instrumentation IS really good and practical. It would have been great if Sentry could donate/contribute/influence the OTEL JS SIG with specific improvements; that would be a win-win. As much as I hate DataCanine pricing, they did effectively donate their Ruby tracing instrumentation to OTEL, which I think is one of the best ones out there.
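To spell out the async pitfall I'm guessing at (a sketch, not the article's actual snippet; `tracer` and doTheThing are assumed to be in scope):

    // The pitfall: the span ends before the async work it is supposed to measure.
    const span = tracer.startSpan('span.operation');
    doTheThing();   // returns a promise that nobody awaits
    span.end();     // closes immediately, duration ~0ms

    // What you usually want instead:
    await tracer.startActiveSpan('span.operation', async (span) => {
      try {
        await doTheThing();
      } finally {
        span.end();
      }
    });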
This seems to be more of a branding problem than anything.
OP (rightfully) complains that there is a mismatch between what they (can) advertise ("We support OTEL") and what they are actually providing to the user. I have the same pain point from the consumer side, where I have to trial multiple tools and service to figure out which of them actually supports the OTEL feature set I care about.
I feel like this could be solved by introducing better branding that has a clearly defined scope of features inside the project (like e.g. "OTEL Tracing") which can serve as a direct signifier to customers about what feature set can be expected.
Yes! It's a bit deeper than that, but it's fundamentally a packaging issue.
OTel is flawed for sure, but I don't understand the stance against metrics and logs. Traces are inherently sampled unless you're lighting all your money on fire, or operating at so small a scale that these decisions have no real impact. There are kinds of metrics and logs which you always want to emit because they're mission-critical in some way. Is this a Sentry-specific thing? Does it just collapse these three kinds of information into a single thing called a "trace"?
I mean, when you're the one selling the gas to light that money on fire you have a vested interest in keeping it that way right?
I do agree that logging and spans are very similar, but I disagree that logs are just spans because they aren't exactly the same.
I also agree that you can collect all metrics from spans and, in fact, it might be a better way to tackle it. But it's just not feasible to do so monetarily so you do need to have some sort of collection step closer to the metric producers.
What I do agree with is that the terminology and the implementation of OTEL's SDK is incredibly confusing and hard to implement/keep up to date. I spent way too many hours of my career struggling with conflicting versions of OTEL so I know the pain and I desperately wish they would at least take to heart the idea of separating implementation from API.
Food for thought: the subjective nature of both of those is exactly why they shouldn't be bundled.
> Traces are inherently sampled unless you're lighting all your money on fire
You can burn a lot of money with logs and metrics too. The question is how much value you get for the money you throw on the burning pile of monitoring. My personal belief is that well instrumented distributed tracing is more actionable than logs and metrics. Even if sampled.
(Disclaimer: I work at sentry)
I actually take the opposite approach. In my experience, well instrumented metrics and finely tuned logs are more actionable than distributed traces! Interesting how that works out.
I believe on the infrastructure side that might be correct. Within applications that doesn’t match my experience. In many cases the concurrent nature of servers makes it impossible to repro issues and narrow down the problem without tracing or trace aware logs.
With only sampled traces, though, it's very hard to understand the impact of the problem. There are some bad traces, but is it affecting 5%, 10% or 90% of your customers? Metrics shine there.
Whether it is affecting 5% or 10% of your customers, if it is erroring at that rate you are going to want to find the root cause ASAP. Traces let you do that, whereas the precise number does nothing. I am a big supporter of metrics but I don't see this as the use case at all.
(not your OP) This is true, but I find that metrics are useful whether something is going wrong or not (metrics that show 100% success are useful in determining baselines and what "normal" is), whereas collecting traces _when nothing is going wrong_ is not useful -- it's just taking up space and ingress, and thus costing me money.
My typical approach in the past has been to use metrics to determine when something is going wrong, then enable either tracing or logs (usually logs) to determine exactly what is breaking. For a dev or team that is highly connected to their software, simply knowing what was recently released is enough to zero in on problems without relying upon tracing.
Traces can be useful, but they're expensive relative to metrics, even if sampled at a very low rate.
Yes, and:
Not all problems result in error traces to analyse.
Example, you release buggy client that doesn't call "POST /order/finalize" when it should.
There are no error traces, there are just missing HTTP requests. Metrics reveal that calls to "POST /order/finalize" for iOS apps are down 50% WoW.
Strange example, you'd think you want to fix this as quickly as humanly possible, no?
Also we don't sample traces, it's a fire hose of data aimed at the OTel collector. We do archive them / move them to colder and cheaper storage after a little time though, and we found that a viable money-saving strategy and a good balance overall.
> Not all problems result in error traces to analyse.
> Example, you release buggy client that doesn't call "POST /order/finalize" when it should.
> There are no error traces, there are just missing HTTP requests. Metrics reveal that calls to "POST /order/finalize" for iOS apps are down 50% WoW.
How? Distributed traces show you what's going on with a request in detail. Metrics tell you zero detail, by definition. Logs show you whatever people have decided to log manually, which is usually very incomplete.
> OTel is flawed for sure, but I don't understand the stance against metrics and logs.
Even if you don't want to consider the privacy concerns: telemetry also wastes quite a bit of your internet connection's data.
Hey, this isn't the sort of telemetry we are talking about with OTel.
About the only "privacy concern" with otel is that you are probably shipping traces/metrics to a cloud provider for your internal applications. This isn't the sort of telemetry getting baked into ms or google that is used to try and identify personal aspects of individuals, this is data that tells you "Foo app is taking 300ms serving /bar which is unusual".
After I added OTel to an open source project I run, I spent a bit of time arguing with someone about telemetry - they kept saying they didn't opt in and that we need to inform our users about it, etc., and I kept saying no, that's not the same type of telemetry. I wonder how common this misconception is.
This is the second time I've seen this misconception come up on HN, and I've definitely seen it on Reddit at least once.
OpenTracing was a much clearer name, especially for those of us who really don't care about doing logging or metrics through OTel.
Client-side transport is pretty unusual with OTel. I think almost everybody is sending things from the server side, so I don’t think your concern is usually relevant.
I think you're talking more about RUM (real user monitoring), which isn't yet supported by OpenTelemetry. I think they are working on it.
I am not sure if it will support session replays like some vendors, such as Sentry or New Relic, offer. Technically, I think session replay (rrweb etc.) is pretty cool, but as a web visitor I am not a fan.
I've used Otel quite a bit (in JVM systems) and honestly didn't know it did more than tracing.
That said, I think this rot comes from the commercial side of the sector -- if you're a successful startup with one product (e.g. graphing counters), then your investors are going to start beating you up about why you don't expand into adjacent product areas (e.g. tracing). Repeat the previous sentence reversed, and so you get Grafana, New Relic, et al. OpenTelemetry is just mirroring that arrangement.
You can absolutely use just the OTel APIs and use something else besides the OTel SDK. Here is a blog post about how we did it with eBPF: https://odigos.io/blog/Integrating-manual-and-auto
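To make that concrete, here's roughly what the API-only path looks like for a library author. This is a sketch: 'my-db-client' and the span/attribute names are placeholders, and the calls are no-ops until the application wires up a real provider.

    import { trace, SpanStatusCode } from '@opentelemetry/api';

    // Only the API package is imported - no SDK, no exporter.
    const tracer = trace.getTracer('my-db-client');

    export async function query(sql: string): Promise<void> {
      // startActiveSpan makes this span the parent of anything started in the callback.
      return tracer.startActiveSpan('db.query', async (span) => {
        try {
          span.setAttribute('db.statement', sql);
          // ... run the actual query here ...
        } catch (err) {
          span.setStatus({ code: SpanStatusCode.ERROR });
          throw err;
        } finally {
          span.end();
        }
      });
    }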
I only learned about OT after Prometheus announced some deeper integration with it. Reading OT docs about metrics feels like every little problem has a dedicated solution in the OT world, even if a more generalised one already covers it. Which is quite striking coming from the Prometheus world.
OpenTelemetry is interesting. On one side it is designed as the "commodity feeder" to a number of proprietary backends such as Datadog; on the other hand, we see good development of open source solutions such as SigNoz and Coroot with good OTel support.
1. The main reason I want to use otel is so I can have one sidecar for my observability, not three, each with subtly different quirks and expectations. (also the associated collection/aggregation infrastructure)
2. I honestly think the main reason otel appears so complex is the existing resources that attempt to explain the various concepts around it do a poor job and are very hand-wavey. You know the main thing that made otel "click" for me? Reading the protobuf specs. Literally nothing else explained succinctly the relationships between the different types of structure and what the possibilities with each were.
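For anyone who hasn't gone digging, the relationships that finally clicked for me are basically this small nesting (a simplified, TypeScript-shaped sketch of the trace protos, not the actual message definitions):

    // Roughly how opentelemetry-proto organizes trace data.
    interface KeyValue { key: string; value: unknown }

    interface ResourceSpans {
      resource: { attributes: KeyValue[] }; // who emitted it: service.name, host, etc.
      scopeSpans: ScopeSpans[];
    }

    interface ScopeSpans {
      scope: { name: string; version?: string }; // the instrumentation library
      spans: Span[];
    }

    interface Span {
      traceId: string;        // groups spans into one trace
      spanId: string;
      parentSpanId?: string;  // absent for the root span
      name: string;
      startTimeUnixNano: string;
      endTimeUnixNano: string;
      attributes: KeyValue[];
      events: { timeUnixNano: string; name: string; attributes: KeyValue[] }[];
      status: { code: number; message?: string };
    }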
Your point 2 would make for a very interesting blog post worthy of HN submitting. :)
This caught my eye:
> Logs are just events - which is exactly what a span is, btw - and metrics are just abstractions out of those event properties. That is, you want to know the response time of an API endpoint? You don't rewind 20 years and increment a counter, you instead aggregate the duration of the relevant span segment. Somehow though, Logs and Metrics are still front and center.
Is anyone replacing logs and metrics with traces?
imo Honeycomb pioneered this, and it's the right baseline. There are limitations to it of course, and certainly it's been done before at BigCos that can afford to build the tech, but it's extremely powerful.
The main argument for metrics beyond traces is simply an implementation detail - it's aggregation because you can't store the raw events. That doesn't mean, though, that you need a new abstraction on top of those metrics. They're still just questions you're asking of the events in the system, and most systems are debuggable by aggregating data points from spans or other telemetry.
As for logs, they're important for some kinds of workloads, but for the majority of companies I don't think they're the best solution to the problem. You might need them for auditability, but it's quite difficult to find a case where logs are the solution to debugging a problem if you had span annotations.
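A toy illustration of that argument: the "metric" most people actually want is just an aggregation over span properties. Everything below is made up for illustration; real systems do this in the storage/query layer, not in application code.

    // Given the durations (in ms) of spans whose name matches "GET /api/checkout"...
    function latencySummary(durationsMs: number[]) {
      const sorted = [...durationsMs].sort((a, b) => a - b);
      const pick = (q: number) =>
        sorted[Math.min(sorted.length - 1, Math.floor(q * sorted.length))];
      return { count: sorted.length, p50: pick(0.5), p95: pick(0.95), p99: pick(0.99) };
    }

    // ...the "response time metric" is just this aggregate, recomputed per time bucket.
    console.log(latencySummary([12, 15, 18, 22, 35, 120, 480]));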
Absolutely yes
> That means what we actually want is a way to say “hey OpenTelemetry SDK, give us all the current spans in the buffer”.
Isn’t this exactly what the SpanExporter API is for? This is in the Go SDK, I suppose it may not be available in other SDKs.
I have used this API to convert OTel spans into log messages as we currently don’t have a distributed tracing vendor.
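In the JS SDK the equivalent hook looks something like this, as far as I can tell (a sketch assuming @opentelemetry/sdk-trace-base; the Go SpanExporter interface is analogous):

    import { SpanExporter, ReadableSpan } from '@opentelemetry/sdk-trace-base';
    import { ExportResult, ExportResultCode } from '@opentelemetry/core';

    // Emits every finished span as a structured log line instead of shipping OTLP.
    class LogSpanExporter implements SpanExporter {
      export(spans: ReadableSpan[], resultCallback: (result: ExportResult) => void): void {
        for (const span of spans) {
          console.log(JSON.stringify({
            trace_id: span.spanContext().traceId,
            span_id: span.spanContext().spanId,
            name: span.name,
            duration_ms: span.duration[0] * 1000 + span.duration[1] / 1e6, // HrTime -> ms
            attributes: span.attributes,
          }));
        }
        resultCallback({ code: ExportResultCode.SUCCESS });
      }

      shutdown(): Promise<void> {
        return Promise.resolve();
      }
    }

You'd then attach it to your provider with a SimpleSpanProcessor or BatchSpanProcessor.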
I keep checking in on OpenTelemetry every few months to see if the bits we need are stable yet. There’s been very little progress on the things we’re waiting for.
I don’t follow closely enough to comment on possible causes.
What I do know is that the surface area of code and infrastructure that telemetry touches means adopting something unfinished is a big leap of faith.
What pieces are you looking to be stable (and what's your definition of stable)?
Asking because some pieces, like the Collector, aren't technically a stable 1.0 yet, but the bar for stability is extremely high, and in practice it's far more stable than most software out there.
But there are other pieces, such as a language's support for a specific concept, that are truly experimental or even still in-development.
IMO you might be looking at the wrong signals. OTel is quite successful today, and I've had zero breakages or compatibility problems for at least two years at this point.
Perhaps the real problem with OTel (IMO) is that it's trying to be everything for everyone and every language. It's trying to have a common interface so that you can write OTel in Java or JavaScript, Python or Rust, and you basically have the exact same API.
I suspect OP is seeing this directly when talking about the clunkiness of the JavaScript API.
The OTel spec does give leeway for language-specific details, and the SDKs are not as uniform as you'd expect (e.g. Java's agent configuration is very different from Node's auto instrumentation). I'm not denying that there are SDK specs to adhere to, but the abstraction complexity in OTel really comes from the amount of flexibility they've tried to build into the SDK, for better or for worse.
The flexibility benefits vendors (I work for HyperDX, based on otel) - as it allows for a lot of points of extensibility to build a better experience for end users by extending the vanilla SDK functionality. However, it creates a lot of overhead for end-users trying to adopt the "vanilla" SDKs out of the box as there's 5 layers of abstractions that need to be understood before getting things started (which is bad!)
I've only seen the DX of Otel improve over time across the ecosystems they support - so I suspect we'll get there soon enough.
Are they basically just saying that the OpenTelemetry client APIs should be split from the rest of the pieces of the project, and versioned super conservatively?
The simple API they describe is basically there in OTel. The API is larger, because it also does quite a few other things (personally, I think (W3C) Baggage is important too), but as a library author I should need only the client APIs to write to.
When implementing, you're free to plug in Providers that use OpenTelemetry-provided plumbing, but you can equally well plug in Providers from Datadog or Sentry or whatever.
Unless I'm missing something, any further complaints could be solved by making sure the Client APIs (almost) never have backward-incompatible changes, and are versioned separately.
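Concretely, that wiring lives in the application, not the library. Something like this on the 1.x Node SDK (the exporter and endpoint here are just examples; swap in whatever vendor you actually use):

    import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node';
    import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
    import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

    // Library code written against @opentelemetry/api stays untouched;
    // only this bootstrap decides where spans actually go.
    const provider = new NodeTracerProvider();
    provider.addSpanProcessor(
      new BatchSpanProcessor(new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }))
    );
    provider.register(); // installs this provider behind the global API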
It's a bit deeper than that. The SDKs that library authors implement against need to be extremely minimal. The collection libraries that vendors build on top of them should, imo, also be minimal.
OTLP imo doesn’t even need to be part of the spec.
But minimal would also mean focusing on solving fewer problems as a whole. Eg OpenTracing plus OpenMetrics plus OpenLogs. I only need one of those things.
Well, on OTLP they seem to agree with you: https://opentelemetry.io/blog/2023/otel-arrow/
that just sounds like a branding problem though...
OTLP has been quite useful, especially for metrics, as a format that doesn't really have any sacrifices/limitations compared to all the other protocols.
It is! But to prove your point, OTLP is actually just the transport protocol (the OpenTelemetry Protocol). It's one of _so many things_ the project is trying to address. All of those things might be problems, but not everyone has those same problems (vendors, customers, and lib authors), and bundling them all under one umbrella is what raises a red flag for me.
I actually have no need for a standard metrics implementation, just as an example. I never have, and I'd argue Sentry (as a tech company) never has. We built our own abstraction and/or used a library. That doesn't mean others don't, and it doesn't mean it shouldn't be something people solve, but bundling "all telemetry problems" into one giant design committee is a fundamental misstep imo.
I read all your comments in this thread and I am still confused as to what your actual problem is.
Would it help you if all language-specific OTel libraries had 3 parts: metrics, log, traces? Namely you want finer-grained opt-in approach for the programmer-users? Or is it something else you have a problem with?
Saying OTel is a failure of a design committee process is to me blowing hot air; you are telling us your conclusion and I personally care exactly zero about your conclusion. I want to see your process of arriving at the said conclusion. And so far neither the OP nor your comments here gave me almost any enlightenment in this regard.
Are you open to clarifying further?
I want one fundamental thing: the ability to ask library authors to implement span annotations in their projects. Today that ask comes with way too much baggage. The SDKs are often extremely complex ("bloated", others' words, not mine), and on top of that, most of these library authors don't use or care about the standard. The latter can be fixed with funding; the former is what I'm complaining about.
What I mean by that in practical terms is very easily articulated when we look at something like the bundling of logs. We've had standardized formats, adapters, and transports for logging for decades. Could they be better? Sure. But there's no future where someone builds an SDK and no one ever again innovates or has an alternative path to achieving it. The same is true for everything; logs are just an easy thing to pick on.
Take that a little further - why do I need a logging SDK bundled with a tracing SDK? I mean that very literally - why _must_ they be bundled? What do we get out of it?
You can argue "just don't use the other parts", but that's an academic argument at best. In practice it just means you've got an overloaded SDK with a bunch of bias associated with it - in this case (in my opinion) a bunch of companies who don't really innovate pushing legacy telemetry concepts on the masses. That might not sound like a problem, but a developer can barely make sense of the docs, and even at the most basic level, I shouldn't need to make an argument about software branching complexity to other software engineers. They're simply trying to do too many things.
What that ultimately boils down to for me, though, is: what problem is it solving? Tracing is a serious value add to _most_ stacks these days, and it requires an immense amount of coordination to solve. It's not (easily) maintainable through patching others' code, which goes back to why I personally agree with OpenTracing's original pitch. However, why, with that goal unsolved, did we attach a bunch of loosely-but-mostly unrelated problems to the spec? At the very least, problems that don't apply universally to the same audience. Why did we need a spec that bundled all of these problems, and continues to try to bundle more?
So I go back to the core problem: we want universal span annotations implemented across the ecosystem. I don't see what OTel is doing as the most effective way to achieve that goal, and I don't see the other goals they're trying to solve for as ones that are actually that important to most developers (and quite frankly, I could not tell you what many of those goals even are).
A lot of the development of OTel looks like two things:
1) Startups wanting easy access to telemetry, often without any product differentiation - thus, stakeholders whom I don't find totally relevant to the conversation. I'm sure that comes off as me being an asshole, but it's how I feel.
2) Big vendors trying to push consolidation on customers, all competing with the exact same products (think Datadog and all its copycats, again often with no product innovation). This isn't totally a problem, except it amounts to them pushing fragmented legacy concerns (such as outdated logging concepts) downstream.
All I want is great quality of data for both developers of libraries and developers of applications - those are the customers. I don't see those customers being served as well as they could be, and I don't see the goals being met _because_ of the distraction, lack of focus, and, as far as I can tell, lack of vision of the project.
I think anyone working on OTel, if they genuinely had the best interests of developers in mind, would be hard pressed to explain why Sentry's support isn't an extremely desirable thing given our market reach. The people that care about the project want us involved (and if you're reading this, thank you for constantly pushing us), and I want us to be involved too. So far, though, we're constantly struggling to get an unintrusive implementation in place that can work with our product. What I'm asking for would solve that, but to me this looks like a cultural problem and not a technical problem...
The issue is we currently don't see a great incentive to fund a bunch of piecemeal standards that aren't relevant to our product, and many library authors are not going to naturally invest in this. Even more, many of the authors I've talked to have no excitement whatsoever about the standard, or worse, an active distaste. People can say they're wrong, but frankly, that doesn't matter. You don't succeed by thinking someone else is wrong; you succeed by building what your customers want.
That is why I would like to see the project focus on problems that people actually have, and do it in a way that doesn't create tradeoffs for developers. To me those problems are the ones that aren't achievable without this level of coordination. They are not shimming another log or metrics collector into place. Those things might be relevant to some people, and that's fine, but we don't need one be-all-end-all project to encompass all sorts of fuzzy, semi-related problems.
I may or may not be making sense here, and I'm happy to chat about it more, though HN is probably not a good venue for that.
> Its not a hard problem, [...]. At its core its structured events that carry two GUIDs along with them: a trace ID and a parent event ID. It is just building a tree.
I've always wondered, what's the point of the trace ID? What even is a trace?
- It could be a single database query that's invoked on a distributed database, giving you information about everything that went on inside the cluster processing that query.
- Or it could be all database calls made by a single page request on a web server.
- Or it could be a collection of page requests made by a single user as part of a shopping checkout process. Each page request could make many outgoing database calls.
Which of these three you should choose merely depends on what you want to visualize at a given point in time. My hope is that at some point we get a standard for tracing that does away with the notion of trace IDs. Just treat everything going on in the universe as a graph of inter-connected events.
I think they meant "an event ID and a parent event ID".
I actually meant trace ID and parent event ID (the event's own ID was implied). Parent comment is correct in that the trace ID isn't technically needed, and it is in fact quite controversial. It's an implementation-level protocol optimization though, and unfortunately not an objective one. It creates an arbitrary grouping of these annotations - which is entirely subjective, and which the spec struggles to reconcile - but it's primarily there because the technology to aggregate and/or query them would be far more difficult if you didn't keep that simple GUID.
It does have one positive benefit beyond that: if you lose data, or have disparate systems, it's pretty easy to keep the trace ID intact and still have better instrumentation than otherwise.
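To spell out the data model being argued about (the field names here are illustrative, not the spec's):

    // The bare-bones model: events form a tree via parent pointers, and traceId is a
    // denormalized grouping key so a whole tree can be fetched without walking parents.
    interface AnnotatedEvent {
      id: string;         // this event's GUID
      parentId?: string;  // absent for the root event
      traceId: string;    // same value for every event in the tree
      name: string;
      startUnixNano: number;
      endUnixNano: number;
      attributes: Record<string, string | number | boolean>;
    }

    // Reassembling the tree is then just a group-by on parentId:
    function childrenOf(events: AnnotatedEvent[], parentId: string): AnnotatedEvent[] {
      return events.filter((e) => e.parentId === parentId);
    }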
I am absolutely not seeing the problem you describe, I appreciate the ability to have a single ID for stuff that might go through several microservices. Helps us discover all data related to a single request-like entity in case of trouble. It already helped us trace semi-failed complex processes related to separate customers and we were able to manually fix things in a matter of minutes.
As said in other replies to you, I am still not at all clear what your issue with OTel is.
3 GUIDs then, event ID was inferred. I see.
tl;dr OpenTelemetry eats Sentry's cake by commoditizing what they do and the reaction of the founder of Sentry is to be very upset about it rather than innovating.
> Everyone and their mother is running a shoddy microservice-coupled stack,
buried the lede!
Up my alley. I'm the author of a FOSS Golang span instrumentation library for latency (LatLearn, on my GitHub), and was part of the team that built an in-house distributed tracing solution for Orbitz back in 2006/2007.