Ask HN: Why is observability so broken, and what can fix it?

6 points by idea0rbit 5 months ago · 5 comments


fogzen 5 months ago

It’s broken because copying all data somewhere else to query isn’t scalable, technically or financially. Especially when all the observability business models are based on charging for data.

The whole observability industry is based on a flawed approach: that you can add it in after the fact, ship that data somewhere else, then charge an arm and a leg to access it. That breaks down quickly in any non-trivial distributed system. Even when ignoring issues with sampling, it is cost-prohibitive.

  • idea0rbitOP 5 months ago

    You are absolutely right. What could an alternate design and business model look like?

zarathustra333 5 months ago

Are you talking about observability for AI workflows or more generally? I have a friend working on the former.

  • idea0rbitOP 5 months ago

    Both. In general I think most observability systems are broken. This article captures the sentiment pretty well:

    https://www.linkedin.com/pulse/observability-broken-its-time...

    • tanelpoder 5 months ago

      Good article, thanks for sharing. I've been working on one part of this problem space for quite a while too. I want the ability to drill down directly into the reasons for latency and into the wall-clock time of the underlying application component threads, instead of having to correlate various systemwide utilization metrics and manually connect the dots.

      I'm using eBPF-based dimensional data analysis, starting from the bottom up (every system is a bunch of threads, including distributed systems) and moving up from there. This doesn't replace existing distributed tracing approaches for an end-to-end request view, but it gives you deep observability all the way down to each service's underlying threads' wall-clock time (where they are blocked or sleeping, and why, etc.).

      At this year's P99CONF I will launch the first GA release of my (open source) 0x.tools xcapture eBPF collectors, with a reference implementation of a TUI tool, showing dimensional performance modeling on these new thread sampling signals (xtop).

      A couple of 1-minute asciicasts of xtop are here: https://tanelpoder.com/posts/xcapture-xtop-beta/
