Observability vs. monitoring debate: An irreverent view

47 points by simskij 5 years ago · 18 comments

vmarsy 5 years ago

Observability isn't just a rebranding of Monitoring; it's Monitoring plus making it as actionable as possible via standardization.

Specifically, how to make the sum of all monitored "pillars" more useful than each of them individually.

The 3 major pillars are:

- Metrics (whether application or higher-level of the stack, like OS)

- Logs (whether structured or unstructured)

- Traces

Observability is these major pillars and how to easily "jump" from one to another to very quickly identify the root cause of an issue, i.e. go Metrics <-> Logs, Logs <-> Traces, or Metrics <-> Traces.

For instance, with good Metrics, one can easily detect and get alerted on a large spike of 500 errors. But when Metrics & Logs work together, one can easily see the exceptions and stack traces that are emitted with those 500 errors.

Similarly, with good Metrics, one can easily figure out that the frontend service latency p90 has increased by 5x. But with Metrics & Traces working together (for instance via exemplars[1]), one can look at a bunch of the traces that have a very high latency, and identify the upstream service responsible for this increase.

With Monitoring only, you could get a nice Metrics solution in place, with fancy alerting rules, but all it was good at was informing you "Something bad is currently happening". With a good "Observability" setup, you should also be able to change that to "Something bad is currently happening and the root cause is right here."

[1] https://grafana.com/docs/grafana/latest/basics/exemplars/
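
As a rough illustration of the Metrics <-> Traces jump described in this comment, here is a minimal Python sketch of the exemplar idea: each latency histogram bucket remembers the trace ID of the last request that landed in it, so a slow bucket points straight at a trace worth opening. The `LatencyHistogram` class, the bucket bounds, and the trace IDs are all hypothetical; this is not the Grafana or Prometheus API, just the concept reduced to plain Python.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class Exemplar:
    value: float   # observed latency in seconds
    trace_id: str  # ID of the trace that produced this observation


@dataclass
class LatencyHistogram:
    # Upper bounds of each bucket in seconds (assumed values).
    # Anything above the last bound goes into an extra overflow bucket.
    bounds: tuple = (0.1, 0.5, 1.0, 5.0)
    counts: list = field(default_factory=list)
    exemplars: dict = field(default_factory=dict)

    def __post_init__(self):
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, seconds, trace_id):
        # Find the first bucket whose upper bound is >= the observation.
        idx = bisect_left(self.bounds, seconds)
        self.counts[idx] += 1
        # Keep only the most recent exemplar per bucket.
        self.exemplars[idx] = Exemplar(seconds, trace_id)

    def slow_traces(self, threshold):
        # The "jump": from a metric (latency buckets) to concrete trace IDs.
        return [e.trace_id for e in self.exemplars.values() if e.value >= threshold]


hist = LatencyHistogram()
hist.observe(0.08, "trace-a")
hist.observe(0.09, "trace-b")
hist.observe(2.4, "trace-c")
slow = hist.slow_traces(1.0)
```

The metric alone (`hist.counts`) only says that one request was slow; the exemplar attached to that bucket says which trace to look at.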

simskijOP 5 years ago

Logs, traces and profiling were all viable parts of a good monitoring stack even prior to the term observability being coined.

mmanciop 5 years ago

That’s trope #1 right there :-)

thinkingkong 5 years ago

Both concepts are related but separate. I'd argue that monitoring is the act of alerting or taking action when some metric falls outside its normal range. That range is defined by absolute thresholds, or by thresholds that change based on historical values. Observability is the ability to see what's happening across and between multiple systems so that any issues raised by monitoring can be diagnosed quickly. "I know there's a problem" vs "I know which part of the system it's happening in".
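
To make the two kinds of threshold in this comment concrete, here is a small Python sketch. It assumes one particular way of deriving a threshold from history (mean plus `k` sample standard deviations); the function names, the value of `k`, and the sample numbers are illustrative assumptions, not something the comment specifies.

```python
from statistics import mean, stdev


def breaches_absolute(value, limit):
    """Fixed threshold: alert when the metric exceeds a hard limit."""
    return value > limit


def breaches_historical(value, history, k=3.0):
    """Changing threshold: alert when the metric is more than k sample
    standard deviations above the mean of recent history."""
    if len(history) < 2:
        return False  # not enough data to estimate a spread
    return value > mean(history) + k * stdev(history)


# Hypothetical requests-per-second samples from the recent past.
recent = [100, 102, 98, 101, 99]

fixed_alert = breaches_absolute(120, limit=150)
history_alert = breaches_historical(120, recent)
quiet = breaches_historical(103, recent)
```

With these numbers, a value of 120 stays under the fixed limit of 150 but is well outside what recent history looks like, so only the history-based check fires. Either way, the check only says that something is off; finding where is the observability part.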

mmanciop 5 years ago

I very much agree: observability is a property of a system, like transparency is for glass. Monitoring is what you need to do with a lot of systems to make them observable.

ittan 5 years ago

Not to be rude, but Monitoring is observing problems and then responding to the alarm in real time.
You cannot compare the two. Yet we do in certain circumstances and that is a loss in understanding, which I am sad about.

idoco 5 years ago

I see OpenTelemetry as an application of the same idea pushed by Google via the k8s revolution.

Create a great vendor-agnostic open source tech. Get everyone riled up about the dangers of vendor-locking solutions. Use the new tech to carve yourself a piece of the market from the current incumbent.

It is pretty great and all, but sometimes it is easier to build your app with a simple vendor-locked tech than a super generic agnostic technology.

pm90 5 years ago

I agree with this.
It’s kinda important to understand who all this is meant for. If you’re a lean startup, just use the best/cheapest/quickest tool regardless of vendor lock-in. It’s when you get to a certain scale that vendor agnosticism becomes a real concern, but by then you probably have enough resources to hire folks who will rebuild your stack.

zihotki 5 years ago

> It is pretty great and all, but sometimes it is easier to build your app with a simple vendor-locked tech than a super generic agnostic technology.

That's golden for sure. Until the product is bought out, or companies merge, or you name it. And then you end up with a pile of products with different vendor-locked log formats and metrics. At that point you'd like to get some standardization, and OpenTelemetry is a perfect candidate for common ground. Thus support for OpenTelemetry becomes a major decision factor when selecting a vendor or OSS solution for your problem, doesn't it?

goodpoint 5 years ago

"Observability" existed 20 years ago and it consisted of application metrics, OS metrics, log analysis and correlation, SIEM.

It was called monitoring.

simskijOP 5 years ago

Which I think is the point of the article as well. :)

- mattcwilson 5 years ago
  
  The article mentions that the term “observability” was coined in 1960.
- mmanciop 5 years ago
  
  It very much is :-)

pm90 5 years ago

There are certainly elements of hype to the term since it’s new and a lot of startups and projects are trying to get the biggest mindshare.

Monitoring to me is exclusively about metrics and alerts. Metrics are really useful but they often don’t give you the whole context, and might sometimes be misleading. E.g. you see a spike in CPU usage for a service; you probably just autoscale and call it a day, and that’s the end of it. Having metrics is SO much better than not having metrics though; they delivered insights that were just not possible before.

Observability to me is the next iteration of this process of understanding system behavior. Metrics are limiting, so maybe you look at logs. Well, they suffer from some of the same issues, so you try profiling, you try tracing. Ultimately the goal is to explore tools that allow developers to quickly get a truthful understanding of how their systems really work, and use that knowledge to improve their systems.

csours 5 years ago

I see it as a tiered approach - Detailed Process Logs saved to a db, Audits, Regular logs streamed to your favorite log dump, Monitoring, Health Checks, Metrics, and Usage Statistics are all part of Observability.

If it is a critical process, use detailed process logs.

If it is an extremely critical process add Auditing.

nivertech 5 years ago
```
  Regular logs (text files) > Detailed Process Logs (DB) > Audits (read-only/tamper-proof)
```
What's the difference between Regular logs, Detailed Process Logs and Audits?
Are Audits more compliance related, while detailed logs are business only?
It's not clear to me why you would have a special detailed process log if you can write important data to Audits and less important data to regular logs, i.e. one can use different log levels for business data vs engineering/debug data.
Can you please give an example?
- csours 5 years ago
  
  Regular logs are just whatever your app logs normally (engineering/debug).
  Process logs are business processes that you need to report on or answer for later.
  Audits are another business layer for safety- or life-critical data, or adversarial data (handling money-based transactions). They could be compliance related. Auditing is a three-step process: ask a question, figure out how to answer it, and make sure you've answered it.
  So specifically, if you have a magic bean shop and you want to log transfer of the beans, that would be a process log. Then at the end of the day, you would send a manifest of all the beans sent that day. The audit process would look at the final location of the beans, and also at the process log to answer your audit questions.
  ---
  I think a lot of people make a hard distinction between business logging and engineering logging, and I'm not sure that makes sense.
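
A toy Python version of the magic bean example from this reply may help separate the three ideas: the process log records each transfer, the end-of-day manifest is derived from it, and the audit compares the manifest against where the beans actually ended up. All field names, shipment IDs, and quantities are invented for illustration, and the `observed` counts stand in for whatever independent check a real audit would use.

```python
from collections import Counter

# Process log: one business-level record per bean transfer (hypothetical data).
process_log = [
    {"shipment": "s1", "destination": "north", "beans": 3},
    {"shipment": "s2", "destination": "south", "beans": 5},
    {"shipment": "s3", "destination": "north", "beans": 2},
]


def build_manifest(log):
    """End-of-day manifest: total beans sent per destination."""
    totals = Counter()
    for entry in log:
        totals[entry["destination"]] += entry["beans"]
    return dict(totals)


def audit(manifest, observed):
    """Return destinations where the manifest and the observed final
    counts disagree, as {destination: (expected, observed)}."""
    locations = manifest.keys() | observed.keys()
    return {
        loc: (manifest.get(loc, 0), observed.get(loc, 0))
        for loc in sorted(locations)
        if manifest.get(loc, 0) != observed.get(loc, 0)
    }


manifest = build_manifest(process_log)
# What was actually counted at each location (assumed independent check).
observed = {"north": 5, "south": 4}
discrepancies = audit(manifest, observed)
```

Here the audit question is "did every bean we sent arrive?", the manifest is how it gets answered, and the discrepancy for `south` is the answer. An ordinary debug log would not need to survive this kind of scrutiny.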
mmanciop 5 years ago

Yep, pretty much. There are plenty of different types of telemetry one can use, and which ones are important for an application depends both on the nature of that application and on who operates it.
