Prometheus 2.0

274 points by bbrazil 8 years ago · 42 comments


Dowwie 8 years ago

The TimescaleDB crew built a Prometheus-PG adapter [1]. TimescaleDB, and consequently Postgres, is an option I am exploring. Does a developer realize any benefits of 2.0 if the Prometheus database isn't used as a backend?

[1] https://github.com/timescale/prometheus-postgresql-adapter

jrv 8 years ago

I am not too familiar with the PG-TimescaleDB adapter, but the storage engine rewrite in Prometheus was definitely the most important improvement in Prometheus 2.0.
Besides the storage, there are other changes, outlined in the migration guide: https://prometheus.io/docs/prometheus/latest/migration/
* Minor PromQL changes (mainly removals of deprecated features)
* Staleness handling (I'm not sure whether the TimescaleDB storage handles this correctly, as stale series are marked by storing a special NaN value that must be preserved bit-for-bit)
* The rule file format is now YAML and supports per-group evaluation intervals and serialized rule execution within a group (in case of data dependencies)
* Other minor bits
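To make the staleness point concrete: Prometheus marks a series stale by writing a sample whose value is a specific NaN bit pattern, 0x7ff0000000000002 (the StaleNaN constant in its Go source). Every NaN looks the same to an ordinary `isnan` check, so detecting the marker means comparing raw bits, and any storage layer that canonicalizes NaNs on write loses it. The Python below is a minimal sketch of that check. `naive_store` is a hypothetical backend used for illustration, not a description of how the TimescaleDB adapter behaves, and the sketch assumes a platform where `struct` round-trips NaN payloads unchanged (which should hold on typical 64-bit CPython builds).

```python
import math
import struct

# Bit pattern Prometheus uses for stale markers (StaleNaN in its Go source).
STALE_NAN_BITS = 0x7FF0000000000002


def bits_to_float(bits: int) -> float:
    """Reinterpret a 64-bit unsigned integer as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def float_to_bits(value: float) -> int:
    """Reinterpret an IEEE 754 double as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def is_stale_marker(value: float) -> bool:
    # math.isnan() is true for every NaN, so compare the exact bits instead.
    return float_to_bits(value) == STALE_NAN_BITS


def naive_store(value: float) -> float:
    # Hypothetical backend that canonicalizes all NaNs to one value on write.
    return float("nan") if math.isnan(value) else value


stale = bits_to_float(STALE_NAN_BITS)
print(math.isnan(stale))                    # True: just "a NaN" to ordinary checks
print(is_stale_marker(stale))               # True: the exact bits survived
print(is_stale_marker(naive_store(stale)))  # False: the marker was lost
```

The design point is that "stored a NaN" and "stored the stale marker" are different guarantees; a backend only supports staleness if it preserves the full 64-bit value.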

nopzor 8 years ago

Really psyched to see this drop, the new engine's performance looks amazing. And performance was already pretty stellar to start with!

Looks like Prometheus is also now the most popular tsdb on github!

The Grafana team is working quite heavily on our Prometheus integrations, with some major recent improvements to the query editor in 4.6. We have many more plans to work even more tightly with the project.

Kudos to the Prometheus team!

jordan801 8 years ago

I was curious what this is, so I clicked on it. I spent around 30 seconds scanning the homepage and this is my observation: Uses data to make insights.

What kind of data? I have no clue. How? Not the faintest.

Anyway, you might want to add a section describing what it is for us scrubs who haven't a clue. Possibly a demo page?

kayoone 8 years ago

Prometheus is pretty popular in this community as a monitoring and metrics server, which is why this announcement got upvoted so much even if it lacks a general introduction.
nerdponx 8 years ago

> Power your metrics and alerting with a leading open-source monitoring solution.
A little vague, but I'm pretty sure I know what these guys do based on that.

sz4kerto 8 years ago

No good migration story for existing data, that'll hurt us quite a lot. :(

jrv 8 years ago

Yeah, as Prometheus's local storage is meant more as a transient / non-durable metrics store, the only current way to migrate while simultaneously accessing old and new data is to run both the old and new servers and have the new one read old data from the other one via the remote-read integration.
Someone could write a tool to do a full migration of the old storage format to the new one, but the formats are completely different and at least in the naive version of such tooling, that would have to happen offline and take a very long time to run for large storages.
EDIT: If you would like to fund development of such a tool, let us know :)
- sz4kerto 8 years ago
  
  > EDIT: If you would like to fund development of such a tool, let us know :)
  We're too small for that, unfortunately. Someone said at DockerCon that migrating large (multi-TB) stores would take a long time; this doesn't apply to us, we have only ~0.1 TB perf data as of now.
  - jrv 8 years ago
    
    Out of curiosity, do you care about migrating the data online, or would a brief Prometheus downtime (and thus gap in data) be ok?
    
    sz4kerto 8 years ago
    
    Downtime would be acceptable I think, especially because we could just launch a separate instance of Prom while the main one is being migrated.
bbrazil (OP) 8 years ago

There is a transition feature: https://www.robustperception.io/accessing-data-from-promethe...
The problem with data migration is that the two versions of the system lay out the data quite differently, so converting from one to the other would take a lot of disk seeks. In the worst case you could potentially be looking at days to convert the data over, which isn't really an option for most systems that care about older data.
- sz4kerto 8 years ago
  
  Seek time isn't relevant; our stuff is on SSDs. Thanks for the link, I already knew about the transition feature.
jrv 8 years ago

First attempts at this are now appearing:
https://groups.google.com/forum/#!topic/prometheus-users/wO5...
- https://github.com/Percona-Lab/prom-migrate (requires old Prom server to run for reading out data)
- https://github.com/juliusv/prom-data-migrator (operates offline on old and new storage dirs directly)
SuperQue 8 years ago

Percona just released a migration tool.
https://github.com/Percona-Lab/prom-migrate
lawnchair 8 years ago

If you need your Prometheus data to survive, you're doing it wrong.
- sz4kerto 8 years ago
  
  What are we doing wrong? We use Prometheus for many things, for example I'd like to know how specific latencies have changed over time. Why should I store these numbers somewhere else?
  - bbrazil (OP) 8 years ago
    
    Prometheus is not intended as durable long-term storage; it's fundamentally limited to the size of a single machine. You should also design your monitoring to be able to tolerate completely losing a Prometheus server's data.
    We recommend using another system for long term data, see https://prometheus.io/docs/operating/integrations/#remote-en... for some examples.
    
    jrv 8 years ago
    
    The problem is (as you know) that single machines are in practice still so reliable, and Prometheus is still so good at storing data for long periods, that many people have come to rely on it despite the warnings :)

SkyRocknRoll 8 years ago

Is this storage engine on par with InfluxDB?

pauldix 8 years ago

InfluxDB creator here. I'd say it's on par and in some cases a bit better. They can do some optimizations that we're unable to do. We have to worry about writing historical data rather than just append only workloads. We also have to support more data types than float64.
The design of the new Prometheus engine and InfluxDB's Time Structured Merge Tree engine are very similar. They're both heavily inspired by LSM Trees.
Our goal is to add more support for Prometheus standards like the metrics format and even PromQL over time. We recently added remote read/write support, which will be in the 1.4 release next week. https://www.influxdata.com/blog/influxdb-now-supports-promet...
I'd like to see Influx become something that is complementary to Prometheus, or at least solidly in the Prometheus ecosystem, as players like Weaveworks are.
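The shared idea behind engines "heavily inspired by LSM Trees" can be illustrated with a toy sketch: recent samples go into a mutable in-memory head, which is periodically frozen into immutable, time-bounded blocks, and a query skips any block whose time range can't overlap it. This is not the code of either project; `ToyTSDB`, `Block`, and `block_range` are hypothetical names, and a real engine adds a write-ahead log, compression, and compaction of small blocks into larger ones, all omitted here. The strict in-order check mirrors the "append only workloads" contrasted above with InfluxDB's need to accept historical writes.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Sample = Tuple[int, float]  # (timestamp, value)


@dataclass(frozen=True)
class Block:
    """Immutable, time-bounded run of samples, sorted by timestamp."""
    min_t: int
    max_t: int
    samples: Tuple[Sample, ...]

    def query(self, start: int, end: int) -> List[Sample]:
        # Samples are sorted, so binary search finds the [start, end] slice.
        ts = [t for t, _ in self.samples]
        return list(self.samples[bisect_left(ts, start):bisect_right(ts, end)])


@dataclass
class ToyTSDB:
    """Append-only store: a mutable in-memory head plus immutable blocks."""
    block_range: int
    head: List[Sample] = field(default_factory=list)
    blocks: List[Block] = field(default_factory=list)

    def _last_t(self) -> Optional[int]:
        if self.head:
            return self.head[-1][0]
        return self.blocks[-1].max_t if self.blocks else None

    def append(self, t: int, v: float) -> None:
        last = self._last_t()
        if last is not None and t <= last:
            # This sketch only supports strictly increasing timestamps.
            raise ValueError(f"out-of-order sample at t={t}")
        # Once the head spans block_range, freeze it into an immutable block.
        if self.head and t - self.head[0][0] >= self.block_range:
            self.blocks.append(Block(self.head[0][0], self.head[-1][0], tuple(self.head)))
            self.head = []
        self.head.append((t, v))

    def query(self, start: int, end: int) -> List[Sample]:
        out: List[Sample] = []
        for b in self.blocks:
            # Blocks whose time range can't overlap the query are skipped entirely.
            if b.max_t >= start and b.min_t <= end:
                out.extend(b.query(start, end))
        out.extend(s for s in self.head if start <= s[0] <= end)
        return out


db = ToyTSDB(block_range=10)
for t in range(0, 30, 5):
    db.append(t, float(t))
print(len(db.blocks))   # 2
print(db.query(3, 22))  # [(5, 5.0), (10, 10.0), (15, 15.0), (20, 20.0)]
```

Because frozen blocks never change, they can be written once and read without locking, which is much of what makes this layout cheap on disk I/O.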
will-not-smith 8 years ago

Nope, not on par.
While there is no direct comparison to InfluxDB in the article, look at the disk usage and (more importantly) disk I/O utilization, which is indistinguishable from zero on the graph, whereas before it was at 20-30% with the same load.
So not on par. Much, much better.
(Of course, Prometheus and InfluxDB are not perfect substitutes for one another, so there's much more to look at than just storage engine performance.)
- bbrazil (OP) 8 years ago
  
  The third version of the InfluxDB database is basically the same core design as Prometheus's third database version which 2.0 uses. I'd expect performance to be broadly similar, maybe a little better with Prometheus as we can take advantage of the characteristics of our domain.
  We do have a comparison to Influx on our website: https://prometheus.io/docs/introduction/comparison/#promethe... Which is the right choice really depends on the use case. If you're doing metrics-based operational monitoring, Prometheus is generally best.
  Disclaimer: Prometheus developer.
- SuperQue 8 years ago
  
  We just upgraded one of GitLab's Prometheus servers (we have HA redundant servers).
  We saw a better than 10x reduction in CPU use.
  https://imgur.com/3aJtYu9
  This server doesn't take any query load, but it does take the same ingestion load as the others.
  (The long downtime in the middle is waiting for rm -rf on the old data dir.)
SEJeff 8 years ago

FWIW, when I met Paul Dix a few months ago, he had nothing but good to say about the new Prometheus. They released this blog post shortly after our meeting:
https://www.influxdata.com/blog/influxdb-now-supports-promet...
I wouldn't see them as competitors so much as complementary. Influx is really going for long-term TSDB storage, whereas Prometheus is more about shorter-term, higher-resolution data and alerting.
- pauldix 8 years ago
  
  Yup, I'm definitely a fan of their work. The new engine is a great design and all the work on the metrics standard is great. Having that as a standard format for service implementors to expose performance metrics is invaluable and we're going to support that as much as we can.

pfranz 8 years ago

Literally yesterday I did an install of 1.8.2 after seeing 2.x still in beta. Time to upgrade.

Congrats to the Prometheus team.

wanghq 8 years ago

Does anyone need to view metrics older than 2 weeks? What's your solution? It feels odd to rely on different software for different time ranges, e.g. Prometheus for the most recent data and something else for older data. What if I want to compare the same metric over the last 4 weeks?

koffiezet 8 years ago

Multi-year projections? A client of ours has busy periods every year, and they want year-by-year comparisons. They expect/pay us to keep metrics (although heavily compressed/condensed) for the last 5 years.
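One common way to keep years of metrics "heavily compressed/condensed" is downsampling: collapsing raw samples into one summary row per coarse time window. The comment doesn't say how this client's data is condensed, so the Python below is only a minimal sketch under assumptions: `downsample` is a hypothetical helper, windows are aligned to multiples of the window length, and min/max/avg is an illustrative choice of aggregates.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def downsample(
    samples: Iterable[Tuple[int, float]], window: int
) -> List[Tuple[int, float, float, float]]:
    """Collapse (timestamp, value) samples into (window_start, min, max, avg) rows.

    Windows are aligned to multiples of `window` (an assumption of this sketch).
    """
    buckets: Dict[int, List[float]] = defaultdict(list)
    for t, v in samples:
        buckets[t - t % window].append(v)
    return [
        (start, min(vals), max(vals), mean(vals))
        for start, vals in sorted(buckets.items())
    ]


raw = [(0, 1.0), (30, 3.0), (60, 5.0), (90, 7.0), (120, 2.0)]
print(downsample(raw, 60))
# [(0, 1.0, 3.0, 2.0), (60, 5.0, 7.0, 6.0), (120, 2.0, 2.0, 2.0)]
```

Keeping min and max alongside the average preserves peaks that a plain average would smooth away, which matters when the point of the comparison is a yearly busy period.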

the_arun 8 years ago

How does Prometheus compare to Splunk?

bbrazil (OP) 8 years ago

Splunk is an event logging system, whereas Prometheus is metrics-based. You need both types of systems to be able to properly observe your systems; they're complementary.
- chimeracoder 8 years ago
  
  > Splunk is an event logging system, whereas Prometheus is metrics-based. You need both types of systems to be able to properly observe your systems; they're complementary.
  While this is the traditional way of looking at them, I strongly disagree that metrics and logs are different toolsets, or that you would need both of them in order to properly observe your systems.
  I've written and spoken about this approach before: https://medium.com/@chimeracoder/dont-read-your-logs-13586c7... and https://vimeo.com/221049715
  - bbrazil (OP) 8 years ago
    
    > I strongly disagree that metrics and logs are different toolsets, or that you would need both of them in order to properly observe your systems.
    And from the link:
    > Logging can be useful for some purposes. However, it’s rare that they’re the only tool for monitoring your code. And it’s even rarer that they’re the best tool.
    Metrics are a tool that take a different approach to logs, once you get beyond small systems you need both. I talked about this earlier in the year: https://www.youtube.com/watch?v=hCBGyLRJ1qo
    
    eternalban 8 years ago
    
    https://youtu.be/hCBGyLRJ1qo?t=6m45s
    [edit]
    So I watched a few minutes into this, Brian, and it seems to me that either an expert system, or some form of rudimentary AI, that observes the monitoring system can be the driver of an intelligent alerting system. In other words, it seems 'alerts' are, in the final analysis, the higher value proposition.
    And I fully agree with you: it really is a waste of talent to have engineers glued to screens watching graphs.
    
    chimeracoder 8 years ago
    
    > Metrics are a tool that take a different approach to logs, once you get beyond small systems you need both. I talked about this earlier in the year:
    Quite the opposite - the "some purposes" I'm talking about are precisely the small scale. As scale grows, the use case of logs and metrics converges, and metrics become a strictly better tool.
    
    jrv 8 years ago
    
    The question is about tracking and storing individual events (logs) with arbitrary per-item detail vs. dimensionally limited aggregations (time series / metrics). In either case, I think we agree that the data should be recorded in a structured way, and when I say "logs" I just mean a record of individual items, not of sampled/aggregated metrics.
    Given that, you need both logs (individual events) and metrics. Logs give you crucial insight into individual interesting events such as single requests that bring your service down, but logs are orders of magnitude more expensive than metrics in tracking, storage, and processing. So that's why you use metrics for a much wider scope and for longer time periods.
    
    scotch_drinker 8 years ago
    
    Not for high cardinality events like what happened to a particular user during a single session. Metrics will never help with that type of problem.
    
    chimeracoder 8 years ago
    
    > Not for high cardinality events like what happened to a particular user during a single session. Metrics will never help with that type of problem.
    No, and as I explain in both that article and the video, logs aren't the best solution for that use case either.
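jrv's distinction between individual events (logs) and dimensionally limited aggregations (metrics) can be made concrete with a small Python sketch. `RequestEvent` and its fields are hypothetical. The log side keeps one record per event, so its size grows with traffic; the metric side keeps one counter per label combination, so it stays small but can no longer answer a per-user question like the one scotch_drinker raises.

```python
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RequestEvent:
    """One request with arbitrary per-item detail (hypothetical fields)."""
    user_id: str
    path: str
    status: int
    duration_ms: float


def to_log_records(events):
    # Logs: one record per event, keeping every field, including user_id.
    return [asdict(e) for e in events]


def to_metrics(events):
    # Metrics: one counter per (path, status) label pair. user_id is dropped,
    # since a per-user label would create one time series per user.
    return Counter((e.path, e.status) for e in events)


events = [
    RequestEvent(f"u{i}", "/api", 500 if i % 10 == 0 else 200, 12.5)
    for i in range(1000)
]
logs = to_log_records(events)
metrics = to_metrics(events)
print(len(logs))                # 1000: grows with the number of events
print(len(metrics))             # 2: grows only with the number of label combinations
print(metrics[("/api", 500)])   # 100
# Per-user detail is only recoverable from the log records.
print([r for r in logs if r["user_id"] == "u42"])
```

The trade-off is the one described in the thread: the counters are cheap enough to keep for long periods and wide scopes, while the per-event records cost orders of magnitude more but are the only place the individual request survives.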

frik 8 years ago

Does Prometheus 2 still use the unmaintained charting library "flot charts"?

http://www.flotcharts.org/

No commits since 2014, which is a pity, because it's a great chart library (probably the best free one); only its reliance on jQuery isn't that fashionable.

Edit: Oh, it was Grafana and Kibana that used Flot charts (at least when I tried them two years ago; maybe things have changed). A quick Google search turned up this 2014 issue: https://github.com/grafana/grafana/issues/222

jrv 8 years ago

No, we never used flot charts. We're using Rickshaw since the beginning: http://code.shutterstock.com/rickshaw/
It was unmaintained for a while, but now it seems there've been commits in the recent past again. However, for real dashboarding (not just simple ad-hoc queries), we recommend using Grafana anyway.
SEJeff 8 years ago

This is one of the other "better" charting libraries, and it is maintained: http://nvd3.org/
