Amazon Managed Service for Prometheus
aws.amazon.com
The pricing just for the ingest seems way off. $0.002 per 10,000 metric samples might not seem like much, but even a simple node_exporter will grab 700 metrics every 15 seconds.
That's $24/month just to ingest the CPU/RAM/disk space data from each server. Plus storage and query costs.
At work I have a single r4.xlarge instance handling 1.3 million metrics every 15 seconds. Storage is not clustered but cost is only $500/month. It would cost me $45k/month just for the ingest with the new managed service.
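Quick back-of-the-envelope in Python, using the $0.002 per 10,000 samples figure above, a 15-second scrape interval and a 30-day month (a rough sketch, not AWS's exact billing formula):

```python
# Rough ingest-cost sketch, assuming a flat $0.002 per 10,000 ingested
# samples (the figure quoted above) and a 30-day month.
PRICE_PER_10K_SAMPLES = 0.002
SECONDS_PER_MONTH = 30 * 24 * 3600

def monthly_ingest_cost(active_series, scrape_interval_s=15):
    samples = active_series * SECONDS_PER_MONTH / scrape_interval_s
    return samples / 10_000 * PRICE_PER_10K_SAMPLES

print(monthly_ingest_cost(700))        # one node_exporter: ~$24/month
print(monthly_ingest_cost(1_300_000))  # 1.3M series: ~$45,000/month
```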
Pricing makes sense if you consider how Amazon operates at this point.
You put what is basically an MVP product out there with abnormal pricing. Your enterprise customers that are drowning in money can start using it, and with that money you can grow your org by hiring more engineers. At that point you start working on adding new features and on cost optimization. Since your whole architecture was designed around "we have to ship this ASAP", you can deliver some really nice cost reductions easily. Then you pass this on to your customers and gain goodwill and good PR.
And let's be honest. We all know a company or two that would throw _way_ more than 45k/year at a global metrics solution to handle that volume, and still wind up with a flaming scrap heap. And a promotion or two.
I don’t think any company is drowning in money. Everyone has a budget they are working against. At the end of the day, you can hire an engineer or pay AWS more. It’s all a trade-off.
Their pricing for these managed services used to be a "no brainer" (something like the cost of the compute only, or maybe a <30% upcharge). Managed Airflow was similarly very expensive (maybe 3x the cost). Just not worth it. Bummer.
Yeah, it turns out there's a lot of money to be made from people who don't have a good grasp of the fundamentals. We got a marketing email from Huggingface recently about their ML-models-as-a-service offering: https://huggingface.co/pricing
One of my colleagues asked if it might be better than creating our own infrastructure for that. I ran the numbers for one of our recent jobs, feeding a million tweets to two ML models to see which worked better. That would have cost about $1800 on Huggingface. Using AWS spot instances, it was maybe $25 for us to run ourselves.
Of course, we can do it at that price because we are paying for engineers and plan on classifying enormous amounts of text, so it works out for us. Plenty of other people probably should just use Huggingface. But I can't help looking at that 70x markup and think, "Fuck me? No, fuck you!"
Pricing makes sense for enterprises. Considering that you may need a team (or a part of one) to maintain a self-hosted cluster at possibly 0.995 reliability, do upgrades, manage devs, run all the mandatory security scans, justify why some enterprise scan tool throwing errors isn't an issue, etc. Oh also justify why you need the manpower to do it, at which point your VP will tell you to just use the managed service.
It doesn’t though. I just did a cost projection on our estate, and hiring two engineers to look after it on bare metal VMs is 30% cheaper than using the managed service. Plus it doesn’t require a lot of maintenance, so we can use those guys on improving the product as well, which actually gives direct customer benefits.
both google and amazon are insane with their observability services.
we ran away screaming from stackdriver when we saw how costs started piling up.
thank god for prometheus and grafana.
That's probably less than your team's payroll budget :) Their positioning is that you can reduce the staff needed to operate and maintain these instances.
14 cents per "query processing minute" sounds like it could add up very fast. Prom queries can get somewhat complex and it's not rare at all IME to have a dashboard making several multi-second queries per load (whether that falls into "you're using Prometheus wrong" being a separate discussion of course)
Edit: The example from their pricing page:
> We will assume you have 1 end user monitoring a dashboard for an average of 2 hours per day refreshing it every 60 seconds with 20 chart widgets per dashboard (assuming 1 PromQL query per widget)... assuming 18ms per query for this example.
Comes out to over $3 per month in query costs. Replace this 1 person with a TV showing the dashboard all day, and the cost jumps to $36, for just one dashboard and (again, IME) an overly optimistic query-time estimate... o.O
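Rough math behind those numbers, assuming the $0.14 per query-processing-minute rate mentioned above and a 30-day month (a sketch, not AWS's billing engine):

```python
# Sketch of the pricing-page example, assuming $0.14 per query-processing
# minute, 18ms per query, and a 30-day month.
PRICE_PER_QUERY_MINUTE = 0.14

def monthly_query_cost(hours_per_day, widgets, refresh_s=60, query_s=0.018):
    refreshes_per_day = hours_per_day * 3600 / refresh_s
    query_seconds_per_day = refreshes_per_day * widgets * query_s
    return query_seconds_per_day / 60 * 30 * PRICE_PER_QUERY_MINUTE

print(monthly_query_cost(2, 20))   # 1 user, 2h/day  -> ~$3/month
print(monthly_query_cost(24, 20))  # TV on 24h/day   -> ~$36/month
```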
Does it put any limits on the cardinality of metrics? Grafana Cloud's offering was absolutely awful for my use cases. They charge per series, so if you have metrics with a "pod=..." label your prices go through the roof.
Every managed metrics system will put a limit on cardinality, because all mainstream metrics systems cost more to query and store as cardinality grows. If they don't limit it, you can assume you or some other customer will use up the cluster's resources and cause an outage.
Like most metrics systems, under the covers Prometheus treats each unique combination of dimensions as a new metric line, i.e. a separate time series.
Plenty has been written about not using the server/container/pod ID as a label, because it leads to high cardinality, which leads to poor performance (cost aside). Time series databases have been purpose-built for certain workloads, and you can consider this their weakness.
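A rough illustration of why a pod label blows up series counts (the numbers here are made up for the example):

```python
# Illustrative only (numbers made up): total series is the product of the
# label-value counts, since each unique label combination is its own series.
metric_names = 50
endpoints = 40
status_codes = 5
series_without_pod = metric_names * endpoints * status_codes   # 10,000
pods = 200
series_with_pod = series_without_pod * pods                     # 2,000,000
print(series_without_pod, series_with_pod)
```

Under per-series pricing, that pod label alone is a 200x multiplier on the bill.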
Plenty has also been written about the bugs/issues that have cropped up that are only visible when inspecting what regions/nodes/cgroups an issue is coming from [0]. My use case wasn't exactly `pod=...` but it was very similar. It was more like `device=...`. Also, for a huge application, it's not uncommon to have 100s or even 1000s of metrics that are important to application health/performance. Constantly saying "do you really need X? It will cost us Y" will lead to an extremely under-monitored application.
[0] - https://cloud.google.com/blog/products/management-tools/sre-...
Plenty of companies run their own servers because cloud is too expensive at their scale. Same goes for metrics. It's a direct result of one-price-fits-all pricing models for software as well as pricing that is not correctly tied to value.
I like Weave Cloud’s Prometheus hosting model — it’s per host, which is predictable and forecastable.
Now do six dashboards, 10 widgets each, multiple viewers, 18h/day and one slowish query on each dashboard. Seems like we get to $100+ per month pretty quick.
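Rough math, with the same assumed $0.14/query-minute rate, 60-second refreshes, and a guessed ~1-second runtime for the slow query (that duration is my assumption, not from the pricing page):

```python
# Same assumed $0.14/query-minute rate; the ~1s slow-query runtime is a guess.
refreshes_per_day = 18 * 3600 / 60               # 18h/day, refresh every 60s
fast = refreshes_per_day * 6 * 10 * 0.018        # 6 dashboards x 10 widgets @ 18ms
slow = refreshes_per_day * 6 * 1.0               # one ~1s query per dashboard
per_viewer = (fast + slow) / 60 * 30 * 0.14      # query-seconds/day -> $/month
print(per_viewer)                                # roughly $500/month per viewer
```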
Caching means that multiple viewers cost very little extra.
(I am a Cortex maintainer)
Yeah, I dunno about this, and the Grafana service. They're not exactly complicated to run on their own. At this pricing you may as well be on Datadog.
I've commented fairly heavily in the related Grafana thread.
Prometheus is a bit of a different story. It does have some operational overhead when you get to a certain point, and scaling it out is not always trivial.
Assuming it works, there is value-add on this one, and the pricing is more in line with active use (i.e., a cost-plus model, which is more typical of AWS services).
This seems the more interesting of the two; Grafana is pretty simple to set up and maintain. The harder part is handling the metrics themselves, be it with InfluxDB, Prometheus, or something else.
Scaling prometheus across multiple separate Kubernetes clusters is a fking nightmare.
Use Victoria metrics. One lightweight agent per cluster pushing to a centralised metrics store makes it so much easier.
Setting up one Prometheus server is easy. Scaling, HA, and metrics retention for more than 3 days, not so much.
Prometheus is not easy to run at scale on the storage side.
This is all relative but I don't personally think so. Not on EC2+EBS, anyway. Certainly not as difficult as running/scaling an ES or Kafka cluster.
It's a completely different problem, because by default Prometheus does not shard anything, so you're bound to a single instance, whereas ES and Kafka are cluster-based.
Out of interest, what do you find hard about running Elasticsearch clusters?
In my experience ES has been one of the easiest clustered / highly available and sharded systems I've ever run - especially for how incredibly performant and reliable it is.
I've generally found that beyond right-sizing your nodes, indexes and shard configuration, it pretty much just works without ever really having issues.
Victoria Metrics is an absolutely superb drop-in replacement.
It's not a drop-in replacement (even though it tries to sell itself as such), it's incompatible in a significant number of ways and throws away part of your data.
We have been using Victoria Metrics in prod for more than 6 months. It is very reliable and scalable; it handles 2B+ series in our setup without breaking a sweat.
No, it's better. You can focus on the thing you're measuring rather than the bloody platform.
They made some pragmatic optimizations by dropping part of the data where it mostly doesn't matter. Who's actually affected by this? 0.01% of users?
You could say the same about any SaaS based on open source, but people still find it useful
From the pricing section:
> AMP counts each metric sample ingested to the secured Prometheus-compatible endpoint. AMP also calculates the stored metric samples and metric metadata in gigabytes (GB), where 1GB is 230 bytes.
Surely that's a typo, right?
Likely a casualty of copy and paste that left out the superscript formatting. 1GB is 2^30 bytes.
Everyone here complaining about the pricing on the managed Grafana and Prometheus services has clearly never worked at a shop using SumoLogic. Log/metric processing/querying is expensive for a reason.
I very much dislike Prometheus, but the fact that AWS is offering it as a managed service means I am in the minority. I attribute much of Prometheus' success to the influence of ex-Googlers. They joined other companies, had a lot of clout, and sought out a tool that was similar to what they once used.
I understand that the Google version of Prometheus is deprecated but there is no commercial equivalent.
What, in your opinion, is a better open-source alternative to Prometheus?
Borgmon was the inspiration for Prometheus, but it was a totally different project, so Prometheus is a complete rewrite.
I feel like a broken record, but we are having great success with Victoria Metrics as a drop-in replacement.
Promscale looks interesting. Keep the architecture of Prometheus while storing the data in TimescaleDB and using SQL as the query language (together with the TimescaleDB-specific extensions to it). Does anyone actually like PromQL?
It's nearly weekly that we bump up against the limits of time-series aggregation. I'd take anything else FOSS at this point.
I don't see any mention of Pushgateway. They'll need to add that or I won't be able to monitor ephemeral jobs.
Hey... Marc here from AWS. I'm the PM lead for this service. Thank you for the feedback. Pushgateway is important for our customers and it is a feature we are looking to support as part of our roadmap. For the time being, you can continue to use the Pushgateway as you do today and remote write the metrics to AMP for long term storage and querying!
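For anyone unfamiliar with that pattern: the ephemeral job pushes to a Pushgateway, and an ordinary Prometheus server scrapes the gateway and remote-writes to AMP. A minimal push from Python with the prometheus_client library (the gateway address and job name below are placeholders):

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# A short-lived batch job pushes its result to a Pushgateway; a regular
# Prometheus server scrapes the gateway and remote-writes the samples to AMP.
# The gateway address and job name are placeholders.
registry = CollectorRegistry()
g = Gauge('job_last_success_unixtime',
          'Last time the batch job finished successfully',
          registry=registry)
g.set_to_current_time()
push_to_gateway('pushgateway.example.com:9091', job='nightly_batch',
                registry=registry)
```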
I just went through the "process" of installing Grafana, Loki, Promtail and Prometheus on an Ubuntu box, and it is almost like the company behind all of this has gone out of their way to make it hard. It isn't really _that_ difficult to get set up, but it also isn't 'apt install' easy (you really want me to create my own startup scripts?) and it required me to build my own documentation on how I installed everything.
It's almost like the company behind it wants to see some profit after pouring millions of dollars into developing these tools. Except, in 2020 you cannot just have a closed-source easy-to-use documented and supported product with a license fee. Not in the server market, at least. Everything must be free and open-source, and you are expected to make money by offering a hosted service. Except, good luck competing with Big Cloud.
It's extremely worrisome. The incentive to spend your early mornings, nights, and weekends building something awesome to free yourself from corporate life is fading away. They need to institute some kind of royalty program or at least dedicate engineers to helping maintain the projects they make into services.
Almost have to change gears and get into a scientific field that isn't computer science.
One of the Loki maintainers here (though I mostly work on other stuff now). I promise it's not difficult on purpose.
We've put so much effort into optimizing the Kubernetes experience that non-containerized installations haven't been getting as much attention. We'd be thrilled to have system packages for Loki that also set it up as a service; it's just not something we've been able to spend time on ourselves yet.
Honestly I mostly throw out the Debian service definitions anyways - when clustering or interacting with Chef or Ansible or whatever, you end up building a lot of ‘smarts’ around a custom supervisor like Runit or skarnet or systemd
It isn't just Loki, but the whole stack. Grafana is the only project mentioned that has a Debian installer.
The expectation that someone doing greenfield development is going to jump into k8s just to use the software is kind of weird.
I’m deploying it (prom, alertmanager, pushgateway, grafana) on native hardware via Ansible and it’s not difficult. Not Loki (yet). It’s all just Go binaries you fire up with systemd, with a single config file.
I find it harder to deploy reliably on kubernetes with persistent volumes etc.
All of those who have spent their free time contributing to Linux distributions are why 'apt install' is easy. You can contribute too.
As the co-founder of Apache Java and a 20+ year member of the ASF, creator and contributor to hundreds of projects over the years, I think I've contributed enough of my time to OSS. I'm more than happy to let the new kids jump in. Thanks for the 'advice'.
What percentage of ASF projects use 'apt install' at all? Did the Apache folks themselves make the packages? Should we complain about ASF for not making an 'apt install' for each of their projects?
Of a random sampling of install instructions for different ASF projects, the instructions generally are "1. Install java" then "2. Download this binary" and "3. Run the binary with java". Not quite 'apt install', is it?
I wonder how AWS is supporting the development of Prometheus. Are they financing the OSS developers who are spending countless hours dedicated to the project?
AWS is an investor in Weaveworks where the implementation (Cortex) was first created. Weaveworks had two Prometheus maintainers on staff at the time.
In the announcement it says AWS have a commercial relationship with Grafana Labs, where several Prometheus maintainers, community managers, etc. currently work.
(I work for Weaveworks)
Do we know which scalable Prometheus backend they are running? Chronosphere? Thanos?
The Grafana blog post mentions Cortex, something I'm not familiar with:
https://grafana.com/blog/2020/12/15/announcing-amazon-manage...
It’s Cortex, though the particular configuration shares a lot of code with Thanos.
(I am a Cortex maintainer)
If you know technical details, are there any metrics cardinality limitations?
There are soft limits _everywhere_, to stop people shooting themselves in the foot. Those can be raised by admins after checking the user knows what they are doing.
I do not know what the practical limits are right now; especially I do not know what size hardware AWS run it on.
If you were to search the Cortex Slack you would find people talking about instances with 100 million series, also people talking about work to improve scalability.
Now that AWS has eaten the world, can we get a usable GUI or a consistent CLI?
I wonder if it will be possible to migrate your data somewhere else once it becomes too expensive.
How can I hide all Amazon and Google news on HN? Do you know of an alternative to HN without the big tech lobby? Thanks.
I'm interested in this too; the more I see AWS dominating every aspect of our lives, the more depressed I become.