How to walk Datadog without running out of money


Almenon

So you’re interested in adding an APM, logging, or monitoring tool to your tech stack. Maybe you’re tired of maintaining Elasticsearch. Maybe you just don’t like Dynatrace. Whatever the case, you’ve heard of a new dog in data town: Datadog.

Cute, but wait till you see the price tag

You create an account and try it out. Setting it up on Kubernetes with helm is a breeze. All your logs and data are automatically forwarded. It’s easy and fast to query your data, and you can make stylish charts by the dozen. You love the UI and, more importantly, the power-user shortcuts. The breadth and depth of features make Datadog feel like the AWS of the monitoring world. Soon you’re hooked. What could go wrong?

You sign a contract and forward your production data to Datadog. There are a few hiccups, but things work out pretty well. That’s when you see your first real bill.

Oh no.

Remember all those articles about people accidentally running up gigantic AWS bills? Well, Datadog has the same problem. A simple configuration error can easily set you back a grand. Hitting the max spend on your credit card? Yep, I’ve seen that. If you misunderstand the billing model, or assume “monthly contract” means “monthly discount”, the bill can be far more than expected.

As one reddit user says:

wingerd33: “The Datadog pricing model is actually pretty easy. For 500 hosts or less, you just sign over your company and all its assets to them. If >500 hosts, you need to additionally raise VC money.”

Let’s go through some Datadog offerings:

Logging

You pay for the number of log lines you index and for the volume of logs you ingest. You can make this cheaper by lowering your retention window and archiving older logs to an S3 bucket; in my experience developers rarely need logs older than two weeks. The catch is that people will sometimes write inefficient log-archive queries (“rehydration” queries) that can cost hundreds of dollars in re-ingest. Make sure to pair log archiving with training on how to use and query it, and set a maximum scan size, which has the added bonus of forcing developers to run scan-size estimates first.
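As a rough sanity check, the billing model above can be sketched in a few lines. The rates below are assumptions for illustration only (real prices depend on your contract and retention tier), and `monthly_log_cost` is a hypothetical helper, not a Datadog API:

```python
# Back-of-the-envelope Datadog logging cost estimate.
# Both rates are ASSUMPTIONS for illustration; check your own contract.
INGEST_PER_GB = 0.10      # assumed $/GB ingested
INDEX_PER_M_EVENTS = 1.70 # assumed $/million indexed events at 15-day retention

def monthly_log_cost(gb_ingested: float, millions_of_events: float) -> float:
    """Estimate a month's logging bill from ingest volume and event count."""
    return gb_ingested * INGEST_PER_GB + millions_of_events * INDEX_PER_M_EVENTS

# e.g. 500 GB and 800 million log events in a month:
print(monthly_log_cost(500, 800))  # 50 + 1360 = 1410.0
```

Shortening retention or archiving to S3 effectively shrinks the second term, which is why it is usually the first lever to pull.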

You can also try to save money by setting log exclusion filters on Datadog’s website. The catch is that you still pay to ingest excluded logs, but that’s better than paying for both ingestion and indexing. If you really want to save money, exclude logs before they ever reach Datadog by raising your log level or configuring your Datadog agent.
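Raising the log level is the simplest agent-independent version of this. A minimal sketch using only Python’s standard `logging` module:

```python
import logging

# The cheapest log line is the one you never ship: raise the level at the
# source so chatty records never reach the Datadog agent at all.
logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("app")

log.info("cache refreshed")       # suppressed at the source, never ingested
log.warning("retrying upstream")  # still emitted and shipped
```

The same idea applies in any language: filter as early in the pipeline as you can, because every later stage costs money.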

I highly suggest setting up a monitor to alert you when there’s a spike in logs. It’s fast and easy to create. See https://docs.datadoghq.com/logs/guide/best-practices-for-log-management/#alert-on-unexpected-log-traffic-spikes

A possible cause of excess logs is long stack traces. I suggest avoiding these by using JSON logging, so stack traces come as a single line. As a bonus, JSON is automatically parsed by Datadog.
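A minimal sketch of such a formatter, using only Python’s standard library (the `JsonFormatter` class is my own illustration, not something Datadog provides):

```python
import json
import logging
import traceback

class JsonFormatter(logging.Formatter):
    """Emit each record, including any stack trace, as one JSON line."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "message": record.getMessage()}
        if record.exc_info:
            # The whole traceback becomes a single escaped string field,
            # so it counts as one log line instead of dozens.
            entry["stack"] = "".join(traceback.format_exception(*record.exc_info))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.propagate = False  # avoid duplicate plain-text output via the root logger

try:
    1 / 0
except ZeroDivisionError:
    log.exception("division failed")  # one JSON line, newlines escaped
```

Because `json.dumps` escapes the newlines inside the traceback, the multi-line stack trace arrives as a single event, and Datadog parses the JSON fields automatically.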

Edge Delta offers a service that claims to reduce Datadog logging cost. I haven’t tried it, but it looked interesting.

Custom Metrics

Datadog charges based on your average number of custom metrics over the month. Simple, but you can still get an unexpected bill. Why? A custom metric is defined as “a unique combination of a metric’s name and tag values”. If you have two metrics, and each has a tag with 1,000 values, you’re really looking at 2,000 billable metrics. It gets better: if each metric has another tag (A/B), then you have 2 × 1,000 × 2 = 4,000 custom metrics. If the extra tag has not just two but 1,000 possible values, then you have 2 × 1,000 × 1,000 = 2,000,000 metrics. All this from just two metric names! It gets even more complex than this; I suggest reading the full article.
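The multiplication above can be captured in one hypothetical helper (simplified: it assumes every tag combination is actually emitted, which is the worst case):

```python
from math import prod

def billable_metrics(metric_names: int, tag_cardinalities: list[int]) -> int:
    """Worst-case custom metric count: names x product of each tag's
    distinct values. Tag combinations that are never emitted aren't billed,
    so this is an upper bound."""
    return metric_names * prod(tag_cardinalities)

print(billable_metrics(2, [1000]))        # 2,000
print(billable_metrics(2, [1000, 2]))     # 4,000
print(billable_metrics(2, [1000, 1000]))  # 2,000,000
```

The practical takeaway: never tag metrics with unbounded values like user IDs or request IDs, because each new value mints a new billable metric.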

To be fair to Datadog, this is not a Datadog-specific billing issue. You could run into this cardinality problem with any Prometheus-style database.

Infrastructure Monitoring

This is very simple. Take the p95 percentile of the lower 99% of your host-hours above the high-water mark line, add in the baseline commitment, rank from most primes to least, and finally take the eighth highest measurement, unless you’re on the MHP plan, in which case you should baste the skin of every third EC2 host and multiply your Kubernetes nodes by 15. Simple.

But seriously, it can be a bit complicated. Let’s start with the basics.

Each hour, Datadog records the number of unique hosts you are monitoring in the Infrastructure service.

Don’t skim this. Really think about it. What are the implications of this?

The key word here is unique. This is even worse than taking the max, and has very unfortunate implications in certain scenarios. Let’s say you have one host up for one hour. The pricing would be 1 × $0.03/hr = 3 cents. Now let’s say you have host A up for 20 minutes, host B for 20 minutes, then host C for 20 minutes. Only one host is up at any given time, so effectively it is the same as before. However, the pricing would be 3 × $0.03/hr = 9 cents. We just tripled in price!
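The unique-host rule can be modeled in a few lines, using the illustrative $0.03/host-hour rate from above (not a real list price):

```python
RATE = 0.03  # assumed $/host-hour, for illustration only

def hourly_bill(hosts_seen_this_hour: set[str]) -> float:
    # Datadog counts every UNIQUE host seen during the hour,
    # not the number running concurrently.
    return len(hosts_seen_this_hour) * RATE

static = hourly_bill({"host-a"})                        # one host all hour
rotating = hourly_bill({"host-a", "host-b", "host-c"})  # 20 minutes each
print(static, rotating)  # 0.03 vs 0.09: same capacity, triple the cost
```

This is exactly the trap autoscaling walks into: churned-through nodes each count for the full hour they appear in.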

If you have a static set of EC2 servers, you probably went ‘okay, big deal. Who cares about this weird scenario?’ But if you’re running Kubernetes or other auto-scalers, you might have just gone ‘aaaaaaaaahhhhhhhhhhhh’. Nodes in Kubernetes are constantly at peril of being replaced, coming up, or going down. As a consequence, there’s a good chance your host cost will be higher than you expect. Keep this in mind as we go through the two different plan types.

High-water Mark

With the high water mark plan you are billed based on the 8th-highest hour in a list of hours sorted by number of hosts. The idea is you avoid outliers, but sometimes the outliers last longer than 8 hours, and your eighth highest hour is generally pretty expensive, so this plan may have severe drawbacks.
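The “8th-highest hour” rule is easy to sketch (a simplified model; the function and the 720-hour month are my own illustration):

```python
def high_water_mark_hosts(hourly_host_counts: list[int]) -> int:
    """Billable host count under the high-water-mark plan:
    the 8th-highest hourly unique-host count of the month."""
    return sorted(hourly_host_counts, reverse=True)[7]

# A 720-hour month: a steady 10 hosts, with one 9-hour spike to 50.
month = [10] * 711 + [50] * 9
print(high_water_mark_hosts(month))  # 50: a 9-hour spike sets the whole month's bill
```

Seven spike hours or fewer would be discarded as outliers; hour number eight is the one that lands on your invoice.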

Commitment

With the monthly commitment plan you commit to a certain number of hosts at a discounted price, while paying regular rates for any hosts above that. A 9-hour spike would just incur regular rates for those 9 hours, instead of ruining your entire month as in the high-water-mark plan. As a consequence, you could save tens of thousands of dollars on this plan. However, it still has drawbacks. Take the following graph.

  • The blue line is the number of hosts
  • The red line is what you committed to

If you average out the blue line, it approximates the red line, so you might think you have satisfied your commitment and will pay everything at reduced rates. That is not the case. You always pay for the committed amount: the red line, plus on-demand rates for the blue line whenever it rises above the red line.
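A hedged sketch of the commitment math, with made-up illustration rates (the asymmetry is the point: dips below the commitment earn nothing back, while every hour above it is billed on-demand):

```python
COMMIT_RATE = 15.0       # assumed $/host/month for committed hosts
ON_DEMAND_HOURLY = 0.03  # assumed $/host-hour for overage

def commitment_bill(committed_hosts: int, hourly_counts: list[int]) -> float:
    base = committed_hosts * COMMIT_RATE  # paid even when usage dips below
    overage_hours = sum(max(0, h - committed_hosts) for h in hourly_counts)
    return base + overage_hours * ON_DEMAND_HOURLY

# Commit to 20 hosts; usage swings between 10 and 30 and AVERAGES 20:
month = [10] * 360 + [30] * 360
print(commitment_bill(20, month))  # 300 base + 108 overage = 408.0
```

Averaging exactly to your commitment still costs more than a flat month at the committed count, because only the overage half of the swing is billed extra.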

Important note: When I say ‘on-demand’ rates, I mean ‘expensive’ rates.

In conclusion, do not average out your hosts and assume you matched the commitment. Accounting will expect the lump-sum payment for the committed amount at the start of the month or year to be the whole bill. They will not be happy to receive surprise on-demand charges on top of it. This brings us to another plan type.

The Secret Plan Type

This is an invite-only plan. I could tell you about it, but then I would have to kill you.

APM

Application Performance Monitoring bills on both hosts and number of spans. The host billing model is similar to the infrastructure host billing model. Span billing is relatively straightforward: you pay for each unit of work (“span”) recorded in an APM trace, with 1 million free spans per host. That sounds like a lot of free spans, but trust me, it’s not. A program can easily record thousands of spans while processing a single web request, and you may get thousands of those requests, so 3,000 × 3,000 = 9,000,000 and hey, look, you’re 9 times past your free spans already. Make sure your developers don’t go wild adding spans, and keep a close eye on your span count or set up an automated monitor.
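The span arithmetic is easy to sketch (the free allotment and the helper are illustrative; check your plan’s actual numbers):

```python
def billable_spans(requests: int, spans_per_request: int, hosts: int,
                   free_per_host: int = 1_000_000) -> int:
    """Spans billed after subtracting the per-host free allotment
    (an assumed figure for illustration)."""
    total = requests * spans_per_request
    return max(0, total - hosts * free_per_host)

# 3,000 requests x 3,000 spans each, all on one host:
print(billable_spans(3_000, 3_000, hosts=1))  # 8,000,000 over the free million
```

Sampling (keeping only a fraction of traces) is the usual lever here; it scales `spans_per_request` down without touching your instrumentation.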

And More

There are many more Datadog offerings, each with its own quirks. One simple way to reduce the price of an offering is to commit to a monthly or yearly contract. This does not apply to all offerings; ask your sales representative for the exact discount on each product you commit to. If you’re a moderately sized company committed to a plan, you should be assigned a CSM (Customer Success Manager). CSMs are less biased towards selling you products than sales representatives, so they can offer great insight on your Datadog usage and billing strategy.

Q: I’m lazy, can I just give Datadog their monthly wheelbarrow full of cash instead?

Yes, and this is not actually a stupid question. If you make enough money you don’t need to worry about the bill. Your time is better spent on improving your company’s product. Unfortunately, the ‘right’ amount of money to spend is very hard to nail down.

  • Honeycomb estimates that spending 30% of your cloud bill on observability is normal. If you’re skeptical, you’re right to be: this is a biased figure, since Honeycomb wants you to spend more money on them.
  • Random Redditors estimate 5–10% is normal.
  • finout.io says its customers (likely medium to large companies) spend 2–6% of their cloud bill.

Q: What if I just… don’t use Datadog?

Sure, you could do that. It may even work out well for you. There’s a lot of good competition out there, like Grafana Labs and Lightrun. On the open-source side, Loki+Prometheus is a great choice. But keep in mind these three warnings:

  1. Other paid solutions can be expensive too. At the end of the day, all these companies are trying to get your money. I’ve personally seen high costs with Dynatrace, Splunk, and Elastic as well. I’m not trying to call out those companies specifically; the point is you should always be on your guard.
  2. Self-hosted solutions require more maintenance. Deploying a simple stack can be done in a day, but getting that stack HA, bug-free, and integrated with all your services can take months. If you go this route, make clear to leadership that you will need more time or headcount. To get your price estimate, add together infrastructure cost and labor cost. Multiply your hourly wage by 1.2–1.4 to get the actual cost to the company, since your hourly wage does not include the bonuses, healthcare, and taxes that your company pays.
  3. Datadog is simply one of the best solutions out there, and you pay accordingly.
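For point 2, the labor math can be sketched like this (the overhead multiplier range comes from the text; every other figure is illustrative):

```python
def self_host_monthly_cost(infra: float, labor_hours: float, wage: float,
                           overhead: float = 1.3) -> float:
    """Infra cost plus fully loaded labor: wage x a 1.2-1.4 overhead
    multiplier for bonuses, healthcare, and employer taxes."""
    return infra + labor_hours * wage * overhead

# e.g. $2,000/month of infrastructure plus 40 engineer-hours at $75/hr:
print(self_host_monthly_cost(2_000, 40, 75))  # 2000 + 40*75*1.3 = 5900.0
```

Run this against a realistic maintenance estimate before declaring self-hosting “free”; the labor term usually dominates the infrastructure term.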