Saving millions on logging: Finding relevant savings
It feels like logging is misunderstood. Critical revenue or audit logs need to be centralized, but debug logs don't. Writing debug logs to local storage and deleting them after nobody looks at them (the lifecycle of at least 99.999% of informational log statements) costs almost nothing. Another benefit is that pushing your predicate out to your edge nodes works far better than trying to get acceptable performance out of central logging facilities. So I don't understand why people waste so much money on centralized informational logs.
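For the informational tier, that lifecycle can be as simple as a severity predicate plus local rotation. A minimal Python sketch (the central endpoint, paths, and sizes are made up for illustration):

```python
import logging
from logging.handlers import RotatingFileHandler, HTTPHandler

logger = logging.getLogger("app")
logger.setLevel(logging.DEBUG)

# Debug/info stays on the node; rotation is the "delete it after nobody
# looks at it" lifecycle. 5 x 100 MB of local disk, then it's gone.
local = RotatingFileHandler("/var/log/app/debug.log",
                            maxBytes=100 * 1024 * 1024, backupCount=5)
local.setLevel(logging.DEBUG)
logger.addHandler(local)

# Only records that clear the predicate (here: severity) leave the machine.
# "logs.internal.example" is a stand-in for whatever central store you use.
central = HTTPHandler("logs.internal.example", "/ingest", method="POST")
central.setLevel(logging.WARNING)
logger.addHandler(central)
```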
Yes, shipping computations instead of data is a reasonable design goal. Your proposed system only works when the predicate is independent across all logs though, correct? If you have to correlate or join your logs to anything, then this model becomes more complex. Not to mention, you're adding an additional performance tax to your prod machines which could be more costly than shipping logs to a centralized store. (A team should profile and make a tradeoff decision appropriate to their context.)
Additionally, what happens when we want to correlate these logs with tens of other systems?
I guess I don't agree that distributed log analysis simplifies the problem any more than centralized log analysis does. If the primary concern is cost, then you can save equivalent amounts of money with a different lifecycle policy for centralized logs.
EDIT: Btw, don't get me wrong, you are asking the right questions that HubSpot's performance team should be asking. The first phase of a cost savings program should weigh benefits against cost, or stated another way, requirements vs. cost. You're asking the right question: how do we actually use this data after we log it? I find it striking that this cost analysis didn't say anything about the end user's use cases or benefits. Sure, we can optimize a system and save 40% of the cost, but what if no one is using the system? Then we could save 100% of the cost.
I once worked on a system where we were told to keep 18 months of debug logs (something that would have cost about $2k/month). When we pushed back and asked why, the answer was that occasionally (every month or every other month) there would be some customer issue that would need investigation and might result in a customer refund of $20-50.
Setting aside that the human time required for the investigation was probably close to $40-50, it was still not a slam dunk to get the business to shrink retention to a few days for critical debug logs.
Like markets... executives can stay irrational longer than you can remain sane, sometimes.
Anything that could lead to a customer revenue dispute is a critical audit log and needs to go to gold-plated log storage. But you will also be paying attention to optimizing costs for that, and the volumes will be relatively small compared to application informational logs.
Seems like the kind of situation where you shrug, agree, compress and ship those logs off to cold storage to meet the requirement for a fraction of the price.
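For illustration, a cold-storage lifecycle rule sketched with boto3 (the bucket name and prefix are hypothetical; 548 days is roughly the 18-month requirement mentioned above, and compression would happen before upload):

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="audit-logs",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "audit-logs-to-cold-storage",
            "Status": "Enabled",
            "Filter": {"Prefix": "audit/"},
            # Move to Glacier after a month, expire at ~18 months.
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 548},
        }]
    },
)
```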
That's a fair criticism in the edit. Part 2 will cover that a bit more. I did run analysis on the types of queries users ran against the data and what parts of the timeseries were used, which informed a bit of our solution. I don't want to give away too much, but lifecycle retention adjustment ends up being relatively lower value (but still worthwhile) compared to general space savings.
Thanks for the response. I was responding to the comment more so than advocating for adjusting your log retention. :) Looking forward to part 2.
Are you able to reconcile some of the numbers and calculations in the article for me? (Understanding that you don't want to reveal any confidential info.) I see:
- 31 PB data + 10 PB application logs = 41 PB logs (uncompressed json) costs 7-figures (say ~$5M)
- 41 PB logs * 5% ORC compression = ~ 2 PB logs (compressed ORC) costs low 6-figures (say ~$300k)
I don't know what timeframe that cost is measured over. But that brings us to $300k / 2 PB = $0.15 / GB which is far above S3's quoted costs so I must be missing something.
The costs I talk about in the estimate in this post are the remaining cost of each stored file. We have S3 Inventory dumping metadata of all the files in specific buckets weekly, so I had written a job that calculated the exact remaining cost of each file, accounting for lifecycle events like moving to infrequent access storage in S3 and the eventual deletion of the file. So it's sort of the "potential energy" version of the cost of our stored files: if we take no action, they will aggregate to a certain amount of money.
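As a rough sketch of that "potential energy" calculation, per object (illustrative rates and an assumed Standard-to-IA transition at 30 days, not HubSpot's actual lifecycle or contract numbers):

```python
from datetime import datetime, timezone

# Illustrative numbers only; real rates depend on region and contract.
STANDARD_PER_GB_MONTH = 0.023
IA_PER_GB_MONTH = 0.0125
IA_TRANSITION_DAYS = 30    # assumed lifecycle rule: Standard -> IA at 30 days
EXPIRATION_DAYS = 730      # ~2 year retention mentioned in this thread

def remaining_cost(size_bytes: int, last_modified: datetime,
                   now: datetime | None = None) -> float:
    """Cost one object will still incur between now and its expiration
    if we take no action."""
    now = now or datetime.now(timezone.utc)
    age_days = (now - last_modified).days
    gb = size_bytes / 1024 ** 3

    # Days still to be spent in Standard before the IA transition.
    std_days = max(0, IA_TRANSITION_DAYS - age_days)
    # Days still to be spent in Infrequent Access before expiration.
    ia_days = max(0, EXPIRATION_DAYS - max(age_days, IA_TRANSITION_DAYS))
    return gb * (STANDARD_PER_GB_MONTH * std_days
                 + IA_PER_GB_MONTH * ia_days) / 30
```

You would run something like this over every object row in the weekly S3 Inventory manifest and sum by bucket or prefix.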
I reckon you may be looking at the monthly cost of storage per gigabyte which is why the number doesn’t seem to make sense. Our retention policy started off at about 2 years, so the remaining lifetime per file amortizes out to much more than 1 month.
Also worth considering that we have a custom AWS contract, so none of our actual numbers are the publicly advertised rates and probably won’t entirely math out if you try to ballpark with those numbers.
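To make the amortization point concrete, a rough back-of-envelope using the thread's own round numbers (not actual contract rates):

```python
total_remaining_cost = 300_000    # "low 6-figures" from the estimate above
compressed_gb = 2 * 1_000_000     # ~2 PB of ORC
retention_months = 24             # ~2 year retention

per_gb_lifetime = total_remaining_cost / compressed_gb   # ~$0.15 / GB
per_gb_month = per_gb_lifetime / retention_months        # ~$0.006 / GB-month
```

So the $0.15/GB figure is a lifetime number, not a monthly rate.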
Thanks for the clarification!
Because you can't sell that as a SaaS product.
I am lost as to how you can have 20% of your storage costs be for logging and not immediately say that, at minimum, you are persisting too many logs, and probably logging too much in the first place.
I get that modern tech companies log every movement and interaction a user has with an app, far beyond any amount that is reasonable, but surely at some point you can go “we probably don’t need this”.
It shouldn't be a matter of "let's compress the logs"; it should be "are we even using these logs?"
You're right, it's not an either/or; we tackle both reducing total data and making it smaller, although I probably failed to clarify that in this first post.
At 20% of storage costs, it makes a lot of sense to focus on. Once it becomes 1% of storage costs it's maybe not as problematic. The degree to which "let's compress the logs" changes how much something like "am I logging too much?" matters is important. Taking it to the absurd: if logging storage were free, why not retain all logs? And if logging is cheap, why invest in complicated guardrails for what qualifies as an important log?
A specific consideration for us is organizational inertia. We have a lot of teams using infrastructure in a lot of ways, both intended and unintended. One thing that has been emphasized for us, for better or worse, is developer velocity, which includes abstracting "do I need to log this?" away from most engineers. We have some guardrails to alert if you log some egregious volume.
I think we do often opt for non-invasive infra solutions first because they have much shorter delivery times and less risk of stalling on long-tail outliers. They avoid the very expensive organizational costs of buy-in and team-level migration. I'm not suggesting this is the best organizational model, but it also transcends one team's influence.
That ends up circling back to the start of the problem. If we can transparently reduce the cost burden of some heavily-used internal infrastructure by roughly the same magnitude as a paradigm shift in how that infrastructure is used, the former wins out.
I agree with you, but never underestimate fear of litigation. Legal and HR departments everywhere love to have extremely detailed logs for years that can refute ex-employees' claims about one thing or another.
Using logs to derive telemetry is a short term strategy, not a long term one. This degree of logging suggests they’ve been doing everything with logs.
Hubspot offers hosted websites and landing pages (including analytics) as part of their product, so it's not just internal facing data.
Unlike JSON, ORC requires batching of rows to write to disk. That's because it does a lot of computation: maintaining indexes, encoding columns (run-length, dictionary), calculating statistics, maintaining bloom filters, compressing columns, etc. Doing this at the source, where you are more interested in serving an individual request as quickly as possible, doesn't look like a good idea. If you want the ORC files to be useful, you need to batch a lot of rows together; otherwise you don't get the benefits of columnar storage. So logs in the happy path will be delayed, and in the unhappy path, if the process crashes, recent logs are gone. JSON isn't really bad as a logging format, and it can be stored temporarily and then asynchronously converted to a columnar format.
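A minimal sketch of that asynchronous conversion with pyarrow, assuming the log files are newline-delimited JSON and share a schema (paths are placeholders):

```python
import pyarrow as pa
import pyarrow.json as pj
import pyarrow.orc as orc

def convert_batch(json_paths: list[str], orc_path: str) -> None:
    """Roll many small NDJSON log files into one large ORC file, so the
    columnar encodings (dictionary/RLE, statistics, bloom filters) and
    compression actually pay off."""
    tables = [pj.read_json(p) for p in json_paths]  # NDJSON -> Arrow tables
    batch = pa.concat_tables(tables)                # assumes matching schemas
    orc.write_table(batch, orc_path, compression="zstd")
```

The services keep writing plain JSON in the hot path; a separate batch job runs this conversion on its own schedule.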
I'm looking forward to the next post.
Intercepting network traffic like this is an interesting approach to the problem.
If each service has a unique IAM role, which it definitely should, wouldn't you be able to track this via a combination of CloudTrail and proper resource tags?
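For the cost side of that, activated cost allocation tags can be queried through the Cost Explorer API (this doesn't cover the CloudTrail activity part, and the "team" tag key here is just an example):

```python
import boto3

ce = boto3.client("ce")
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],  # assumes "team" is an activated cost allocation tag
)
for group in resp["ResultsByTime"][0]["Groups"]:
    print(group["Keys"][0], group["Metrics"]["UnblendedCost"]["Amount"])
```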
(Author here). Yes I believe you are correct with regards to tracking application utilization of say EC2 and other AWS resources.
The post fails to mention that this system is also tracking internal data dimensions like customer IDs, so we can also use this sampled data to estimate per-customer cost (and join that with customer tiers, and so forth).
I'm also not sure if that would allow us to attribute the cost of our datastore utilization, since those are not AWS-hosted versions but ones we run ourselves. The traffic interception lets us say that Application A is using 75% of database cluster XYZ, and therefore that application/product group is most likely responsible for that share of what the database costs.
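A minimal sketch of that share-based attribution, assuming sampled call counts per application against one self-hosted cluster (real systems might weight by execution time instead of raw call count; see the discussion below):

```python
def attribute_cluster_cost(monthly_cost: float,
                           sampled_calls: dict[str, int]) -> dict[str, float]:
    """Split a cluster's known monthly cost across applications in
    proportion to their share of sampled calls."""
    total = sum(sampled_calls.values())
    return {app: monthly_cost * calls / total
            for app, calls in sampled_calls.items()}

# e.g. cluster XYZ costs $10k/month and app-a shows up in 75% of samples:
attribute_cluster_cost(10_000, {"app-a": 7_500, "app-b": 2_000, "app-c": 500})
# -> {"app-a": 7500.0, "app-b": 2000.0, "app-c": 500.0}
```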
The last thing I'll mention is that CloudTrail has the potential to be expensive on its own, I believe at least more so than us storing the raw data in S3 for something like Athena to read. I don't think I'll be writing about it, but we've also done work this last year to trim down what we track in CloudTrail due to the cost of events (for example, tracking everything in S3 ends up being pretty expensive).
When it comes to shared resources like a database cluster, you’re making the assumption that usage is correlated with number of connections.
Is this always true? Typically the shared resources you care about are CPU, memory and disk. I would say an application issuing fewer, much heavier queries is using the shared resource more than an application that issues more really simple queries. And this doesn’t correlate much to disk usage right?
There isn’t really a good solution to this. You can use a combination of query sampling and per-app databases to correlate this better.
Great post though, this is something we’ve been dealing and experimenting with recently.
Your observations are correct. I wouldn’t portray it as an ideal system, just best effort. In the end we care more about the finer details being good funnels to follow up on rather than being exact. We know our real costs of resources, which is important for finance and budget. And then we have the approximate attributions from the sampling which narrow things down enough to focus diagnosis when needed.
I would have to read more into how it intercepts some of our database calls to confirm whether it tries to weight by execution time where it wires into database client code, which would probably be useful and could help approximate utilization to a degree.
I think in practice it's a bit uncommon for the heaviest user to also be a sparse user in terms of volume. But I can also admit there are quirks to how it samples: I once personally spent a couple of days tracking down a surprising cost of an application I owned, only to later confirm it was a data flaw in how we were doing this sampled attribution (in that case the heaviest users were un-instrumented infra processes that can't just wire in our Java cost attribution library, making it artificially look like my app was the heaviest user).
I would like to know more about the Kubernetes databases: what kind, the challenges, how the fault domains are configured, etc.
Also, it's not clear to me how intercepting calls helped you figure out the offending services.
I would have to defer to one of my colleagues for most of the details on running data infrastructure on Kubernetes; I'm not that close to that domain. The major ones we run are HBase, Vitess (MySQL), Kafka, Elasticsearch, Memcached, and Zookeeper.
As for the call sampling/interception, that did not factor into discovering the high-cost buckets in the logging case study. It was mostly relevant to generally describing how we track costs, and it ends up being useful in other scenarios. For example, it could be used to estimate the unit economics of customers subscribed to a specific product tier.
We also have the death star microservice model, so even relatively simple attribution can be helpful when you want to run a query like “for my team which owns 30 applications, tell me the monthly attributed cost grouped by resource” and that will be able to return all the associated database and cloud costs.