Show HN: Infracost (YC W21): Be proactive with your cloud costs
Hi, we are Ali, Hassan, and Alistair, co-founders of Infracost (https://www.infracost.io/). Infracost helps engineers see the cost of each Terraform change before launching resources. When changes are made, it posts a comment with the cloud cost impact. For example: “you’ve added 2 instances and volumes, and changed an instance type from medium to large; your bill will increase by 25% next month, from $1000 to $1250 per month”.
We launched in February 2021 (https://news.ycombinator.com/item?id=26064588), and Infracost is now being actively used by over 3,000 companies. However, there is a shift happening in the cloud cost management space. New teams, called FinOps teams (a combination of "Finance" and "DevOps"), are being formed within companies to manage cloud costs.
One of the first tasks assigned to these teams is to determine "who is using what" - that is, which teams, business units, products, etc. are spending the most on cloud. To accomplish this, they use tags: key-value labels that every cloud resource should have. For example, a server could be tagged with: product=HackerNews; environment=production; team=blueTeam. So if resources are not tagged properly, you can’t tell who is using what.
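A tagging policy like this boils down to a simple check of each resource's tags against a set of required keys and allowed values. Here is a minimal sketch in Python; the policy format and function are illustrative, not Infracost's actual implementation:

```python
# Minimal sketch of a tag-policy check. The policy format is
# hypothetical, not Infracost's actual implementation.

POLICY = {
    "product": None,  # None means any non-empty value is allowed
    "environment": {"production", "staging", "dev"},
    "team": {"blueTeam", "redTeam"},
}

def check_tags(tags):
    """Return a list of policy violations for one resource's tags."""
    violations = []
    for key, allowed in POLICY.items():
        if key not in tags or not tags[key]:
            violations.append(f"missing required tag: {key}")
        elif allowed is not None and tags[key] not in allowed:
            violations.append(
                f"invalid value {tags[key]!r} for tag {key!r}; "
                f"allowed: {sorted(allowed)}"
            )
    return violations

server_tags = {"product": "HackerNews", "environment": "production",
               "team": "blueTeam"}
print(check_tags(server_tags))                # [] -- passes the policy
print(check_tags({"environment": "prod"}))    # three violations listed
```

A CI step would run a check like this over every resource in the plan and fail (or comment on the pull request) if any violations come back.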
However, FinOps teams face challenges because their tools are reactive. These tools start by analyzing cloud bills and derive tag visibility from them. This means they are looking at resources that are already running in production and costing money. A customer recently shared, “I want all resources to be properly tagged. But if they are not, I would rather a resource not be tagged at all than be tagged incorrectly.”
My "aha" moment! FinOps teams can define a tagging policy that is validated in CI/CD before resources are launched. This is important because if code is shipped with the wrong tags, FinOps teams will have to fight for sprint time to fix them. Even if you shut down an untagged resource directly in the cloud, the next time Terraform runs, the resource will launch again with no tags. You need to fix the issue at its root.
I’d love your feedback on our solution to the tagging problem. You define your tag key-value policy in our SaaS product, and Infracost checks all Terraform resources per change. If anything fails the policy, it posts a comment with the details of which resources need tags, and what the allowed values are. Once fixed, it will let the code be shipped to production.
Try it out by going to https://dashboard.infracost.io/, setting up with the GitHub app or GitLab app, and defining your tagging policy. It will then scan your repository and inform you of any missing tags and their file and line number. You can use the free trial, but if you need more time, please message me and I’ll extend it for you.
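Reporting missing tags with file and line numbers comes down to scanning the Terraform source and noting where each resource block starts. Here is a toy, regex-based illustration in Python (real tools parse HCL properly; this is not Infracost's implementation):

```python
import re

# Toy scanner that flags Terraform resource blocks with no `tags`
# attribute. Regex-based sketch only -- real tools parse HCL properly;
# this just shows the "report file and line number" idea.

TF = '''\
resource "aws_instance" "web" {
  instance_type = "t3.medium"
  tags = { team = "blueTeam" }
}

resource "aws_s3_bucket" "logs" {
  bucket = "my-logs"
}
'''

def untagged_resources(source):
    findings = []
    lines = source.splitlines()
    for i, line in enumerate(lines, start=1):
        m = re.match(r'resource\s+"(\S+)"\s+"(\S+)"', line)
        if not m:
            continue
        # Naive: collect body lines until the closing brace at column 0.
        block = []
        for nxt in lines[i:]:
            if nxt.startswith("}"):
                break
            block.append(nxt)
        if not any("tags" in b for b in block):
            findings.append((f"{m.group(1)}.{m.group(2)}", i))
    return findings

print(untagged_resources(TF))  # [('aws_s3_bucket.logs', 6)]
```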
I would also love to hear how others ensure that the correct tag keys and values are applied to all resources, and whether this is done proactively or reactively. Additionally, I would be interested in hearing about any lessons learned in the process.
Cheers

I have used this tool in the past, though free tier only. It was easy to get up and running and easy to plug into a CI/CD pipeline. The problem we had with it in practice was that we largely preferred serverless technologies in AWS, where the cost depended mostly or even completely on actual usage - things like Lambda invocations, SQS operations, or autoscaling ECS services, for example. In this case the estimates we got from Infracost were not very useful. Providing a meaningful cost estimate requires projecting usage, which is something that our development teams were very bad at, if they could be bothered to care at all. I like the idea of implementing tagging enforcement in the pipeline. In a perfect world you would use cloud policies to do this, but in practice this is a big loser in AWS, where a staggering number of resources are created by one API call and then tagged as a followup API call, meaning an SCP to prevent launch of untagged resources won't ever work.

Great point about the multiple API calls. One of the big problems we’ve heard about using SCPs is that they are too late: if a deployment fails because of them, the developer needs to go through another pull request/code review. Estimating costs for serverless technologies upfront is definitely challenging. We're thinking of bringing in the last 30 days of usage for these resources to give engineers some visibility.

I've not used the product, so it may already do this, but does it ask you for the data it needs in the pull request? I have experience interacting with a logging system, where any diff to the logged data would need a tag like `log_size_increase=3 bytes` - the CI/CD system would then turn this, with the data already available, into an estimate of the overall extra storage needed. Perhaps the same could be done. Rather than figuring out "usage" of some serverless systems, which is a very vague question and therefore hard to answer, perhaps it could be more specific.
For example, how many requests per second is it expected to receive? Or, which other serverless functions call it (and therefore whose scale it will necessarily have)? Or, what increase in usage would be expected from this change?

It's been a while since I used this tool, but as best I can recall there was a way to provide usage estimates to feed the variable cost calculations. The biggest problem we had was getting development teams to know and care enough to provide accurate numbers. The suggestion in the post below to provide 30 days of historical data as a starting point could be a great way to establish a meaningful baseline. If someone had better projections, they could provide them, but at least it wouldn't be a total crapshoot.

> but in practice this is a big loser in AWS where a staggering number of resources are created by one API call and then tagged as a followup API call

We have a bot at work that sends you (or a DL with a bunch of people) a nastygram if you forget to tag your resources, but it doesn't know this. So if CloudFormation isn't done, you'll get the email and then have to respond to everyone with a screenshot showing that you didn't in fact goof it up. I wonder if you can make it so EventBridge (or however it's implemented, I'm not sure) delays the event for 30 seconds, so it doesn't actually look until CloudFormation is done tagging.

I love it! But as human processes go, it will need to surmount the "flaky tests" problem of "let's just turn off this test because it's flaky and we need to merge this branch". I guess that means FinOps teams will still have to fight to be heard, but I think you are helping shift a lot of their burden! What remains seems more like organizational dynamics, but what are your thoughts?
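To make the "requests per second" suggestion above concrete, a back-of-the-envelope Lambda projection might look like the sketch below. This is illustrative Python, not Infracost's method; the per-request and per-GB-second prices are approximate us-east-1 figures and should be checked against current AWS pricing:

```python
# Back-of-the-envelope Lambda cost projection from a usage estimate.
# Prices are illustrative us-east-1 figures; check current AWS
# pricing before relying on them.

PRICE_PER_REQUEST = 0.20 / 1_000_000   # USD per request
PRICE_PER_GB_SECOND = 0.0000166667     # USD per GB-second

def monthly_lambda_cost(requests_per_second, avg_duration_ms, memory_mb):
    seconds_per_month = 60 * 60 * 24 * 30
    requests = requests_per_second * seconds_per_month
    gb_seconds = requests * (avg_duration_ms / 1000) * (memory_mb / 1024)
    return requests * PRICE_PER_REQUEST + gb_seconds * PRICE_PER_GB_SECOND

# 50 req/s, 120 ms average duration, 256 MB memory:
print(round(monthly_lambda_cost(50, 120, 256), 2))  # ~90.72
```

The point of the thread stands either way: the hard part is not the arithmetic but getting engineers to supply the usage numbers, which is why seeding the calculation with the last 30 days of real usage is attractive.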
Great point - indeed FinOps teams consistently rank "empowering engineers to take action" as their number 1 challenge (https://data.finops.org) - and by that they mean the human and organizational dynamics of the culture change they want to create across the org. The testing analogy is a good one, as this feature also shows engineers the current "failing policies" on the main branch, so whilst they could merge the pull request without fixing the tagging issue, it'll just get added to the list. And maybe, like tests, they group them into one task and go through and fix them all every so often to get the main branch back to green!

Nice! What did you start out doing, if you don't mind me asking? And how did you come to this pivot, if that's what it is?

We started out with the Infracost CLI showing engineers cost estimates in the terminal before they deployed their code. The learning was that it also makes sense to check for other things, like tagging policy issues and best practices not being followed, as these are more actionable than showing engineers a cost estimate. The cost estimate is actually more useful to trigger notifications on, e.g. if an engineer is adding $10K worth of databases, let the engineering management or FinOps teams know so they're not surprised by the spike in the bill and can adjust budgets if needed.

Cool! BTW do you know a great dashboard where I can compare VPS costs across all providers?

For anyone following at home, once you've identified a test as flaky, your next action should be to turn it off. Nothing good comes from keeping flaky tests around. Detect them as soon as you can and either fix them _right there_ or skip them. I've used this in practice in a company of ~80 developers at the time, applied it because I read about it in some Dropbox papers, and have since seen it work in 2 other companies. Skip your flaky tests!!
I suppose the difference between flaky tests and typos in tags/missing tags is that the latter is less about flakiness and more about the engineer deciding not to fix the tagging issue and merging anyway. In Terraform, tags are fairly easy to fix and don't require the resource to be recreated, so it feels like it should be quicker than fixing/refactoring tests. I think the easier we make it for engineers to fix tagging issues, the more likely it is that engineers will take action. Send me an email asking me to read the company's wiki page on tagging policy and I'll delete the email; tell me I have a typo on line 8 as soon as I open my pull request, and I'll fix it and move on.

I love how we just build complexity upon complexity: a tool for all the problems introduced by a new tool that was itself solving problems from all these other tools. A never-ending mountain of complexity. In that sense coding (and hosting) is like the law: the entire ecosystem will just keep expanding in complexity decade by decade.

Counterpoint to this is that the rate at which you can grab some new infra and try something out is amazing. You need a Redis cache? Or a GPU? Here, have one in a few seconds, instead of next month or never. It is a bit like complaining about making tractor engines more efficient when you didn't have those problems using animals and ploughs.

You are describing corporate red tape in your organization. Those things were already possible before the cloud hype of the last 5-8 years. Anyway, the point is also not that cloud didn't solve any issues, but that by solving 1 issue you get 5 new issues, and you solve one of those and you get again 5 new ones. And the loop keeps going endlessly.

> You need a Redis cache? Or a GPU?

But how often are these things truly *necessary*? I think they seem more necessary than they really are because they are so readily available. How else will people make money (if they are a startup) or get promotions (if they are in a corporate)?
So as a developer advocate and a tinkerer with a little home lab, I am often setting up and tearing down infrastructure to test things out. I use AWS because that's what I am most familiar with, and I try to be super careful about not running up a crazy bill, but I am still occasionally caught off guard. Now I am thinking about testing out Infracost to help prevent unwanted and unintentional spending. As far as a solution to consistent tagging - if I am understanding the problem space correctly - something like Cloud Custodian could possibly help. It's open source and you can set up auto-tagging policies as well as use Cloud Custodian to backfill tags. These policies use Lambda functions to respond to certain actions (i.e., spinning up an EC2 instance, etc.) and auto-tag with the resource creator/owner.

Bingo - so Infracost will tell you, before you launch anything, how much it'll cost. Now scale that to a few thousand developers across a large company, and it's very impactful. Backfilling tags works, but the issue is that if Terraform isn't updated, it causes drift - it's much better to fix it at the root, and that's what Infracost helps with.

Sounds useful! We're using service control policies to enforce tagging on certain resource types, and retroactively for the rest. We're considering a "shift-left" tool as well, but it would need to support Terraform, CDK, Serverless and CloudFormation.

Awesome, yeah we’ve seen people using this method, and the main complaint we’ve heard is that it is annoying for developers since it blocks their deployments when they run `terraform apply`, so they need to create new pull requests and wait for another code review. Combining both can definitely help with this.

Not sure if by shift-left you mean specifically shifting left Infracost and FinOps, or general infrastructure-as-code shift left. In case it's the latter, I can recommend Spacelift[0] - a specialized CI/CD tool for IaC that supports all the tools you've mentioned.
It basically helps you build policies and orchestrate your infra (don't want to go into too much detail in this comment) to scale it to bigger teams and setups. Policies to enforce tagging would indeed be a good example. It integrates with Infracost too, but obviously just for the tools Infracost works with, no CloudFormation. Disclaimer: I work at Spacelift, so obviously take the recommendation with a grain of salt, but I do legitimately think it's a great tool. [0]: https://spacelift.io P.S. Congrats on the Show HN, Infracost team!

I've seen Infracost around and think it looks cool, do you have any plans to add support for Pulumi? One advantage Terraform seems to have over Pulumi is the ecosystem of tools that support it.

Yep, for sure! It's on the roadmap. We are friends with the folks at Pulumi too. Love what they are building, so hopefully we will get some bandwidth and add support. And CloudFormation too. Azure ARM ... haha, there is a lot more to build :)

I'm a Pulumi employee; if you'd like to chat about how we can help add support, email me at lbriggs[at]pulumi.com

Hi, Azure admin here. The Azure Policy service includes a set of built-in policies to handle tags. There's one policy that requires new resource groups to be created with specific tags. Another policy allows resources within the resource group to inherit the same tags. I think this combination of policies would solve the tagging problem quite neatly, though I haven't tested it myself.

Hi, I think the key issue with both the Azure policy and the Amazon services is that they only work after a pull request has been merged. Then the build fails, and the engineer has to come back to their code, make a new pull request and send it again until it passes. That's the feedback we got from users, so with Infracost the pull request itself tells the engineer what needs to be done, along with exact code line numbers etc., before going any further, so everything is fixed within the same pull request.
Also, it works across all cloud providers, so FinOps can set central tags in a uniform manner no matter where engineers are launching resources.

I'm not a target user, but you mentioned the tagging problem and git integration; perhaps you could infer at least the git user responsible for each resource cost (git blame the TF file and identify the username who added the resource) as a minimum amount of detail provided out of the box?

Interesting idea! The pull request authors are shown out of the box, but we hadn't thought of using git to find the user for each resource on the main branch. Most organizations end up tagging resources with some sort of owner or team so they can group the costs using that and track it per team/service/product over time. That's often how FinOps teams start to create a sense of ownership of cloud costs amongst teams.

I would like to know more about how Infracost does dynamic cost estimation. For example, if I allocate an S3 bucket, I have no idea how much it'll grow, so what does it show? Or what about EC2 with Batch, or Lambda? Does it force the developer to estimate the usage pattern, or...?

Yes, usage metrics are set via a configuration file, which you can also check into git. Changes to resources as well as usage estimates contribute to the forecasted costs.

Oh, this is interesting, we're currently using Vantage (https://vantage.sh) - how would you say Infracost compares?

Vantage is awesome - I've talked with Ben (their CEO) a few times. There are a lot of tools that start from the cloud bills and give you visibility of everything (Vantage, Cloudability, CloudHealth Tech, Flexera etc.) - all of these tools are reactive in nature as they start from the cloud bills. Infracost sits where your code sits, and therefore it can be proactive; before anything is launched and costs money, it'll tell you how much it is going to cost.
So if you have a budget of $1K and you try to launch a 24xl instance, it'll tell you that your budget will be blown before you've launched the resource. Making it all proactive.

The Infracost Terraform Cloud run task is awesome!

Thanks! Yep, we're partners with HashiCorp and worked on that integration early on :)

Cloud costs can easily balloon out of control, and I bet this is helping companies save money, but this FinOps stuff also seems like something straight out of an HBO Silicon Valley skit.

hahaha love it - it'll keep evolving and get more interesting for sure. I just think back to Dev and Ops, and how it went from a culture to a job title; testing was another one.
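The proactive budget check described in the thread amounts to a simple comparison of a forecasted cost against a budget. A minimal sketch in Python, using the figures from the post above (not real Infracost output):

```python
# Sketch of a proactive budget check: compare a projected monthly cost
# against the current bill and a budget *before* anything is launched.
# Figures are the examples from the thread, not real Infracost output.

def cost_diff(current, projected, budget):
    """Return (percent change, whether the projection blows the budget)."""
    pct = (projected - current) / current * 100
    over = projected > budget
    return pct, over

pct, over = cost_diff(current=1000, projected=1250, budget=1500)
print(f"bill changes by {pct:+.0f}%, over budget: {over}")
# -> bill changes by +25%, over budget: False
```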