This blog post takes about 20 minutes to read because it includes an extensive hands-on experience section, which is worth keeping for reference. That section is marked with “TL;NR”; feel free to skip it.
The startup company mentioned in the title is our company, ApeCloud. Over the past two years, we have reduced cloud costs by more than 80% by leveraging FinOps tools, processes, and strategies.
Unlike larger enterprises, we didn’t allocate a dedicated team to a so-called “cost reduction campaign”, nor did we invest heavily in FinOps products or hire professional cost optimization vendors. Instead, we adopted an internal approach, forming a SIG (special interest group) from our development team to collaboratively build and iteratively optimize our FinOps processes. Throughout the journey, our primary goal was to minimize any disruption to development efficiency.
What is FinOps, and why do businesses need it?
FinOps (Financial Operations) is a cloud financial management practice aimed at helping businesses optimize their cloud spending while improving business agility and innovation. As cloud services become universal, companies are facing increasing cloud costs. FinOps helps manage and optimize these expenses through cross-departmental collaboration between finance, tech, and business teams.
Core objectives of FinOps
- Visibility: Providing transparency into cloud spending so that different teams understand their usage and associated costs.
- Optimization: Reducing unnecessary costs through improvements in technology, financial management, and processes to increase cost efficiency.
- Collaboration: Bringing finance and tech teams together to develop and implement cost management strategies.
Key principles of FinOps
- Team ownership: Each team is responsible for their cloud usage and associated costs.
- Balancing efficiency and cost: Controlling costs without sacrificing business efficiency or innovation.
- Continuous improvement: Continuously optimizing and adjusting spending as the cloud environment and business needs evolve.
FinOps combines the best practices of technology and financial management, enabling organizations to better predict, control, and optimize their use of cloud resources.
To use a basketball analogy, if a company’s core business is like NBA scoring legend Michael Jordan, then FinOps is like Dennis Rodman, one of the NBA’s greatest rebounders, helping the company stay competitive in the long game.
How to successfully implement FinOps from scratch in a company?
This blog is aimed at those responsible for establishing FinOps from the ground up in their organizations, as well as relevant stakeholders.
Based on ApeCloud’s two years of FinOps experience, it is entirely feasible to transition from a state of “no metrics” and “lack of control” to a state of “full granularity” and “complete oversight”, while achieving the KPI goal of reducing cloud costs by over 50%.
Before reading this, you might already know that FinOps is a comprehensive system. Although there is an overwhelming amount of technical and online content on FinOps, most of these resources focus on isolated aspects and rarely provide a practical, system-wide approach that can be applied directly to frontline operations and drive significant change. So after reading these articles, you might pick up a small piece of FinOps, but your running cloud resources will continue to drain your company’s precious funds, unfazed by your newfound expertise.
This blog aims to address the real challenge: how to successfully implement FinOps from scratch in your organization, helping you quickly become a hero who saves your company’s bills — even if you currently know nothing about FinOps.
Overview of ApeCloud’s cloud usage
Though small in size, ApeCloud is fully equipped. As a startup, ApeCloud’s cloud usage spans almost all mainstream cloud providers in the market, along with a private IDC (Internet Data Center) and an internally built Lab facility.
We leverage cloud-native technologies for workload orchestration, including Kubernetes (K8s), Terraform, and Karpenter.
If your company’s cloud usage follows a similar pattern, you can probably relate to the complexities of multi-cloud management and cost optimization.
ApeCloud’s basic cloud usage principles
As a tech startup, we highly value an engineer-driven culture and efficiency. Therefore, the basic principle of cloud usage is: No approval needed, open access to cloud resources.
This principle greatly reduces cloud management costs while empowering all employees with cloud knowledge. As a result, we quickly developed a broad talent base capable of handling the complexities of multi-cloud usage and management. Leveraging this, we launched a multi-cloud and hybrid cloud database management platform, which allows users to independently select discounted instance plans from public cloud providers, cutting cloud database costs by up to 75%.
Many FinOps discussions fail to consider the true cost of managing cloud resources, often playing a “complexity shifting” game: trimming external fat without addressing internal inefficiencies. By adhering to the principle of “no approval, open access”, we implement FinOps with nearly zero management costs, using technology and simpler, more transparent management methods.
- Technical management methods: For instance, using a unified multi-cloud login solution. This allows all employees to access any cloud provider with a single click, without needing to individually create IAM users for each cloud. It also facilitates easier cloud usage audits and automated cloud resource billing, balancing cost, efficiency, and compliance.
- Simpler, open, and transparent management methods:
- Establishing Cloud Usage Guidelines, which all employees follow when using cloud resources.
- The Guidelines detail preferred cloud usage sequences for cost savings, available credits, discounts, general rules, and specific rules for each cloud provider. Even new employees, without guidance, can navigate cloud usage like FinOps experts and find the most cost-effective options.
For customers interested in these management methods, we can offer free technical guidance. You can join ApeCloud’s Slack to learn more.
How to automate FinOps implementation?
Having principles alone does not guarantee the success of FinOps. FinOps requires continuous effort, and given the complexity of ApeCloud’s cloud usage paradigm, relying solely on each cloud provider’s management platform is no longer sufficient.
To address this, we developed an internal FinOps tool: Apepipe. This tool automates unified FinOps management across multiple clouds and accounts, and it can be deployed on-premises. Currently, Apepipe has the following capabilities:
1. Apepipe Bot
By integrating with chat tools such as Lark and Slack (and potentially Discord and others), Apepipe Bot enables interactive FinOps functionality.
- For instance, it can push FinOps bills and optimization reminders.
- Apepipe Bot also supports ChatOps, allowing you to run all CLI commands, such as viewing or deleting cloud resources, or checking instance prices — all from Lark, Slack, etc., while relaxing on the beach.
- With the integration of AI, managing FinOps through Apepipe Bot becomes even more efficient.
2. Apepipe CLI
Apepipe CLI enables unified querying and management of multi-cloud resources from the command line.
Apepipe CLI also integrates powerful tools, such as the nuke command, which clears out all resources from a cloud account that is no longer needed, in a single action.
3. Apepipe Dashboard
The dashboard provides a high-level view of all cloud resources across accounts, and allows detailed searching of specific resources without having to switch between multiple clouds or accounts.
It offers a no-code customizable dashboard for full control.
4. Apepipe Operators
These operators are deployed within K8s and act as executors for automatic resource optimization based on FinOps rules. The biggest enemy of FinOps is time — while resources are active, cloud bills accumulate every second. These operators continuously run in the background, eliminating wasted cloud resources and optimizing them based on rules without manual intervention. Examples include:
- Automatically cleaning up unused disks;
- Automatically converting AWS GP2 volumes to GP3;
- Automatically optimizing AWS EC2 instance credit settings;
- Automatically shutting down test cloud instances over the weekend;
- Automatically cleaning up unused logging projects.
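The rule-driven idea behind such operators can be illustrated with a minimal sketch. This is not Apepipe's actual code: the `Disk` record, the `plan_actions` function, and the 7-day grace period are all hypothetical names and values chosen for illustration, and a real operator would read its inventory from cloud APIs rather than from a list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Disk:
    """Hypothetical inventory record; not Apepipe's real data model."""
    disk_id: str
    volume_type: str          # e.g. "gp2" or "gp3"
    attached: bool            # True if the disk is mounted on an instance
    last_detached: datetime   # only meaningful while the disk is detached


def plan_actions(disks, now, grace=timedelta(days=7)):
    """Return (disk_id, action) pairs for every disk that matches a rule."""
    actions = []
    for d in disks:
        if not d.attached and now - d.last_detached >= grace:
            # Rule 1: detached for longer than the grace period, so clean up.
            actions.append((d.disk_id, "delete"))
        elif d.volume_type == "gp2":
            # Rule 2: still needed, but on the older volume type.
            actions.append((d.disk_id, "convert-to-gp3"))
    return actions


# Made-up inventory for illustration only.
now = datetime(2024, 6, 1)
inventory = [
    Disk("vol-a", "gp3", False, now - timedelta(days=30)),
    Disk("vol-b", "gp2", True, now),
    Disk("vol-c", "gp3", True, now),
]
plan = plan_actions(inventory, now)
print(plan)
```

Keeping the planning step separate from the step that actually calls the cloud provider makes it easy to review or dry-run the rules before anything is deleted.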
(TL;NR) ApeCloud’s FinOps Experience
Due to space limitations, below are some selected common FinOps practices. More detailed experiences will be compiled and shared later in the e-book FinOps Zero 2 Hero.
1. Cloud server costs
1.1 Save 50%–90% with Spot Instances
For development and testing environments where stability isn’t a priority, Spot instances are recommended to reduce cloud server costs.
GCP’s Spot instances allow you to stop them without deleting them, which is especially useful. On the other hand, AWS Spot instances, when reclaimed by the platform, will also reclaim the system disk attached to the instance, making it impossible to recover any data stored on the disk.
GCP Spot instances cost only about one-third of the price of standard instances.
1.2 Use Reserved Instances for Persistent Needs
When selecting reserved instances, consider the following variables:
- Prepayment: Whether to pay upfront or at the end of the month. Prepaid options generally offer lower prices.
- Flexibility: Whether the instance type can change during the reserved period. The inflexible option is referred to as the “Standard” mode.
- Term: The duration of the reservation, typically 1 year or 3 years. The 3-year option is much cheaper.
- Scope: Whether the reservation applies to a specific region or zone. Zone-specific reservations are usually cheaper.
For well-defined, long-term needs, the most cost-effective option is to choose Prepaid + Inflexible + 3-Year.
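Why the need must be "well-defined and long-term" can be seen from a small break-even calculation. The sketch below is illustrative only: the two hourly prices are hypothetical, not quotes from any provider, and `breakeven_utilization` and `cheaper_option` are names invented for this example.

```python
def breakeven_utilization(on_demand_hourly, reserved_hourly):
    """Fraction of the term an instance must run before reserving is cheaper.

    reserved_hourly is the reservation's total cost spread over every hour
    of the term, whether or not the instance is actually running.
    """
    return reserved_hourly / on_demand_hourly


def cheaper_option(on_demand_hourly, reserved_hourly, expected_utilization):
    """Compare the average hourly cost of both options at a given utilization."""
    on_demand_average = on_demand_hourly * expected_utilization
    return "reserved" if reserved_hourly < on_demand_average else "on-demand"


# Hypothetical prices in USD per hour, for illustration only.
ON_DEMAND = 0.10
RESERVED_3Y_PREPAID = 0.04

print(round(breakeven_utilization(ON_DEMAND, RESERVED_3Y_PREPAID), 2))
print(cheaper_option(ON_DEMAND, RESERVED_3Y_PREPAID, 0.25))  # part-time test box
print(cheaper_option(ON_DEMAND, RESERVED_3Y_PREPAID, 0.90))  # always-on service
```

With these assumed prices, a reservation only pays off if the instance runs more than about 40% of the term, which is why a part-time test machine is better left on demand or on Spot.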
1.3 Share resources whenever possible without interference
1.4 Use scheduled tasks to automatically shut down VMs
Set scheduled tasks to power down VMs when not in use at night and power them back on in the morning. This can also be done on weekends for additional savings.
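As a rough sketch of such a schedule, the function below decides whether a VM should be powered on at a given moment. The 09:00 to 21:00 working window is an assumption made for this example, and the actual start and stop calls to the cloud provider are deliberately left out.

```python
from datetime import datetime

WORK_START_HOUR = 9    # assumed power-on time
WORK_END_HOUR = 21     # assumed power-off time


def should_be_running(moment):
    """True on weekdays between WORK_START_HOUR and WORK_END_HOUR."""
    is_weekday = moment.weekday() < 5          # Monday=0 ... Sunday=6
    return is_weekday and WORK_START_HOUR <= moment.hour < WORK_END_HOUR


def weekly_uptime_fraction():
    """Share of the 168 hours in a week during which the VM stays on."""
    on_hours = 5 * (WORK_END_HOUR - WORK_START_HOUR)
    return on_hours / (7 * 24)


print(should_be_running(datetime(2024, 6, 3, 10)))   # a Monday morning
print(should_be_running(datetime(2024, 6, 8, 10)))   # a Saturday morning
print(f"{weekly_uptime_fraction():.0%} of the week powered on")
```

Under this assumed schedule the VM runs 60 of 168 hours, so roughly two-thirds of the compute hours are no longer billed. Disks and other attached resources usually keep accruing charges while the VM is off.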
1.5 Create machine images for temporarily unused instances
If an instance is not needed for more than a week, or if the block storage costs are high, it’s more cost-effective to create a machine image and delete the instance rather than just shutting it down. You can restore it later when needed using the image.
1.6 Use economical stop mode
Alibaba Cloud offers two stop modes. The standard stop mode continues to incur costs. When a server is not in use, it’s recommended to switch to the “economical stop mode”.
- Under the economical stop mode, you won’t be charged for compute resources (vCPU and memory), fixed public IP, and bandwidth.
- However, you will still be charged for the system disk, data disk, elastic IP, bandwidth (in fixed bandwidth mode), and paid images.
1.7 Availability Zones also affect costs
A tip for selecting availability zones: When choosing between three shorter options and one longer one, pick the longer one; when choosing between three longer options and one shorter one, pick the shorter one; when split evenly, choose Zone B. Another insider tip is to select Zone C.
This trick is particularly effective when using Spot instances. Based on a two-year analysis of AWS, GCP, and other clouds, the probability of Spot instance reclamation is much higher in Zone A than in Zones B, C, and others.
Thus, choosing the right availability zone can lead to additional cost savings.
1.8 Hidden gems in Spot instance reclamation messages
Spot instance reclamation is a probabilistic event. By analyzing the reclamation probabilities of different instance types and availability zones, you can adjust Spot deployments. This allows you to meet your needs with just 10% of the usual cost while maintaining a high level of stability.
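One possible way to run this kind of analysis is to count reclamations per instance type and zone, then prefer the combination with the lowest observed rate. The sketch below assumes a simple log of `(instance_type, zone, was_reclaimed)` tuples; the log format, the function names, and the sample records are all made up for illustration.

```python
from collections import defaultdict


def reclamation_rates(launches):
    """Map (instance_type, zone) to the observed share of reclaimed launches."""
    totals = defaultdict(int)
    reclaimed = defaultdict(int)
    for instance_type, zone, was_reclaimed in launches:
        key = (instance_type, zone)
        totals[key] += 1
        reclaimed[key] += int(was_reclaimed)
    return {key: reclaimed[key] / totals[key] for key in totals}


def safest_placement(launches):
    """Return the (instance_type, zone) pair with the lowest reclamation rate."""
    rates = reclamation_rates(launches)
    return min(rates, key=rates.get)


# Made-up history for illustration only.
history = [
    ("m5.large", "a", True),
    ("m5.large", "a", True),
    ("m5.large", "a", False),
    ("m5.large", "b", False),
    ("m5.large", "b", True),
    ("m5.large", "b", False),
    ("m5.large", "b", False),
]
print(reclamation_rates(history))
print(safest_placement(history))
```

With only a handful of samples the rates are noisy, so in practice a minimum sample size per combination would be worth enforcing before acting on the result.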
2. Public IP costs
Almost all cloud providers now charge separately for public IPs. Therefore, if public access is not necessary, avoid configuring a public IP. An additional benefit of minimizing public IP usage is reducing the risk of malicious traffic attacks from the public internet.
3. Traffic costs
3.1 AWS bills inter-AZ traffic within a region like cross-region traffic
Real case: a project team at one company was unaware that inter-AZ EC2 traffic within the same region is billed at the same rate as cross-region traffic. As a result, they ran up over $10,000 in traffic costs within a few days due to multi-AZ communication.
3.2 Use internal traffic as much as possible
Public traffic is more expensive than internal traffic.
For example, when an Alibaba Cloud ECS instance reads from or writes to an OSS bucket through the public endpoint, the traffic costs US$0.118/GB in the Hong Kong region. Reading an entire 1 TB bucket once would cost about $120.
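The arithmetic behind that figure is easy to reproduce. The sketch below uses the US$0.118/GB price quoted above and assumes 1 TB = 1024 GB; `transfer_cost` is just a helper name for this example.

```python
PUBLIC_PRICE_PER_GB = 0.118   # USD per GB, the Hong Kong figure quoted above
BUCKET_SIZE_GB = 1024         # assuming 1 TB = 1024 GB


def transfer_cost(size_gb, price_per_gb):
    """Cost in USD of moving size_gb over a link billed per GB."""
    return size_gb * price_per_gb


cost = transfer_cost(BUCKET_SIZE_GB, PUBLIC_PRICE_PER_GB)
print(f"${cost:.2f}")   # about $120 for a single full read
```

Every additional full read over the public endpoint adds the same amount again, so repeated test runs multiply the bill.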
4. Object storage costs
4.1 S3: How to properly delete a large S3 bucket when there are too many files to manually clean
After testing on S3, we often need to delete the data. Before deleting the bucket, you’re prompted to remove all the data inside. But what if the file count is too large for manual deletion?
When dealing with many large files, use a lifecycle policy to delete them. Avoid directly deleting files, as this can incur significant traffic costs.
aws s3api put-bucket-lifecycle --bucket %s --lifecycle-configuration '{
  "Rules": [
    {
      "ID": "delete-all",
      "Status": "Enabled",
      "Prefix": "",
      "Expiration": {
        "Days": 1
      }
    }
  ]
}'
4.2 Alibaba Cloud OSS: ZRS is 25% more expensive than LRS
If there are no special requirements for development and testing, it’s better to use LRS (locally redundant storage) instead of ZRS (zone-redundant storage), as ZRS costs 25% more.
4.3 Schedule regular cleanup of unused buckets
Pay-as-you-go object storage can accumulate substantial costs over time. Regularly clean up unused buckets to save on storage fees.
Aside from the storage cost itself, the bigger danger is the “Dark Forest Law” of public cloud: if your S3 bucket name is exposed, anyone can potentially spike your cloud bill.
5. Block storage costs
5.1 Convert block storage to snapshots for stopped instances to save costs
If an instance is stopped, consider converting its block storage into snapshots to reduce ongoing costs.
5.2 Alibaba Cloud block storage: PL1 performance level is twice as expensive as PL0
In Alibaba Cloud’s ACK, the default StorageClass (SC) for ESSD disks is set to PL1, which costs twice as much as PL0. For workloads that are not performance tests, such as development and testing, PL0’s performance is usually sufficient. You can configure a new SC for ESSD in ACK with the parameter performanceLevel: PL0.
6. GPU costs
6.1 Use alternative GPU models
The Nvidia L4 instance is much cheaper than the A100, and for some workloads its capabilities match or even exceed those of the A100. Consider this alternative if your workload doesn’t require high bandwidth.
6.2 Cleverly handling GPU quota limits in GCP
GCP’s billing is project-based, and different projects can be linked to different payment methods. GCP’s quota is related to the billing account. For example, a company’s GCP account, paid by a vendor, had a quota of 16 A100 GPUs. To use this quota and still take advantage of GCP’s startup credits, the following was done:
Create a new project, link it to the vendor’s billing account to receive the 16 A100 GPU quota, and then switch to the startup credit billing account.
Additionally, on-demand and spot quotas are independent of each other. If one type of quota is exhausted, try switching to the other type.
7. Snapshot costs
7.1 Snapshot archiving
For snapshots that need to be retained long-term (over 90 days), archiving them can save 75% of the costs. For more details, refer to AWS snapshot archive documentation.
8. Log costs
8.1 Disable unnecessary log generation
For example, cloud-managed Kubernetes services like EKS and GKE generate logs by default. On any cloud platform, whether it’s AWS CloudWatch or GCP Logs Storage, logging costs can quickly add up. For instance, a company’s GKE project generated 10 TB of error logs, costing $6,877.65.
8.2 Shorten log retention period
The storage cost per GB for logs is much higher than that of object storage. Therefore, reduce the retention period wherever possible. For example, GCP log storage allows you to set the retention period to as short as one day.
9. Container service costs
9.1 Keep EKS K8s version up to date
Ensure your EKS clusters are running a Kubernetes version that is still in standard support. If not, you may incur extra charges for extended support.
10. Tips for using AWS
10.1 CPU credit usage
T3 and T4g instances support CPU bursting, and sustained high CPU usage consumes CPU credits. Once an instance has exhausted its CPU credits, you are charged extra for the additional usage, at 0.342 CNY per vCPU-hour for T3. For typical development and testing, it’s recommended to set the instance’s credit specification to “standard” mode to avoid additional charges.
10.2 Optimize KMS costs
When using Terraform to create EKS, the KMS key created alongside the EKS cluster has a default deletion window of 30 days. You can reduce the cost by 75% during the key’s deletion period by adding the following configuration in Terraform to shorten the window to 7 days:
deletion_window_in_days = 7
11. Tips for using GCP
11.1 Choose us-central1 for the Cheapest VM Pricing
Consult Google’s official pricing documentation to check the most affordable options.
11.2 Best Practices for running cost-effective Kubernetes applications on GKE
To optimize costs, refer to GCP’s guide on cost-effective Kubernetes practices.
11.3 Recommended GCP instance types
We also compiled a spreadsheet of GCP instance types based on the information provided by GCP and marked the recommendations. Ensure that the instance types you use are optimized for cost-effectiveness based on FinOps recommendations.
12. Own your data center and private cloud
To enhance stability and backup capabilities, ApeCloud adopts a multi-cloud architecture, currently supporting AWS, Azure, GCP, and Alibaba Cloud, plus more clouds through a partnership with Sealos. At the same time, we also invest in our own private cloud. Although deploying private cloud machines is more complex than using public cloud services, we see it as a long-term investment that can reduce annual cloud service costs.