At loveholidays, we rate the importance of observability rather highly. Having deeper insight into the performance and behaviour of our apps leads to greater reliability and a better understanding of our systems, which is crucial for any tech environment that progresses rapidly and strives for excellence. It is hard to overestimate the usefulness of metrics. The two main challenges in the context of this post are how to store them and how to query them. We believe that historical data has a lot of potential for comparative analysis, so our goal is to collect and store as much information as possible while being mindful of cost, efficiency, and effort.
Why Thanos is not the way forward for us
Prometheus and Thanos have served us diligently since 2019 (which in a constantly changing technological environment seems like forever). Currently, our production Thanos bucket stores 23 TB of metrics. So the first obvious question is why this change was needed. Why can’t we just continue using Thanos?
Our dissatisfaction had been growing slowly but steadily, and countless days of engineering time were spent on scaling Thanos. Some of the problems we experienced were:
- Query timeouts
- High resource usage (27 CPUs and 270 GB of RAM), yet queries still felt sluggish
- Thanos lacked the performance metrics that would help us optimise it
- The Thanos compactor repeatedly ran out of disk space
On top of that, we had an incident in which we lost 6 weeks of historical observability data. Due to a misconfiguration, Thanos stopped sending metrics to the GCS bucket. However, we had a generous 28-day buffer in Prometheus, so our regular dashboards kept working fine. One day a historical query revealed a frustratingly long gap with no data at all. As you can imagine, that was highly unsatisfactory. While we are guilty of the misconfiguration, the two-layered setup of Thanos and Prometheus significantly delayed its discovery.
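In hindsight, an alert on the sidecar's upload activity would have surfaced the gap within hours rather than weeks. Below is a minimal sketch of such a rule, assuming the Thanos sidecar's shipper metrics (such as thanos_shipper_uploads_total) are scraped; the window, threshold, and labels are illustrative rather than what we run:

groups:
  - name: thanos-shipper
    rules:
      - alert: ThanosSidecarNotUploading
        # Blocks are normally shipped every couple of hours, so no successful
        # uploads over a longer window suggests the pipeline to GCS has stalled.
        expr: increase(thanos_shipper_uploads_total[6h]) == 0
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: Thanos sidecar has not uploaded any blocks to object storage recently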
So, when we started onboarding our apps to Linkerd, it became apparent that with a new high-volume source of metrics, our Thanos was like a bowl of ice cream on a hot sunny day: at substantial risk of melting. We have 5 million active series: 3 million of our core metrics are in Thanos and 2 million are coming from Linkerd. Having already invested significant effort into making our Thanos implementation stable at the current scale, and with no further ideas on how else to optimise it, we decided to investigate alternatives.
Additionally, we longed to increase the sampling rate from 30-second to 10-second intervals. We knew that Thanos would require considerable scaling effort to handle a 66% increase in active series scraped three times more frequently. We would also risk observability outages while figuring out the scaling. So we decided it would be best to ingest Linkerd metrics into a new metrics store, which would protect core production observability while allowing us to experiment with the latest tooling. We have been using Grafana Labs tools for quite some time: Grafana, Loki, and Tempo are all deeply integrated into our observability stack. The most prominent candidates for Thanos's replacement were VictoriaMetrics and, naturally, Grafana Mimir.
Other alternatives: why not VictoriaMetrics
First, I would like to state that VictoriaMetrics (VM) is awesome and a lot of people praise it. VM promises lower CPU, memory, and storage (bytes per sample) usage as well as faster P99 query latency. There is a great article comparing VM and Mimir that I can recommend (vendor benchmark warning). We estimated the cost of the infrastructure used in the VM benchmark to be below $1,300/month. At our scale, the delta in resource usage cost is not large enough to sway our decision.
VM doesn't support remote object storage and is designed to use a local filesystem for the sake of query performance (only snapshots can be sent to backend storage), whereas Mimir stores data locally for 2 hours and then uploads it to object storage. Our goal is to offload all state from GKE, as we believe it will greatly simplify our operations. We weren't particularly keen on managing yet another bunch of multi-terabyte disks.
But even with Mimir being more resource-hungry, our analysis shows that, money-wise, regional persistent disks (and we want regional ones; let's not forget about high availability) will cost considerably more than GCS. We have modelled metrics costs for the next 3 years using the current 1 TB per month of ingestion with 5% monthly growth. According to our calculations, the potential average monthly storage cost is $825 for GCS, while persistent disks would cost us $3,446. Even in the worst case, accounting for nuances such as VM needing 4x less storage space (per the aforementioned article, although bear in mind that its benchmarking was based on Mimir's 24h block compaction range, while 12h and 2h are also available), Mimir on GCS would be cost-neutral compared to VM in terms of storage.
So, taking into account that we are more familiar with Grafana Labs tools (which share a lot of architectural similarities and keep our observability stack more uniform), and thus following one of our engineering principles of investing in simplicity, we decided to focus on Mimir.
How Mimir runs in our clusters
Following our approach with Loki and Tempo, we deployed Mimir in microservices mode to our development, staging, and production GKE clusters using the mimir-distributed Helm chart. It was a straightforward and relatively painless process (or, perhaps, it only feels that way retrospectively, who knows). It did take us a bit of time, though, to adjust the number of replicas for some components and various configuration values to our load. I will not delve into the Mimir architecture and each component's purpose here, as this is conveniently covered in the official Mimir documentation. However, I will share some specifics of our environment and how Mimir fits into it.
Storage
We use a GCS bucket as an object storage backend which allows us to store our metrics indefinitely and cost-effectively. The bucket gets approximately 1 TB of metrics every month.
common:
  storage:
    backend: gcs
    gcs:
      bucket_name: the-greatest-bucket-in-the-world
blocks_storage:
  backend: gcs
  storage_prefix: blocks

Networking
We run all Mimir components in one zone as otherwise there is a lot of cross-zonal networking (10+ TB/day). However, to make sure that we can tolerate zonal outages, we are using regional disks. We allow stateless components to run on spot instances and haven’t noticed any issues so far.
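As a rough illustration of how this is expressed, the mimir-distributed chart lets you set scheduling constraints per component; the sketch below pins the ingesters to a single zone and lets the distributors tolerate GKE spot nodes. The zone value is purely illustrative, and the exact keys may vary between chart versions:

ingester:
  nodeSelector:
    topology.kubernetes.io/zone: europe-west1-b   # illustrative zone, not ours
distributor:
  tolerations:
    - key: cloud.google.com/gke-spot              # standard GKE spot taint
      operator: Equal
      value: "true"
      effect: NoSchedule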
Scaling
We have modified configurations for:
- Distributor — 8 replicas. Do note that distributor needs very little memory to run but it consumes a lot of it on start-up (we had OOMs when the memory limit was based on a steady-state consumption)
- Ingester — 16 replicas; combined usage of around 90 Gi of memory and 10 CPUs.
- Querier — 6 replicas; very quiet when no active querying but we’ve seen spikes up to a whopping 80 Gi of memory when in use.
- Nginx — 3 replicas for high availability.
It might sound as if Mimir is quite resource-hungry, yet we provision Thanos with 10% more memory despite it handling only half the samples per second.
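In the mimir-distributed values file, these choices boil down to per-component replicas and resources overrides. A trimmed sketch (the exact keys depend on the chart version, and the requests only roughly reflect the combined figures above rather than our exact settings):

distributor:
  replicas: 8
ingester:
  replicas: 16
  resources:
    requests:
      cpu: 600m      # roughly 10 CPUs combined across 16 replicas
      memory: 6Gi    # roughly 90 Gi combined across 16 replicas
querier:
  replicas: 6
nginx:
  replicas: 3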
With our production load, some default limits had to be raised dramatically to avoid query and ingestion errors.
limits:
  max_label_names_per_series: 120
  max_global_series_per_user: 12000000
  ingestion_rate: 400000
  ingestion_burst_size: 8000000

Remote Write
Mimir works with Prometheus using Remote Write APIs. Bear in mind that Remote Write doubles Prometheus memory usage when enabled.
remoteWrite:
  - url: http://mimir-nginx.mimir.svc/api/v1/push
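If the memory overhead of remote write becomes a concern, the Prometheus Operator also allows tuning the remote-write queue per endpoint. A sketch with purely illustrative values (placeholders, not our settings):

remoteWrite:
  - url: http://mimir-nginx.mimir.svc/api/v1/push
    queueConfig:
      maxShards: 100            # caps parallelism, and with it memory usage
      capacity: 10000           # samples buffered per shard
      maxSamplesPerSend: 2000   # larger batches mean fewer, bigger requests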
Dashboards
The Mimir mixin dashboards are a great help for gaining visibility into how your Mimir installation is running. They require quite a bit of tuning if Helm is your deployment method for Mimir, but the effort is worth it.
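One way to load the compiled mixin dashboards, assuming your Grafana runs the dashboard sidecar (this illustrates the approach rather than our exact setup; the ConfigMap name and dashboard file are placeholders):

apiVersion: v1
kind: ConfigMap
metadata:
  name: mimir-mixin-dashboards
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # default label watched by the Grafana dashboard sidecar
data:
  mimir-writes.json: |
    { ...compiled dashboard JSON from the mixin... }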
Double trouble
Our journey with Mimir wasn't all unicorns and rainbows. Once Mimir was up and running, we noticed a mysterious 2x increase in metrics. It turns out Mimir doesn't deduplicate by default. As we run highly available Prometheus (2 replicas), we had been getting duplicates of all the metrics, and that negatively affected our Linkerd dashboards. We also noticed a 45-second delay in data consistency between queries against the Linkerd Prometheus and Mimir; that is, in fact, how we first realised something wasn't right.
Grafana Labs has a useful guide that explains how to configure Grafana Mimir High Availability Deduplication. But we stumbled upon several obstacles along the way, so I would like to share some practical insights.
etcd
# our configuration for Mimir which enables the HA tracker for the distributor
mimir:
  structuredConfig:
    distributor:
      ha_tracker:
        enable_ha_tracker: true
        kvstore:
          store: etcd
          etcd:
            endpoints:
              - etcd.mimir.svc.cluster.local:2379
            username: super
            password: secret

For HA deduplication to work, the Mimir distributor's HA tracker needs access to a key-value store, which can be either Consul or etcd. Thanks to the KV store, the distributor knows which Prometheus replica is currently elected as the leader. We deployed etcd through the Bitnami Helm chart and … it didn't work. The issue that made us scratch our heads manifested itself as constant, annoying errors in the etcd logs.
Those errors were “invalid auth token”, “retrying of unary invoker failed” and “failed to parse a JWT token”. Eventually, we made them go away with a ridiculously high TTL for the etcd token. Perhaps not the most graceful solution but effective nevertheless.
# token config for our etcd
token:
  ttl: 10000h

Replica labels
For Mimir to deduplicate metrics, it needs to identify which Prometheus instance a particular time series arrived from. This can be achieved with the external_labels configuration in Prometheus. We are using Kube Prometheus, so the prometheus_replica label is enabled by default, and we add our custom lh_cluster label:
global:
  external_labels:
    lh_cluster: web-gb
    prometheus_replica: prometheus-k8s-0

In our case, Mimir deduplication uses the value of the prometheus_replica label to decide which of the two copies of a time series to keep:
accept_ha_samples: true
ha_cluster_label: lh_cluster
ha_replica_label: prometheus_replica

Once we enabled deduplication, the query lag disappeared, and we halved our write requests and memory usage.
Why we love it
Performance
We no longer hear reports of timed-out queries. We measured the difference that Mimir makes with an inefficient query like count({job="linkerd-proxy"}) and got results 4 times faster: it takes 27.2s to run it with Prometheus, but only 6.3s with Mimir.
Scalability
When running without deduplication we were ingesting twice the normal amount of data. This means that with minor tuning, we’ll have a system that will scale with growing demands. After all, this is what powers Grafana Cloud behind the scenes. We are now ingesting 300k samples per second.
Cost efficiency
The compute (CPU, memory) cost of our Mimir cluster is in the region of $1,300/month, only marginally higher than our Thanos setup, which does not perform as efficiently. Over the next 3 years we anticipate an average monthly storage cost of $825, though it is worth noting that we forecast storing 100 TB of metrics by the end of that period. At the moment, with under 1.5 TB of data in GCS, our monthly storage cost is only $30. For comparison, ingesting our 800 billion monthly samples into Managed Service for Prometheus would cost us $72,000/month with 12 months of retention.
We could be more efficient with storage by being selective about the metrics we capture and store, yet we choose not to, as we would risk losing valuable insights.
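For reference, if we ever changed our minds, being selective would most likely mean dropping series before they leave Prometheus. A sketch using the operator's writeRelabelConfigs; the metric pattern and job name are made up for illustration:

remoteWrite:
  - url: http://mimir-nginx.mimir.svc/api/v1/push
    writeRelabelConfigs:
      # Example only: drop histogram buckets from a hypothetical noisy job
      - sourceLabels: [__name__, job]
        regex: ".*_bucket;some-noisy-job"
        action: drop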
Confidence
In the past, our team never felt that we had mastered our monitoring: no matter how much we learned about operating Thanos, it kept occasionally failing us in novel ways. With Mimir, we feel in control and understand its operational limits much better. The aforementioned dashboards, the documentation, the community, and Mimir's own metrics all help a great deal with that.
What’s next?
Thanos deprecation
Thanos has collected 23 TB of metrics since 2019. We are still writing to it, meaning some of our data is persisted twice, even though it isn't queried. We plan to stop writing to Thanos by removing the Thanos sidecars. Next, we will update our Grafana dashboards to use only Mimir. Thanos's read path will remain available for historical queries. We won't be migrating data immediately, as the process seems to be lacking in tooling and documentation; we will keep an eye on any new developments in migration tooling.
Autoscaling
At the moment we are running a static number of replicas for Mimir components. This is at odds with our own applications, which often scale from 3 to 100 pods daily. The Mimir documentation mentions autoscaling using jsonnet and KEDA (something we already use for our business applications). However, users of the mimir-distributed Helm chart are currently out of luck, as per this GitHub issue. Once implemented, autoscaling will likely halve the cost of Mimir, as our system load differs significantly between day and night.
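To give a flavour of what we are waiting for, a KEDA ScaledObject driving the queriers off a queue-length metric might look roughly like the sketch below. This is purely illustrative: the Helm chart does not wire any of this up today, and the target name, metric, and threshold are our assumptions rather than a supported configuration.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: mimir-querier
  namespace: mimir
spec:
  scaleTargetRef:
    name: mimir-querier          # querier Deployment created by the Helm chart
  minReplicaCount: 2
  maxReplicaCount: 12
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://mimir-nginx.mimir.svc/prometheus
        # Scale out when queries start queueing up; the metric and
        # threshold here are assumptions for illustration purposes.
        query: sum(cortex_query_scheduler_queue_length{namespace="mimir"})
        threshold: "5"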
LGTM
The Old Norse meaning of Mimir is "rememberer" or "the wise one". According to Britannica, Mimir is the wisest of the gods of the Aesir tribe in Norse mythology, though he had the unfortunate destiny of being decapitated. The story goes on with Odin (the mightiest of them all) preserving the head and gaining wisdom from it. So, as Odin once did (though we, thankfully, don't have to deal with a decapitated head), we too are getting wiser with all the metrics that flow through Grafana Mimir in our system. With Mimir, we have found a horizontally scalable, highly available, and powerful solution capable of withstanding not only our regular load but also the additional streams of metrics coming from Linkerd. We now have our M in the LGTM stack too.
22/02/2022 Update: We previously stated that we had tuned the replication factor to 2 (instead of the default 3). Don't repeat our mistake: this can lead to an ingestion outage and, subsequently, to metrics being lost. Check out this GitHub issue to learn more.
If you share our passion for observability and enjoy making high-load services more reliable, we are hiring Site Reliability Engineers.