Benchmarking Apache Kafka: performance-per-price

Introduction

Our data platform provides managed services for building analytical platforms for large data sets, competing with other market solutions. In some cases, our approach can be more efficient and cost-effective than the alternatives. While we don’t cover every scenario, we are significantly cheaper and much faster in those we do, helping our clients cut infrastructure costs by 30-40% and improve analytics performance by 30 times or more.

To remain competitive, we regularly conduct internal research to identify and improve our strengths, ensuring even better deals for our customers. This article showcases one such study, comparing different environments for our Managed Service for Apache Kafka.

Currently, the DoubleCloud platform supports AWS and GCP as cloud providers. Both offer multiple compute generations and two CPU architectures (x86 with Intel and AMD, and ARM). In this article, we compare these setups using various Java Virtual Machines (JVMs) to evaluate the performance of new versions on newer processors. The ultimate goal is to find the most effective setup and achieve the best price-performance ratio for our managed service for Kafka.

If you want a TL;DR: ARM rocks. Modern expensive architecture does not always mean “better”. Skip ahead to the Results section, or read on for the methodology and setup.

Methodology

We considered running the tests on our own service, but we also wanted to cover environments we don’t support yet: new virtual machines, regions, and even other cloud providers. So we started with a toy project that runs baseline Kafka on different base container images. This way, we can run benchmark tools on specific hardware and measure performance.

We aim to test various configurations to identify the most interesting results. For that, we use the idea of the testing matrix to filter initial findings. We will analyze these findings in-depth using tools like perf and eBPF to refine performance further.

Testing cases

Let’s first describe our testing goals. Our team has a lot of experience with the OpenJDK JVM, but today there are many alternatives from Microsoft, Amazon, and other companies. Amazon Corretto, for instance, includes extra features and patches optimized for AWS. Since most of our customers use AWS, we wanted to include Amazon Corretto in our tests to see how these JVMs perform on that platform.

We picked these versions for the first comparison:

  • OpenJDK 11 (for a retrospective comparison, though it’s outdated)

  • OpenJDK 17 (the currently-in-use JVM)

  • Amazon Corretto 11.0.22-amzn (an alternative retrospective comparison)

  • Amazon Corretto 17.0.10-amzn (an alternative to our current version)

  • Amazon Corretto 21.0.2-amzn (a newer LTS version that should be better)

Once we agreed on the versions, I prepared a few scripts to build Kafka images using Amazon Corretto and OpenJDK.

Image settings

For our benchmarking tests, we changed Kafka settings to focus on specific performance metrics. We wanted to test different combinations of [JVM] x [instance_type] x [architecture] x [cloud_provider], so it was important to minimize the effects of network connectivity and disk performance. We did this by running containers with tmpfs for data storage:

podman run -ti \
 --network=host \
 --mount type=tmpfs,destination=/tmp \
 kfbench:3.6.1-21.0.2-amzn-arm64

Naturally, this setup is not meant for production, but isolating the CPU and memory bottlenecks was necessary. The best way is to remove network and disk influences from the tests. Otherwise, those factors would skew the results.

We used the benchmark tool on the same instance to ensure minimal latency and higher reproducibility. I also tried tests without host-network configurations and with cgroup-isolated virtual networks, but these only added unnecessary latency and increased CPU usage for packet forwarding.

While tmpfs dynamically allocates memory and might cause fragmentation and latency, it was adequate for our test. We could have used ramdisk instead, which allocates memory statically and avoids these issues, but tmpfs was easier to implement and still delivered the insights we were after. For our purposes, it struck the right balance.

Additionally, we applied some extra Kafka settings to evict data from memory more frequently:

############################# Benchmark Options #############################
# https://kafka.apache.org/documentation/#brokerconfigs_log.segment.bytes
# Changed from 1GB to 256MB to rotate files faster
log.segment.bytes = 268435456
# https://kafka.apache.org/documentation/#brokerconfigs_log.retention.bytes
# Changed from -1 (unlimited) to 1GB to evict old segments because we run in tmpfs
log.retention.bytes = 1073741824
# Changed from 5 minutes (300000ms) to delete outdated data faster
log.retention.check.interval.ms=1000
# Evict all data after 15 seconds (log.retention.ms is unset by default, so log.retention.hours=168, i.e. 7 days, applies)
log.retention.ms=15000
# https://kafka.apache.org/documentation/#brokerconfigs_log.segment.delete.delay.ms
# Changed from 60 seconds delay to small value to prevent memory overflows
log.segment.delete.delay.ms = 0

Here’s a summary of the changes:

  • Log retention time is set to 15 seconds to remove data faster, and log retention size is limited to 1 GB to keep tmpfs usage bounded. Log segment size is reduced to 256 MB to rotate files faster

  • The retention check interval is reduced to 1 second so that old data is deleted quickly

  • The segment delete delay is set to 0 so that deleted segments free memory immediately

This configuration is not suitable for production use, but it’s important for our benchmark tests as it reduces the effects of irrelevant factors.
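As a quick sanity check, the overrides above can be parsed and converted into readable units. This is a minimal sketch, not part of our tooling: the helper names are made up for illustration, and the "rough footprint" figure rests on the assumption that size-based retention only removes whole closed segments, so a partition may briefly hold about one segment more than the retention limit.

```python
# Sketch: parse the benchmark overrides and print them in readable units.
# The footprint estimate below is an assumption, not a Kafka guarantee.

BENCHMARK_OVERRIDES = """
log.segment.bytes = 268435456
log.retention.bytes = 1073741824
log.retention.check.interval.ms=1000
log.retention.ms=15000
log.segment.delete.delay.ms = 0
"""


def parse_properties(text):
    """Parse simple `key = value` lines, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = int(value.strip())
    return settings


def mib(size_bytes):
    """Convert bytes to mebibytes."""
    return size_bytes / (1024 * 1024)


settings = parse_properties(BENCHMARK_OVERRIDES)
segment_mib = mib(settings["log.segment.bytes"])
retention_mib = mib(settings["log.retention.bytes"])

# Assumption: retention limit plus one extra segment, per partition.
rough_footprint_mib = retention_mib + segment_mib

print(f"segment size:    {segment_mib:.0f} MiB")
print(f"retention size:  {retention_mib:.0f} MiB")
print(f"retention time:  {settings['log.retention.ms'] / 1000:.0f} s")
print(f"rough footprint: {rough_footprint_mib:.0f} MiB per partition")
```

With the values above this reports 256 MiB segments, a 1024 MiB retention limit, and roughly 1280 MiB per partition, which is the amount of memory the tmpfs mount should be able to spare.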

Instance types

At DoubleCloud, as of the time of writing this blog post, we support these major generations of compute resources:

  • s1 family: m5a instances (with i1 representing m5 with Intel processors)

  • s2 family: m6a instances (with i2 representing m6i with Intel processors)

  • sg1 family: GCP n2-standard instances with AMD Rome processors

For Graviton processors, we support:

  • g1 family: m6g instances (Graviton 2)

  • g2 family: m7g instances (Graviton 3)

Additionally, we tested t2a instances on GCP, which run on Ampere Altra processors, as an ARM alternative to Graviton. We don’t offer these to our customers because GCP supports them in only a limited number of regions, but we included them in our benchmarks to compare performance. These might be a good option if you are in one of the “right” regions.
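To make the testing matrix concrete, here is a minimal sketch that enumerates [JVM] x [instance type] combinations, with the provider and architecture attached to each instance type. The JVM and instance names come from the lists above; the data layout and the `build_matrix` helper are illustrative assumptions, and the sketch does not claim that every cell was actually run.

```python
from itertools import product

# JVM builds listed in the "Testing cases" section.
JVMS = [
    "OpenJDK 11",
    "OpenJDK 17",
    "Amazon Corretto 11.0.22-amzn",
    "Amazon Corretto 17.0.10-amzn",
    "Amazon Corretto 21.0.2-amzn",
]

# (cloud provider, instance type, CPU architecture), as listed above.
INSTANCES = [
    ("AWS", "m5a", "x86"),
    ("AWS", "m5", "x86"),
    ("AWS", "m6a", "x86"),
    ("AWS", "m6i", "x86"),
    ("AWS", "m6g", "arm64"),
    ("AWS", "m7g", "arm64"),
    ("GCP", "n2-standard", "x86"),
    ("GCP", "t2a", "arm64"),
]


def build_matrix():
    """Return one dict per (JVM, instance) cell of the testing matrix."""
    return [
        {"jvm": jvm, "provider": provider, "instance": instance, "arch": arch}
        for jvm, (provider, instance, arch) in product(JVMS, INSTANCES)
    ]


matrix = build_matrix()
print(len(matrix), "candidate test cells")
for cell in matrix[:3]:
    print(cell)
```

Each cell is then run in both produce modes described below, which is why filtering the first round of results matters before digging deeper.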

Benchmark tool

For benchmarking, we developed a lightweight tool based on the franz-go library and one of its examples. This tool saturates Kafka efficiently without becoming the bottleneck itself. While librdkafka is known for its reliability and popularity, we avoided it due to potential issues with cgo. A comparison between librdkafka and franz-go deserves a separate blog post. Join our Slack to let us know if you’d be interested in such content!

Test

Kafka is well known for its scalability: topics can be split into multiple partitions to distribute workloads horizontally across brokers. Because our focus is the performance-to-price ratio, however, we concentrated on single-core performance. Our tests therefore used single-partition topics, so that each test fully loads an individual core.

Each test case covered two produce modes:

  • Synchronous produce: waits for message acknowledgment, ideal for measuring low-latency environments where milliseconds matter, such as real-time applications

  • Asynchronous produce: buffers messages and sends them in batches, typical for Kafka clients that balance near real-time needs with tolerable latency of 10-100 ms

We used 8 KB messages, larger than an average customer case, to fully saturate topic partition threads. Again, feel free to reach out on Slack if you feel like the tests should be improved or if you’re interested in exploring different configurations.
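The difference between the two produce modes can be illustrated with a back-of-the-envelope model. This is a simplified sketch, not our benchmark tool and not a measurement: it assumes a fixed acknowledgment latency and a single request in flight, and the latency and batch size used below are hypothetical numbers chosen only for illustration.

```python
MESSAGE_BYTES = 8 * 1024  # 8 KB messages, as used in the tests


def sync_throughput(ack_latency_s):
    """Messages per second when every message waits for its own ack."""
    return 1.0 / ack_latency_s


def async_throughput(ack_latency_s, batch_size):
    """Messages per second when `batch_size` messages share one ack.

    Assumption: only one batch is in flight at a time.
    """
    return batch_size / ack_latency_s


def to_mib_per_s(messages_per_s):
    """Convert a message rate into MiB/s for 8 KB messages."""
    return messages_per_s * MESSAGE_BYTES / (1024 * 1024)


# Hypothetical inputs: 1 ms ack latency, 100 messages per batch.
ACK_LATENCY_S = 0.001
BATCH_SIZE = 100

sync_rate = sync_throughput(ACK_LATENCY_S)
async_rate = async_throughput(ACK_LATENCY_S, BATCH_SIZE)

print(f"sync:  {sync_rate:.0f} msg/s, {to_mib_per_s(sync_rate):.1f} MiB/s")
print(f"async: {async_rate:.0f} msg/s, {to_mib_per_s(async_rate):.1f} MiB/s")
```

In this model, synchronous produce is bounded by latency alone, while asynchronous produce amortizes the same latency over a whole batch. That is why the two modes stress the broker differently and are reported separately.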

Results

We present a series of plots comparing the test cases across architectures using a synthetic efficiency metric. The metric expresses how many millions of rows we can ingest into the Kafka broker for each cent spent on the instance (a cent of money, not a percentage), which gives a straightforward measure of architectural cost-efficiency.
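One plausible way to compute such a metric is sketched below. The exact formula behind the charts is not spelled out here, so treat this as an assumption: it divides the rows ingested in an hour by the on-demand hourly price in US cents. The instance names, throughputs, and prices are hypothetical and only demonstrate the arithmetic.

```python
def million_rows_per_cent(rows_per_second, price_usd_per_hour):
    """Millions of rows ingested for each cent of instance cost.

    Assumption: cost is the hourly on-demand price, in US dollars.
    """
    rows_per_hour = rows_per_second * 3600
    cents_per_hour = price_usd_per_hour * 100
    return rows_per_hour / cents_per_hour / 1_000_000


# Hypothetical instances: (rows per second, price in USD per hour).
candidates = {
    "instance-a": (100_000, 0.36),
    "instance-b": (120_000, 0.54),
}

for name, (rows_per_second, price) in candidates.items():
    score = million_rows_per_cent(rows_per_second, price)
    print(f"{name}: {score:.1f} million rows per cent")
```

In this made-up example, instance-b is 20% faster but scores 8.0 against 10.0 for instance-a, because its price grows faster than its throughput. A faster instance is therefore not automatically the more cost-effective one.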

It’s important to acknowledge that actual results may vary due to cloud providers’ additional discounts. Whenever possible, the tests were conducted in Frankfurt for both cloud providers (or in the Netherlands in cases where instance-type options were restricted).

Charts

On all charts, we use the conventional instance names, the same ones their providers use. Instances are sorted first by cloud provider (AWS, then GCP) and then by generation, from older to newer.