(Update 2021: this technique is no longer relevant. You are better off with gp3 if you need additional IOPS or throughput.)
Earlier this year, I helped load test a gunicorn application on an EC2 instance. This was a 5th generation EC2 instance running a modern Linux distribution, and gunicorn was writing to log files on a single EBS gp2 volume. To our collective awe, we noticed that it was I/O-bound on logging to files on disk.
Even more surprisingly, changing that EC2 instance to use a RAID0 configuration over two EBS volumes worked really well, and we were able to double the number of gunicorn workers on that EC2 instance until it hit the next limiting factor.
Logging in Python is Synchronous
In both Python 2 and Python 3, the default logging handlers are synchronous and I/O-blocking. This is important to keep in mind. Furthermore, the StreamHandler will also acquire a lock for the duration of the blocking write. This is bad news for latency and concurrency, especially if you have a multi-threaded application that is writing to a single log file on disk via a shared FileHandler.
# logging.StreamHandler.flush: the handler lock is held for the entire flush to disk
def flush(self):
    """
    Flushes the stream.
    """
    self.acquire()
    try:
        if self.stream and hasattr(self.stream, "flush"):
            self.stream.flush()
    finally:
        self.release()

There are a few ways to migrate your application to async logging instead, such as the built-in QueueHandler in Python 3, aiologger, logbook’s ThreadedWrapperHandler, or something more drastic like ZeroMQHandler. Nonetheless, several of these require significant changes to call sites, and QueueHandler is only available in Python 3.2 and later. Not all of us are that lucky.
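For teams already on Python 3, the QueueHandler route can be wired up with handler configuration alone, without touching call sites. Here is a minimal sketch (the log file name and format are placeholders): logger.info only enqueues the record, and a QueueListener thread performs the blocking file write in the background.

import logging
import logging.handlers
import queue

# The request path only puts records on this queue; no disk I/O happens here.
log_queue = queue.Queue(-1)

# The slow, lock-holding FileHandler moves behind a background listener thread.
file_handler = logging.FileHandler("app.log")
file_handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
listener = logging.handlers.QueueListener(log_queue, file_handler)
listener.start()

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.QueueHandler(log_queue))

logger.info("this call returns as soon as the record is enqueued")

listener.stop()  # flush the queue and join the listener thread on shutdown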
gevent Does Not Switch Out On Disk I/O
I realized that, when I first learnt about gevent, I had simply memorized “gevent will switch greenlets when it hits blocking I/O”. I had also foolishly assumed that blocking I/O includes disk I/O (obviously), but never really tested my understanding.
It took me a while to notice that all of the examples in gevent’s documentation used network sockets e.g. querying DNS, or writing to a TCP socket.
It turns out that gevent does not switch greenlets when writing to disk.
When using gunicorn with gevent workers, this means that all of the greenlets within the same process are:
- using the shared FileHandler to append to the same log file,
- acquiring a mutex before writing to disk,
- holding on to the lock while waiting for the flush to complete, and
- pausing while the bytes are sent over the EBS network.
This happens on each call to logger.info, while clients on the other end are waiting for an HTTP response.
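The situation is easy to reproduce. Below is a minimal sketch (assuming gevent is installed, with a shared FileHandler set up via basicConfig): the monkey-patching makes socket I/O cooperative, but every logger.info call still blocks the whole process while it writes to disk.

from gevent import monkey
monkey.patch_all()  # patches sockets, DNS, etc.; plain file I/O stays blocking

import logging
import gevent

logging.basicConfig(filename="app.log", level=logging.INFO)
logger = logging.getLogger("app")

def handle_request(request_id):
    # Each info() call acquires the handler lock, writes, and flushes; no
    # greenlet switch happens while those bytes travel to the EBS volume.
    for step in range(100):
        logger.info("request=%s step=%d", request_id, step)

gevent.joinall([gevent.spawn(handle_request, i) for i in range(10)])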
The Resource Perspective: Interpreting CloudWatch
Before trying to optimize the logger, we wanted to prove definitively on CloudWatch that we were hitting resource limits on the single EBS volume. This endeavor was unproductive. The VolumeQueueLength, VolumeWriteBytes, and VolumeWriteOps metrics were lossy (probably sampled?), and often failed to capture the short-lived I/O spikes we were looking for. Moreover, we could not find any sort of ceiling or plateau in the graphs, which would have demonstrated that we were hitting some EBS cap. We were also not dipping into our BurstBalance.
We later learnt that both EC2 and EBS had multiple layers of limits, which confounded the investigation.
- Each EBS volume is limited by IOPS (3 IOPS per GiB for gp2) and throughput (up to 250 MiB/s), and these limits depend on the I/O size and the volume size.
- Each EC2 instance is limited by IOPS and throughput, and these also depend on the I/O size and the instance size.
For EC2 instances that are smaller than 4xlarge, the throughput limit comes with the following caveat:
These instance types can support maximum performance for 30 minutes at least once every 24 hours. If you have a workload that requires sustained maximum performance for longer than 30 minutes, select an instance type according to baseline performance as shown in the following table.
Nevertheless, we found that Average Write Latency (ms/Operation) increased as we ramped up the load test, which was a sign of resource contention on EBS. There were also enough clues to infer that, on some HTTP requests, gunicorn was spewing enough writes to disk to hit the throughput limits (but not the IOPS limits) on a single EBS volume. These spikes often lasted less than 1 second, but were enough to cause degradation in the response times of the affected requests.
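Average write latency is not a metric that CloudWatch reports directly, so it has to be derived from the per-volume Sum statistics. Here is a minimal sketch of that query using boto3 metric math (the volume ID and time window are placeholders); it computes the formula shown right after the code.

from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
VOLUME_ID = "vol-0123456789abcdef0"  # hypothetical volume ID

def sum_of(metric_name, query_id):
    # Raw per-volume Sum statistic; ReturnData=False because only the ratio matters.
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/EBS",
                "MetricName": metric_name,
                "Dimensions": [{"Name": "VolumeId", "Value": VOLUME_ID}],
            },
            "Period": 300,
            "Stat": "Sum",
        },
        "ReturnData": False,
    }

end = datetime.now(timezone.utc)
response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        sum_of("VolumeTotalWriteTime", "write_time"),
        sum_of("VolumeWriteOps", "write_ops"),
        {
            "Id": "avg_write_latency_ms",
            "Expression": "(write_time / write_ops) * 1000",
            "Label": "Average Write Latency (ms/op)",
        },
    ],
    StartTime=end - timedelta(hours=1),
    EndTime=end,
)

result = next(r for r in response["MetricDataResults"]
              if r["Id"] == "avg_write_latency_ms")
for timestamp, value in zip(result["Timestamps"], result["Values"]):
    print(timestamp, round(value, 2))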
Avg Write Latency = (Sum(VolumeTotalWriteTime) / Sum(VolumeWriteOps)) * 1000

The Workload Perspective: Analyzing the Host
Working on the EC2 instance, we ran the classic tools (iostat, dstat, iotop) and worked to isolate the problems. We already knew that %iowait was high; specifically, we wanted to determine which processes were using disk I/O, and how much write throughput we needed, so that we could size the EBS volumes appropriately.
These tools provided further evidence to support our hypothesis that gunicorn had disk writes that briefly spiked above the EBS limits, and also helped us identify and cut waste, such as disk writes from auxiliary processes (e.g. sendmail), and duplicate logging (e.g. supervisord logging, docker logging).
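As a programmatic complement to iotop, per-process write throughput can also be sampled from the kernel's I/O counters. A rough sketch using psutil (an assumption for illustration, not necessarily the tooling we ran): it samples write_bytes twice and reports the busiest writers.

import time
import psutil

def snapshot():
    # Map pid -> (process name, cumulative bytes written) for visible processes.
    counters = {}
    for proc in psutil.process_iter(["pid", "name"]):
        try:
            counters[proc.info["pid"]] = (proc.info["name"],
                                          proc.io_counters().write_bytes)
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            continue
    return counters

INTERVAL = 5  # seconds between samples
before = snapshot()
time.sleep(INTERVAL)
after = snapshot()

rates = []
for pid, (name, write_bytes) in after.items():
    if pid in before:
        delta = write_bytes - before[pid][1]
        rates.append((delta / INTERVAL, name, pid))

# Top 10 writers by throughput over the sampling window.
for rate, name, pid in sorted(rates, reverse=True)[:10]:
    print(f"{rate / 1024 / 1024:8.2f} MiB/s  {name} (pid {pid})")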
I also attempted to use the new bpf tools, but did not understand bpf enough to interpret and use the results.
RAID0 to the Rescue
The recommended workaround to EBS throughput limits is to use RAID0 with mdadm to stripe writes across multiple EBS volumes. Each EBS volume should be sized to provide the maximum 250 MiB/s throughput. Striping across two volumes doubles the total write throughput available to the application, but it also increases the probability of system failure, since the entire mount fails when any one of the striped EBS volumes fails.
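For reference, the striping itself boils down to a handful of commands. Here is a hypothetical provisioning sketch (device names and mount point are assumptions, and it must run as root) that stripes two EBS volumes into one RAID0 array and mounts it for the log directory.

import subprocess

DEVICES = ["/dev/nvme1n1", "/dev/nvme2n1"]  # hypothetical EBS device names
ARRAY = "/dev/md0"
MOUNT_POINT = "/var/log/app"

def run(*cmd):
    # Echo each command before running it, and fail loudly on errors.
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# RAID0: writes are striped across both volumes; if either volume fails,
# the whole array (and mount) is lost.
run("mdadm", "--create", ARRAY, "--level=0",
    f"--raid-devices={len(DEVICES)}", *DEVICES)
run("mkfs.ext4", ARRAY)
run("mkdir", "-p", MOUNT_POINT)
run("mount", ARRAY, MOUNT_POINT)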
A reasonable alternative is to use ZFS for striping, instead of mdadm/ext4. In fact, ZFS could have reduced the required write throughput even further by enabling compression, i.e. compressing blocks before sending them over to EBS. This requires CPU, and is a good fit for applications that are I/O-bound and are not fully utilizing the CPU resources available on their hosts.
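Under ZFS, the equivalent setup would be a striped pool with lz4 compression turned on. A hypothetical sketch (pool name, devices, and mount point are assumptions; requires root and ZFS installed):

import subprocess

DEVICES = ["/dev/nvme1n1", "/dev/nvme2n1"]  # hypothetical EBS device names
POOL = "logpool"

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Listing both devices as top-level vdevs stripes writes across them.
run("zpool", "create", "-m", "/var/log/app", POOL, *DEVICES)
# Compress blocks before they are written out to EBS, trading CPU for throughput.
run("zfs", "set", "compression=lz4", POOL)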