Uncovering performance regressions in the TCP SACKs vulnerability fixes
databricks.com
This is a fun write-up to read, because I experienced exactly the same problem in a similar system dealing with lots of writes to S3: sporadic 15-minute timeouts for no apparent reason, even though the files were only a few megabytes in size.
It led me on a similar journey, diving deep into the stack and right down to diffing the kernel tree to work out what had changed between kernel versions. Eventually I came to the same conclusion, and the problem has only recently been patched in CentOS 6/RHEL 6: https://access.redhat.com/errata/RHSA-2019:2736
Interestingly, after identifying this problem, I also noticed similar behaviour on AWS Lambda shortly after June 20th, with the TCPWQueueTooBig metric spiking and causing Lambda timeouts. It took a few rounds through AWS support (and our account managers) to get them to look at it, but they eventually fixed it.
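For anyone wanting to watch the same counter on their own hosts: on Linux the TcpExt counters are exposed in /proc/net/netstat as pairs of header and value lines. Below is a minimal Python sketch of parsing that format. The helper names and the sample numbers are made up for illustration, and the lookup is case-insensitive because I believe the kernel spells the counter TCPWqueueTooBig (and kernels without the SACK fixes will not have it at all).

```python
from typing import Dict, Optional


def parse_netstat(text: str) -> Dict[str, Dict[str, int]]:
    """Parse /proc/net/netstat-style text into {section: {counter: value}}."""
    lines = text.strip().splitlines()
    stats: Dict[str, Dict[str, int]] = {}
    # The file alternates a header line and a value line sharing a "Prefix:" label.
    for header, values in zip(lines[0::2], lines[1::2]):
        section, names = header.split(":", 1)
        value_section, numbers = values.split(":", 1)
        if section != value_section:
            continue  # skip anything that does not pair up as expected
        stats[section] = dict(zip(names.split(), map(int, numbers.split())))
    return stats


def find_counter(stats: Dict[str, Dict[str, int]], name: str,
                 section: str = "TcpExt") -> Optional[int]:
    """Case-insensitive lookup; returns None if the kernel lacks the counter."""
    wanted = name.lower()
    for key, value in stats.get(section, {}).items():
        if key.lower() == wanted:
            return value
    return None


# Made-up sample in the same layout as the real file (values are illustrative).
SAMPLE = """TcpExt: SyncookiesSent TCPWqueueTooBig
TcpExt: 0 17
IpExt: InNoRoutes InOctets
IpExt: 0 12345
"""

if __name__ == "__main__":
    print(find_counter(parse_netstat(SAMPLE), "TCPWQueueTooBig"))
```

On a real Linux host you would pass in open("/proc/net/netstat").read() instead of SAMPLE and poll it periodically; a value that keeps climbing while requests stall would be the symptom described here.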
I think the common thread between this post and my experience is that we are both using a Java/JVM-based stack. When trying to reproduce the bug for Amazon I could only reproduce it with a simple Java example, whereas my attempt with Golang seemed to run fine, so I'm not really sure why that was.
Maybe I'll write a similar blog post about my findings; at least I learnt a lot from it!
Chris from Databricks here.
Glad you enjoyed the write-up and glad to hear we aren’t alone.
We also had difficulty creating a repro outside of Spark (JVM). I tried with Python sockets without any luck. That said, hitting the issue requires the right mix of dropped packets, socket buffer sizes and MSS. I don’t think there is anything special about the JVM influencing those variables. Now that I know more, maybe I can craft a minimal repro in another language.
A datapoint I didn’t mention in the post is that we had a significantly higher repro rate when talking to S3 through a VPC endpoint. The only difference I could see was that the VPC endpoint connections had an MSS of 1412, while the MSS was slightly higher (1436 IIRC) on non-VPC connections. I have yet to draw conclusions from that.
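If anyone wants to compare MSS values per connection the same way, here is a minimal Python sketch that reads it from a connected socket via the TCP_MAXSEG socket option. It assumes a platform that exposes TCP_MAXSEG (Linux does), the function names are made up, and the loopback connection is only there to make the example self-contained; loopback will report a much larger MSS than the 1412/1436 seen against S3.

```python
import socket


def effective_mss(sock: socket.socket) -> int:
    """Return the MSS the kernel currently reports for a connected TCP socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)


def loopback_mss() -> int:
    """Open a throwaway loopback connection and report its MSS."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
        listener.listen(1)
        with socket.create_connection(listener.getsockname()) as client:
            conn, _ = listener.accept()
            with conn:
                return effective_mss(client)


if __name__ == "__main__":
    print("loopback MSS:", loopback_mss())
```

To compare the VPC endpoint and non-VPC paths, you would call effective_mss() on a socket connected to each endpoint instead of loopback.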
I'd be very interested if you find something specific. Do you have any details regarding the AWS contacts? I'd like to see if I can generalize it for the RHEL6 case and maybe feed it back into engineering.
wmealing @ redhat dawt com.