Hazelcast on AWS – Kernel Bug
The other day we discovered a bug in older Linux kernel versions and I thought you would like to know about it.
Summary
Use Linux kernel 3.19+ when running Hazelcast on AWS. TCP connections can get stuck with older kernel versions. They appear to be fine, but data are not flowing. This can result in hard-to-explain timeouts.
The Gory Details
Read this: http://johnjianfang.blogspot.bg/2015/08/xen-driver-bug-xennetfront-xennet-skb.html
tl;dr: There is a bug in Xen network driver and AWS happens to use Xen for virtualization. A workaround is to disable (sudo ethtool -K eth0 sg off) the buggy features, but it comes with a performance price and it’s better to use a kernel version with fix = 3.19+. I assume Linux distribution vendors back-ported the fix to older kernel versions, but I have not checked that.
Further Sources:
- https://blog.stathat.com/2014/12/22/fix_ec2_network_issue_skb_rides_the_rocket.html
- https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1317811
- http://www.brendangregg.com/blog/2014-09-11/perf-kernel-line-tracing.html
Kernel Patches: