  Kafka / KAFKA-16054

Sudden 100% CPU on a broker


Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.3.2, 3.6.1
    • Fix Version/s: None
    • Component/s: network
    • Environment: Amazon AWS, c6g.4xlarge arm64 16 vCPUs + 30 GB, Amazon Linux

    Description

      We have observed now for the 3rd time in production the issue where a Kafka broker will suddenly jump to 100% CPU usage and will not recover on its own until manually restarted.

      After a deeper investigation, we now believe that this is an instance of the infamous epoll bug. See:
      https://github.com/netty/netty/issues/327
      https://github.com/netty/netty/pull/565 (original workaround)
      https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L624-L632 (same workaround in the current Netty code)
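      For context, the linked Netty workaround does not fix epoll itself; it detects the spin (select() returning immediately with nothing ready) and rebuilds the Selector, re-registering all channels. Below is a minimal, hypothetical sketch of that idea in plain NIO terms; the class name, structure and wiring are illustrative and are not taken from Netty or from Kafka's SocketServer:

      import java.io.IOException;
      import java.nio.channels.SelectionKey;
      import java.nio.channels.Selector;

      // Hypothetical sketch of the Netty-style guard (not Netty or Kafka code):
      // if select() keeps returning immediately with zero ready keys, assume the
      // epoll spin bug and rebuild the Selector, re-registering every channel.
      public class SpinGuardedSelect {
          private static final int REBUILD_THRESHOLD = 512;  // mirrors Netty's default auto-rebuild threshold

          private Selector selector;
          private int prematureReturns;

          SpinGuardedSelect(Selector selector) {
              this.selector = selector;
          }

          Selector selectOnce(long timeoutMs) throws IOException {
              long start = System.nanoTime();
              int ready = selector.select(timeoutMs);
              long elapsedMs = (System.nanoTime() - start) / 1_000_000;

              if (ready == 0 && elapsedMs < timeoutMs) {
                  // Woke up early with nothing ready: a possible epoll spin.
                  if (++prematureReturns >= REBUILD_THRESHOLD) {
                      rebuildSelector();
                      prematureReturns = 0;
                  }
              } else {
                  prematureReturns = 0;
              }
              return selector;
          }

          // Open a fresh Selector, move every registration over, drop the old one.
          private void rebuildSelector() throws IOException {
              Selector fresh = Selector.open();
              for (SelectionKey key : selector.keys()) {
                  if (!key.isValid()) {
                      continue;
                  }
                  Object attachment = key.attachment();
                  int interestOps = key.interestOps();
                  key.cancel();
                  key.channel().register(fresh, interestOps, attachment);
              }
              selector.close();
              selector = fresh;
          }
      }

      A caller would invoke selectOnce(300) in its poll loop in place of a bare Selector.select(300).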

      The first occurrence in our production environment was on 2023-08-26; the other two were on 2023-12-10 and 2023-12-20.

      Each time, the high CPU issue also results in another issue (misplaced messages) that I asked about on the users mailing list in September; unfortunately, to date I have not received a single reply: https://lists.apache.org/thread/x1thr4r0vbzjzq5sokqgrxqpsbnnd3yy

      We still do not know how this other issue is happening.

      When the high CPU happens, top(1) reports a number of "data-plane-kafka..." threads consuming ~60% user and ~40% system CPU, and the thread dump contains a lot of stack traces like the following one:

      "data-plane-kafka-network-thread-67111914-ListenerName(PLAINTEXT)-PLAINTEXT-10" #76 prio=5 os_prio=0 cpu=346710.78ms elapsed=243315.54s tid=0x0000ffffa12d7690 nid=0x20c runnable [0x0000fffed87fe000]
      java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPoll.wait(java.base@17.0.9/Native Method)
        at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@17.0.9/EPollSelectorImpl.java:118)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@17.0.9/SelectorImpl.java:129)
        - locked <0x00000006c1246410> (a sun.nio.ch.Util$2)
        - locked <0x00000006c1246318> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(java.base@17.0.9/SelectorImpl.java:141)
        at org.apache.kafka.common.network.Selector.select(Selector.java:874)
        at org.apache.kafka.common.network.Selector.poll(Selector.java:465)
        at kafka.network.Processor.poll(SocketServer.scala:1107)
        at kafka.network.Processor.run(SocketServer.scala:1011)
        at java.lang.Thread.run(java.base@17.0.9/Thread.java:840)
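      (A side note for anyone repeating this diagnosis: the nid value in the dump is the Linux thread id in hexadecimal, while top -H prints the same id in decimal, so a trivial conversion, shown below purely for illustration and not part of the original report, ties a hot thread in top to its stack trace.)

      // Hypothetical helper: map the nid from the thread dump (e.g. nid=0x20c)
      // to the decimal thread id shown in the PID column of `top -H`, and back.
      public class NidToTid {
          public static void main(String[] args) {
              System.out.println(Long.parseLong("20c", 16));   // prints 524 -> PID column in top -H
              System.out.println(Long.toHexString(524));       // prints 20c -> matches nid=0x20c
          }
      }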

      At the same time, the Linux kernel repeatedly reports "TCP: out of memory -- consider tuning tcp_mem".

      We are running relatively big machines in production (c6g.4xlarge with 30 GB RAM), and the auto-configured setting is "net.ipv4.tcp_mem = 376608 502145 753216", which corresponds to ~3 GB for the "high" parameter, assuming 4 KB memory pages.
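      To make the unit conversion explicit (tcp_mem is counted in pages):

      753216 pages × 4096 bytes/page = 3,085,172,736 bytes ≈ 2.9 GiB (~3 GB)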

      We were able to reproduce the issue in our test environment (which uses 4x smaller machines) simply by tuning tcp_mem down by a factor of 10: "sudo sysctl -w net.ipv4.tcp_mem='9234 12313 18469'". The strace of one of the busy Kafka threads shows the following syscalls repeating constantly:

      epoll_pwait(15558, {events=EPOLLOUT, data={u32=12286, u64=4681111381628432382}}, 1024, 300, NULL, 8) = 1
      fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0
      fstat(12019, {st_mode=S_IFREG|0644, st_size=414428357, ...}) = 0
      sendfile(12286, 12019, [174899834], 947517) = -1 EAGAIN (Resource temporarily unavailable)
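      In JVM terms this loop is the broker's zero-copy send path: Selector.select() corresponds to epoll_pwait() and FileChannel.transferTo() to sendfile(). The sketch below is a self-contained, hypothetical illustration of that pattern (it is not Kafka's SocketServer code); if epoll reports the socket writable on every call while sendfile keeps failing with EAGAIN (surfaced in Java as transferTo() returning 0), the loop never sleeps and burns CPU, exactly as in the strace above.

      import java.io.IOException;
      import java.net.InetSocketAddress;
      import java.nio.channels.FileChannel;
      import java.nio.channels.SelectionKey;
      import java.nio.channels.Selector;
      import java.nio.channels.ServerSocketChannel;
      import java.nio.channels.SocketChannel;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.nio.file.StandardOpenOption;

      // Illustrative select/transferTo loop (not Kafka code).
      public class SendfileSelectLoop {
          public static void main(String[] args) throws IOException {
              // Loopback connection whose peer never reads, so the send path stalls.
              ServerSocketChannel server =
                      ServerSocketChannel.open().bind(new InetSocketAddress("127.0.0.1", 0));
              SocketChannel peer = SocketChannel.open(server.getLocalAddress());
              SocketChannel sender = server.accept();
              sender.configureBlocking(false);

              // Large temporary file standing in for a log segment.
              Path file = Files.createTempFile("segment", ".log");
              Files.write(file, new byte[64 * 1024 * 1024]);

              try (FileChannel log = FileChannel.open(file, StandardOpenOption.READ);
                   Selector selector = Selector.open()) {
                  sender.register(selector, SelectionKey.OP_WRITE);
                  long position = 0;
                  long size = log.size();

                  // Bounded here for the example; the broker's Processor loops indefinitely.
                  for (int i = 0; i < 100 && position < size; i++) {
                      int ready = selector.select(300);      // epoll_pwait(..., 300, ...)
                      if (ready == 0) {
                          continue;                          // nothing writable: the loop parks in select()
                      }
                      selector.selectedKeys().clear();
                      // transferTo() maps to sendfile() and returns 0 when the kernel answers EAGAIN.
                      long sent = log.transferTo(position, size - position, sender);
                      position += sent;
                      // If epoll reports EPOLLOUT on every call while transferTo() keeps
                      // returning 0, this loop never blocks and spins at 100% CPU.
                  }
                  System.out.println("sent " + position + " of " + size + " bytes");
              }
              peer.close();
              sender.close();
              server.close();
              Files.deleteIfExists(file);
          }
      }

      That matches the strace: each epoll_pwait() reports one ready event, each sendfile() returns EAGAIN, and nothing in between ever blocks.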

      Resetting the "tcp_mem" parameters back to the auto-configured values in the test environment removes the pressure from the broker and it can continue normally without restart.

      We have found a bug report suggesting that the issue may be partially due to a kernel bug: https://bugs.launchpad.net/ubuntu/+source/linux-meta-aws-6.2/+bug/2037335 (they are using kernel version 5.15)

      We have updated our kernel from 6.1.29 to 6.1.66, and that made the issue harder to reproduce, but we can still do it by reducing all of the "tcp_mem" parameters by a factor of 1,000. The JVM behavior is the same under these conditions.

      A similar issue is reported here, affecting Kafka Connect:
      https://issues.apache.org/jira/browse/KAFKA-4739

      Our production Kafka is running version 3.3.2 and our test environment is running 3.6.1. The issue is present on both systems.

      The issue is also reproducible on JDK 11 (as you can see from the stack trace, we are using 17).

People

    Assignee: Unassigned
    Reporter: Oleksandr Shulgin (oshulgin)