Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Versions: 2.3.3, 2.4.4
- Fix Versions: None
Description
After running Spark on k8s for a few days, some kubelets fail to create pods, logging warnings like:
"mkdir /sys/fs/cgroup/memory/kubepods/burstable/podb4a04361-ca89-11e9-a224-6c92bf35392e/1d5aed3effffa20b246ec4f121f778f48c493e3e8678f2afe58a96c15180176e: no space left on device"
The k8s cluster and the affected kubelet nodes otherwise have plenty of free resources.
These pods linger as zombies for days before we manually notice and terminate them. Zombied driver pods are relatively easy to spot, but identifying zombied executor pods is quite inconvenient once Spark applications scale out.
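One way to spot such executor pods, sketched below under the assumption that a watchdog consumes the JSON output of `kubectl get pods -o json`: Spark on k8s labels executor pods with `spark-role=executor`, so pods carrying that label that have been stuck in the `Pending` phase past a threshold are likely zombies. The timeout value is an illustrative assumption, not a Spark setting.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold for flagging a Pending executor pod as a zombie; tune as needed.
PENDING_TIMEOUT = timedelta(minutes=30)

def stuck_executor_pods(pod_list, now=None):
    """Return names of executor pods stuck in Pending longer than PENDING_TIMEOUT.

    `pod_list` is the parsed JSON from `kubectl get pods -o json`.
    """
    now = now or datetime.now(timezone.utc)
    stuck = []
    for pod in pod_list["items"]:
        labels = pod["metadata"].get("labels", {})
        # Spark on k8s sets spark-role=executor on executor pods.
        if labels.get("spark-role") != "executor":
            continue
        if pod["status"].get("phase") != "Pending":
            continue
        created = datetime.strptime(
            pod["metadata"]["creationTimestamp"], "%Y-%m-%dT%H:%M:%S%z")
        if now - created > PENDING_TIMEOUT:
            stuck.append(pod["metadata"]["name"])
    return stuck
```

A cron job could feed this the pod listing and terminate or alert on the returned names, which is roughly what we currently do by hand.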
This is probably related to https://github.com/kubernetes/kubernetes/issues/70324.
Do we need a timeout, retry, or failover mechanism in Spark to handle these kinds of k8s kernel issues?
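The retry/failover idea above could be as simple as the following sketch: wrap the pod-creation call (a stand-in callable here, not Spark's actual API) with bounded retries and exponential backoff, and surface the failure after the last attempt so the application can fail over instead of leaving zombie pods behind. All names and parameters are illustrative.

```python
import time

def create_with_retry(create_pod, max_attempts=5, base_delay=1.0):
    """Call create_pod(), retrying failures with exponential backoff.

    Raises the last error after max_attempts so the caller can fail over.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return create_pod()
        except RuntimeError:  # e.g. kubelet's "no space left on device"
            if attempt == max_attempts:
                raise  # give up; let Spark schedule elsewhere or abort
            time.sleep(base_delay * 2 ** (attempt - 1))
```

A real fix would also need to distinguish transient kubelet errors from permanent ones, but a bounded retry like this would at least prevent pods from zombieing indefinitely.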