Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Versions: 2.3.3, 2.4.4
- Fix Versions: None
Description
After running Spark on k8s for a few days, some kubelets fail to create pods, logging warnings like:
"mkdir /sys/fs/cgroup/memory/kubepods/burstable/podb4a04361-ca89-11e9-a224-6c92bf35392e/1d5aed3effffa20b246ec4f121f778f48c493e3e8678f2afe58a96c15180176e: no space left on device"
The k8s cluster and the affected kubelet nodes otherwise have plenty of free resources.
These pods linger as zombies for days before we manually notice and terminate them. Zombied driver pods are relatively easy to spot, but identifying zombied executor pods is quite inconvenient once Spark applications scale out.
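One way to spot such executor pods, sketched below under the assumption that a watchdog consumes the JSON output of `kubectl get pods -o json`: Spark on k8s labels executor pods with `spark-role=executor`, so pods carrying that label that have been stuck in the `Pending` phase past a threshold are likely zombies. The timeout value is an illustrative assumption, not a Spark setting.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold for flagging a Pending executor pod as a zombie; tune as needed.
PENDING_TIMEOUT = timedelta(minutes=30)

def stuck_executor_pods(pod_list, now=None):
    """Return names of executor pods stuck in Pending longer than PENDING_TIMEOUT.

    `pod_list` is the parsed JSON from `kubectl get pods -o json`.
    """
    now = now or datetime.now(timezone.utc)
    stuck = []
    for pod in pod_list["items"]:
        labels = pod["metadata"].get("labels", {})
        # Spark on k8s sets spark-role=executor on executor pods.
        if labels.get("spark-role") != "executor":
            continue
        if pod["status"].get("phase") != "Pending":
            continue
        created = datetime.strptime(
            pod["metadata"]["creationTimestamp"], "%Y-%m-%dT%H:%M:%S%z")
        if now - created > PENDING_TIMEOUT:
            stuck.append(pod["metadata"]["name"])
    return stuck
```

A cron job could feed this the pod listing and terminate or alert on the returned names, which is roughly what we currently do by hand.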
This is probably related to https://github.com/kubernetes/kubernetes/issues/70324.
Do we need a timeout, retry, or failover mechanism in Spark to handle these kinds of k8s kernel issues?
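The retry/failover idea above could be as simple as the following sketch: wrap the pod-creation call (a stand-in callable here, not Spark's actual API) with bounded retries and exponential backoff, and surface the failure after the last attempt so the application can fail over instead of leaving zombie pods behind. All names and parameters are illustrative.

```python
import time

def create_with_retry(create_pod, max_attempts=5, base_delay=1.0):
    """Call create_pod(), retrying failures with exponential backoff.

    Raises the last error after max_attempts so the caller can fail over.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return create_pod()
        except RuntimeError:  # e.g. kubelet's "no space left on device"
            if attempt == max_attempts:
                raise  # give up; let Spark schedule elsewhere or abort
            time.sleep(base_delay * 2 ** (attempt - 1))
```

A real fix would also need to distinguish transient kubelet errors from permanent ones, but a bounded retry like this would at least prevent pods from zombieing indefinitely.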