SPARK-28949: Kubernetes cgroup leaking leads to Spark pods hanging in Pending status


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.3.3, 2.4.4
    • Fix Version/s: None
    • Component/s: Kubernetes, Spark Core

    Description

      After running Spark on k8s for a few days, some kubelets fail to create pods, logging a warning like

      "mkdir /sys/fs/cgroup/memory/kubepods/burstable/podb4a04361-ca89-11e9-a224-6c92bf35392e/1d5aed3effffa20b246ec4f121f778f48c493e3e8678f2afe58a96c15180176e: no space left on device"

      The k8s cluster and the kubelet node have plenty of free resources.

      These pods stay zombied for days before we manually notice and terminate them. Zombied driver pods are fairly easy to identify, but identifying affected executor pods is quite inconvenient when Spark applications scale out.

      This is probably related to https://github.com/kubernetes/kubernetes/issues/70324.

      Do we need a timeout, retry or failover mechanism for Spark to handle these kinds of k8s kernel issues?
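
      For illustration only, a minimal sketch of what a pending-pod timeout could look like, built on the fabric8 Kubernetes client that the Spark k8s backend already uses. The PendingPodReaper object, the reapStuckExecutors helper and the timeout are hypothetical, not an existing Spark API; the spark-app-selector and spark-role labels are the ones Spark sets on the pods it creates.

        import java.time.{Duration, Instant}

        import scala.collection.JavaConverters._

        import io.fabric8.kubernetes.client.KubernetesClient

        object PendingPodReaper {

          /** Delete executor pods that have sat in Pending longer than `pendingTimeout`,
           *  so the allocator can request replacements that may land on healthy nodes. */
          def reapStuckExecutors(
              client: KubernetesClient,
              namespace: String,
              appId: String,
              pendingTimeout: Duration): Unit = {
            val now = Instant.now()
            val executorPods = client.pods()
              .inNamespace(namespace)
              .withLabel("spark-app-selector", appId)
              .withLabel("spark-role", "executor")
              .list()
              .getItems
              .asScala

            executorPods
              .filter(_.getStatus.getPhase == "Pending")
              .foreach { pod =>
                val created = Instant.parse(pod.getMetadata.getCreationTimestamp)
                if (Duration.between(created, now).compareTo(pendingTimeout) > 0) {
                  // The pod never left Pending (e.g. cgroup exhaustion on the node);
                  // delete it so a fresh executor request can be scheduled elsewhere.
                  client.pods()
                    .inNamespace(namespace)
                    .withName(pod.getMetadata.getName)
                    .delete()
                }
              }
          }
        }

      A real mechanism would also need to cap retries, otherwise a cluster-wide kernel issue could turn pod recreation into an endless loop.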

       

       

      Attachments

        1. describe-driver-pod.txt
          4 kB
          Kent Yao 2
        2. describe-executor-pod.txt
          3 kB
          Kent Yao 2


          People

            Assignee: Unassigned
            Reporter: Kent Yao 2
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated:
              Resolved: