Spark / SPARK-44609

ExecutorPodsAllocator doesn't create new executors if no pod snapshot captured pod creation


    Description

      There is a race condition in ExecutorPodsAllocator when running a Spark application with static allocation on Kubernetes with numExecutors >= 1:

      • The driver requests an executor.
      • exec-1 is created and registers with the driver.
      • exec-1 is moved from newlyCreatedExecutors to schedulerKnownNewlyCreatedExecs.
      • exec-1 is deleted very quickly (~1-30 seconds) after registration.
      • ExecutorPodsWatchSnapshotSource fails to observe the creation of the pod (e.g. the websocket connection was reset, the k8s apiserver was down, etc.).
      • ExecutorPodsPollingSnapshotSource also misses the creation because it polls only every 30 seconds, while the executor was removed much sooner after creation.
      • exec-1 is therefore never removed from schedulerKnownNewlyCreatedExecs.
      • ExecutorPodsAllocator never requests a replacement executor, because its slot is still occupied by exec-1 due to schedulerKnownNewlyCreatedExecs never being cleared.
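      The bookkeeping behind the steps above can be sketched as follows. The set names (newlyCreatedExecutors, schedulerKnownNewlyCreatedExecs) come from the issue; the surrounding structure (the maxExecutors cap, the helper methods, the object name) is an assumed simplification for illustration, not Spark's actual implementation:

      ```scala
      import scala.collection.mutable

      // Simplified model of the allocator's pending-executor bookkeeping,
      // showing how a missed pod snapshot permanently leaks a slot.
      object AllocatorRaceSketch {
        val maxExecutors = 1
        val newlyCreatedExecutors = mutable.Set.empty[Long]
        val schedulerKnownNewlyCreatedExecs = mutable.Set.empty[Long]

        // Slots the allocator believes are already in flight.
        def pendingCount: Int =
          newlyCreatedExecutors.size + schedulerKnownNewlyCreatedExecs.size

        // A new executor is requested only if a slot is free.
        def maybeRequestExecutor(id: Long): Boolean =
          if (pendingCount < maxExecutors) { newlyCreatedExecutors += id; true }
          else false

        // Executor registers with the driver: moved between the two sets.
        def onExecutorRegistered(id: Long): Unit = {
          newlyCreatedExecutors -= id
          schedulerKnownNewlyCreatedExecs += id
        }

        // Only a pod snapshot that actually saw the pod clears the
        // scheduler-known set; if no snapshot source ever captured the
        // pod's creation, this is never called for that id.
        def onSnapshotSawPod(id: Long): Unit =
          schedulerKnownNewlyCreatedExecs -= id

        def main(args: Array[String]): Unit = {
          assert(maybeRequestExecutor(1L)) // exec-1 requested
          onExecutorRegistered(1L)         // exec-1 registers with the driver
          // exec-1's pod is deleted quickly; no snapshot observed its
          // creation, so onSnapshotSawPod(1L) is never invoked.
          assert(!maybeRequestExecutor(2L)) // slot leaked: no replacement
        }
      }
      ```

      Running the scenario leaves pendingCount stuck at 1 with an executor that no longer exists, so static allocation never recovers the lost slot.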


      A fix is proposed here: https://github.com/apache/spark/pull/42297

            People

              Assignee: Unassigned
              Reporter: Alibi Yeslambek