SPARK-44609: ExecutorPodsAllocator doesn't create new executors if no pod snapshot captured pod creation


    Description

      There is a race condition in ExecutorPodsAllocator when running a Spark application with static allocation on Kubernetes with numExecutors >= 1:

      • The driver requests an executor.
      • exec-1 is created and registers with the driver.
      • exec-1 is moved from newlyCreatedExecutors to schedulerKnownNewlyCreatedExecs.
      • exec-1 is deleted very quickly (~1-30 seconds) after registration.
      • ExecutorPodsWatchSnapshotSource fails to catch the creation of the pod (e.g. the websocket connection was reset, the k8s-apiserver was down, etc.).
      • ExecutorPodsPollingSnapshotSource also fails to catch the creation, because it only polls every 30 seconds and the executor was removed well within that window.
      • exec-1 is therefore never removed from schedulerKnownNewlyCreatedExecs.
      • ExecutorPodsAllocator will never request a new executor because its slot is still occupied by exec-1, since schedulerKnownNewlyCreatedExecs is never cleared (see the sketch after this list).
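      A minimal runnable sketch of the stuck bookkeeping. The field names mirror the real ones in ExecutorPodsAllocator, but the collections and the slot arithmetic are deliberately simplified for illustration; this is not the actual class. Once exec-1 sits in schedulerKnownNewlyCreatedExecs and no snapshot ever confirms (or removes) its pod, the allocator computes zero executors to request even though nothing is alive:

{code:scala}
import scala.collection.mutable

// Simplified model of the allocator's bookkeeping; names mirror the real
// fields, everything else is reduced to the race described above.
object AllocatorSketch {
  val targetExecutors = 1

  // execId -> creation time; requested pods not yet registered
  val newlyCreatedExecutors = mutable.Map.empty[Long, Long]

  // registered with the driver, but never seen by any pod snapshot
  val schedulerKnownNewlyCreatedExecs = mutable.Set.empty[Long]

  // executors whose pods some snapshot has confirmed
  val snapshotKnownExecs = mutable.Set.empty[Long]

  // every tracked executor occupies a slot against the target
  def occupiedSlots: Int =
    newlyCreatedExecutors.size + schedulerKnownNewlyCreatedExecs.size + snapshotKnownExecs.size

  def executorsToRequest: Int = math.max(0, targetExecutors - occupiedSlots)

  def main(args: Array[String]): Unit = {
    // exec-1 is requested, then registers with the driver:
    newlyCreatedExecutors(1L) = System.currentTimeMillis()
    schedulerKnownNewlyCreatedExecs += 1L
    newlyCreatedExecutors -= 1L

    // exec-1's pod is deleted before any snapshot captured its creation.
    // Only a snapshot would clear it from schedulerKnownNewlyCreatedExecs,
    // so the entry stays forever and no replacement is ever requested:
    println(s"executors to request: $executorsToRequest") // 0, though none are alive
  }
}
{code}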

       

      Put up a fix here: https://github.com/apache/spark/pull/42297
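      For illustration only (the PR above contains the actual change, which may take a different approach): one way to break this deadlock is to track a registration timestamp per entry in schedulerKnownNewlyCreatedExecs and expire entries that no snapshot has confirmed within a timeout, freeing their slots. podCreationTimeoutMs, purgeStaleEntries, and snapshotKnownExecs below are hypothetical names, not taken from the PR:

{code:scala}
import scala.collection.mutable

// Hypothetical mitigation sketch; all names here are illustrative.
object TimeoutCleanupSketch {
  val podCreationTimeoutMs = 60000L

  // execId -> time the executor registered with the driver
  val schedulerKnownNewlyCreatedExecs = mutable.Map.empty[Long, Long]

  // Drop entries whose pods no snapshot has confirmed within the timeout,
  // so the allocator sees a free slot again and requests a replacement.
  def purgeStaleEntries(nowMs: Long, snapshotKnownExecs: Set[Long]): Unit = {
    val stale = schedulerKnownNewlyCreatedExecs.collect {
      case (execId, registeredAt)
          if !snapshotKnownExecs.contains(execId) &&
             nowMs - registeredAt >= podCreationTimeoutMs => execId
    }
    schedulerKnownNewlyCreatedExecs --= stale
  }
}
{code}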


          People

            Assignee: Unassigned
            Reporter: Alibi Yeslambek (alibiyeslambek)

