Spark / SPARK-41550 Dynamic Allocation on K8S GA / SPARK-40979

Keep removed executor info in decommission state


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.4.0
    • Component/s: Spark Core
    • Labels: None

    Description

      Executors removed due to decommission should be kept in a separate set. To avoid OOM, the size of this set will be limited to 1K or 10K entries.
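
      A minimal sketch of the kind of size-bounded set this could use (illustrative only, not the actual patch; the class and method names here are hypothetical). It builds on java.util.LinkedHashMap, whose removeEldestEntry hook evicts the oldest entry once the cap is exceeded, so memory stays bounded regardless of executor churn:

      {code:scala}
      import java.util.{Collections, LinkedHashMap}

      // Hypothetical sketch: an insertion-ordered set of executor IDs that
      // never grows past maxSize entries.
      class BoundedExecutorSet(maxSize: Int) {
        // LinkedHashMap calls removeEldestEntry after every put; returning
        // true evicts the oldest entry, capping the map at maxSize entries.
        private val backing =
          new LinkedHashMap[String, java.lang.Boolean]() {
            override def removeEldestEntry(
                eldest: java.util.Map.Entry[String, java.lang.Boolean]): Boolean =
              size() > maxSize
          }

        // View the backing map as a Set[String] of executor IDs.
        private val ids = Collections.newSetFromMap(backing)

        def add(executorId: String): Unit = ids.add(executorId)
        def contains(executorId: String): Boolean = ids.contains(executorId)
      }
      {code}

      Evicting in insertion order matches the intent here: only recently removed executors need to be remembered, since a FetchFailed from a long-gone executor is unlikely.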

      A FetchFailed caused by a decommissioned executor falls into one of two categories:

      1. The FetchFailed reaches the DAGScheduler while the executor is still alive, or after it is lost but before the loss info has reached TaskSchedulerImpl. This case is already handled in SPARK-40979.
      2. The FetchFailed is caused by the loss of a decommissioned executor, so the decommission info has already been removed from TaskSchedulerImpl. Keeping such info around for a short period is good enough (see the sketch after this list). Even if we cap the set of removed executors at 10K entries, that is at most about 10 MB of memory (roughly 1 KB per entry). In practice, clusters larger than 10K executors are rare, and the chance that all of those executors are decommissioned and lost at the same time is small.
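
      To make case 2 concrete, here is a hedged sketch of how the bounded set above could be recorded on removal and consulted on the FetchFailed path. All names (DecommissionTracker, onExecutorRemoved, wasDecommissioned) are hypothetical, not the actual TaskSchedulerImpl API:

      {code:scala}
      // Hypothetical wiring, assuming the BoundedExecutorSet sketch above.
      class DecommissionTracker(maxSize: Int = 10000) {
        // Executors that were decommissioned and have since been removed.
        private val removedDecommissioned = new BoundedExecutorSet(maxSize)

        // Record the executor just before its live decommission state
        // is dropped on removal.
        def onExecutorRemoved(executorId: String, wasDecommissioned: Boolean): Unit = {
          if (wasDecommissioned) {
            removedDecommissioned.add(executorId)
          }
        }

        // Consulted when handling a FetchFailed: true if the failed fetch
        // came from an executor lost while decommissioning, even though its
        // live state has already been cleaned up.
        def wasDecommissioned(executorId: String): Boolean =
          removedDecommissioned.contains(executorId)
      }
      {code}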

      Attachments

        Activity


          People

            Assignee: Zhongwei Zhu (warrenzhu25)
            Reporter: Zhongwei Zhu (warrenzhu25)
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:
              Resolved:
