SPARK-27348

HeartbeatReceiver doesn't remove lost executors from CoarseGrainedSchedulerBackend

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 3.0.0
    • Component/s: Spark Core
    • Labels: None

    Description

      When a heartbeat timeout occurs in HeartbeatReceiver, the lost executor is not removed from CoarseGrainedSchedulerBackend. If an executor's connection is not shut down gracefully, CoarseGrainedSchedulerBackend may never receive a disconnect event, so it continues to treat the lost executor as alive and may ask TaskScheduler to run tasks on it. Such a task will never finish, and the job will hang forever.
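
      The failure mode described above can be illustrated with a small toy model. This is only a sketch and not Spark code: `ToyBackend`, `ToyHeartbeatReceiver`, the `notify_backend` flag, and the timeout values are all hypothetical names and numbers chosen for illustration. It shows how a backend that is never told about a heartbeat timeout keeps offering the lost executor for scheduling.

```python
class ToyBackend:
    """Hypothetical stand-in for the scheduler backend's view of live executors."""

    def __init__(self):
        self.alive = set()

    def register(self, executor_id):
        self.alive.add(executor_id)

    def remove(self, executor_id):
        self.alive.discard(executor_id)

    def offer(self):
        # Executors the backend would still consider for new tasks.
        return sorted(self.alive)


class ToyHeartbeatReceiver:
    """Hypothetical stand-in for a heartbeat tracker with a fixed timeout."""

    def __init__(self, backend, timeout, notify_backend):
        self.backend = backend
        self.timeout = timeout
        self.notify_backend = notify_backend  # False models the reported behavior
        self.last_seen = {}

    def heartbeat(self, executor_id, now):
        self.last_seen[executor_id] = now

    def expire_dead_hosts(self, now):
        # Collect executors whose last heartbeat is older than the timeout.
        expired = [e for e, t in self.last_seen.items() if now - t > self.timeout]
        for e in expired:
            del self.last_seen[e]
            if self.notify_backend:
                # Assumed remedy: also drop the executor from the backend.
                self.backend.remove(e)
        return expired


def simulate(notify_backend):
    backend = ToyBackend()
    receiver = ToyHeartbeatReceiver(backend, timeout=10, notify_backend=notify_backend)
    for e in ("exec-1", "exec-2"):
        backend.register(e)
        receiver.heartbeat(e, now=0)
    # exec-2 goes silent without a disconnect event; exec-1 keeps reporting.
    receiver.heartbeat("exec-1", now=8)
    receiver.expire_dead_hosts(now=15)
    return backend.offer()


print(simulate(notify_backend=False))  # ['exec-1', 'exec-2']
print(simulate(notify_backend=True))   # ['exec-1']
```

      With `notify_backend=False`, the timed-out `exec-2` is still offered for scheduling, which corresponds to the hang described in this issue. With `notify_backend=True`, the toy backend stops offering it.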

            People

              Assignee: Ngone51 wuyi
              Reporter: zsxwing Shixiong Zhu
              Votes: 1
              Watchers: 5