Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 2.4.0
- None
Description
When a heartbeat timeout occurs, HeartbeatReceiver does not remove the lost executor from CoarseGrainedSchedulerBackend. If an executor's connection is not shut down gracefully, CoarseGrainedSchedulerBackend may never receive a disconnect event. In that case, CoarseGrainedSchedulerBackend still considers the lost executor alive and may ask TaskScheduler to run tasks on it. Such a task never finishes, so the job hangs forever.
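The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Spark's code: `ToyBackend`, `ToyHeartbeatReceiver`, the `notify_backend` switch, and the `run` helper are hypothetical names invented for illustration, and the timestamps are arbitrary integers. With `notify_backend=False` the receiver expires the silent executor but the backend keeps treating it as schedulable, which is the situation reported here.

```python
# Toy model only: these classes mimic the roles named in the description,
# not Spark's real implementation or API.

class ToyBackend:
    """Stands in for CoarseGrainedSchedulerBackend: tracks executors it believes are alive."""

    def __init__(self):
        self.alive = set()

    def register(self, executor_id):
        self.alive.add(executor_id)

    def remove(self, executor_id):
        self.alive.discard(executor_id)

    def schedulable(self):
        # Executors the backend would still offer for running tasks.
        return sorted(self.alive)


class ToyHeartbeatReceiver:
    """Stands in for HeartbeatReceiver: expires executors whose last heartbeat is too old."""

    def __init__(self, backend, timeout, notify_backend):
        self.backend = backend
        self.timeout = timeout
        # Hypothetical switch: False models the reported behavior,
        # True models telling the backend about the lost executor.
        self.notify_backend = notify_backend
        self.last_seen = {}

    def heartbeat(self, executor_id, now):
        self.last_seen[executor_id] = now

    def expire(self, now):
        expired = [e for e, t in self.last_seen.items() if now - t > self.timeout]
        for executor_id in expired:
            del self.last_seen[executor_id]
            if self.notify_backend:
                self.backend.remove(executor_id)
        return expired


def run(notify_backend):
    backend = ToyBackend()
    receiver = ToyHeartbeatReceiver(backend, timeout=10, notify_backend=notify_backend)
    for executor_id in ("exec-1", "exec-2"):
        backend.register(executor_id)
        receiver.heartbeat(executor_id, now=0)
    # exec-2 goes silent and no disconnect event ever reaches the backend.
    receiver.heartbeat("exec-1", now=15)
    receiver.expire(now=20)
    return backend.schedulable()
```

Calling `run(False)` leaves both executors schedulable even though `exec-2` has timed out, while `run(True)` leaves only `exec-1`. In this toy model, any task placed on `exec-2` after the timeout would have nowhere to run.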
Issue Links
- is duplicated by SPARK-30297 Executor heartbeat expired cause app hung up forever (Resolved)
- links to