[SPARK-40379] Propagate decommission executor loss reason during onDisconnect in K8s - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 3.4.0
Fix Version/s: 3.4.0
Component/s: Kubernetes, Spark Core
Labels: None

Description

Currently, if an executor has been sent a decommission message and then disconnects from the scheduler, we only disable the executor and depend on the K8s status events to drive the rest of the state transitions. However, the K8s status events can become overwhelmed on large clusters. We should therefore check whether an executor is in a decommissioning state when it disconnects and, if it is, use decommissioning as the loss reason instead of waiting on the K8s status events, so that we have more accurate logging information.
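
The proposed behavior can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the code from the linked pull requests: `ToySchedulerBackend`, `decommission_executor`, `on_disconnected`, and the reason string `"decommissioned"` are hypothetical names chosen for illustration and do not correspond to Spark's actual classes or methods.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ToySchedulerBackend:
    """Hypothetical stand-in for a scheduler backend; not Spark's API."""

    decommissioning: Set[str] = field(default_factory=set)
    disabled: Set[str] = field(default_factory=set)
    loss_reasons: Dict[str, str] = field(default_factory=dict)

    def decommission_executor(self, executor_id: str) -> None:
        # Remember that a decommission message was sent to this executor.
        self.decommissioning.add(executor_id)

    def on_disconnected(self, executor_id: str) -> None:
        if executor_id in self.decommissioning:
            # The executor was already decommissioning, so record that
            # reason immediately instead of waiting for a K8s status event.
            self.decommissioning.discard(executor_id)
            self.loss_reasons[executor_id] = "decommissioned"
        else:
            # Cause unknown: only disable the executor and let a later
            # K8s status event drive the remaining state transitions.
            self.disabled.add(executor_id)


backend = ToySchedulerBackend()
backend.decommission_executor("exec-1")
backend.on_disconnected("exec-1")  # was decommissioning
backend.on_disconnected("exec-2")  # never decommissioned
print(backend.loss_reasons)  # {'exec-1': 'decommissioned'}
print(backend.disabled)      # {'exec-2'}
```

In this sketch the decommissioned executor gets its loss reason at disconnect time, while the other executor is merely disabled, which mirrors the previous behavior for all disconnects.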

Issue Links

links to

[Github] Pull Request #37821 (holdenk)

[Github] Pull Request #38907 (dongjoon-hyun)

People

Assignee: Holden Karau
Reporter: Holden Karau
Votes: 0
Watchers: 3

Dates

Created: 07/Sep/22 16:15
Updated: 16/Dec/22 18:49
Resolved: 05/Dec/22 03:48