Apache YuniKorn / YUNIKORN-588

Placeholder pods are not cleaned up timely when the Spark driver fails


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.10
    • Fix Version/s: None
    • Component/s: shim - kubernetes

Description

When a Spark job is gang scheduled and the driver pod fails immediately upon running (e.g. due to an error in the Spark application code), the placeholder pods still hold their reserved resources. They are not terminated until the configured timeout has passed, even though they should be cleaned up the moment the driver fails: at that point we already know that none of the executors will ever start.
Something probably needs to be done at the Spark operator plugin level to trigger placeholder cleanup and release the resources sooner.

Edit: Actually, a fix needs to work without the Spark operator plugin, because the user might not be using it; the Spark job could just as well have been submitted via spark-submit.
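A plugin-independent fix could key off the labels that spark-submit itself puts on driver pods (spark-role=driver, in particular). The sketch below is only an illustration of that idea, not YuniKorn's actual implementation: the applicationId and placeholder label keys used in the selector are assumptions, and a real fix would hook into the shim's existing application lifecycle handling rather than run a standalone watcher.

{code:go}
// Illustrative sketch only: watch for failed Spark driver pods and delete the
// placeholder pods of the same application, without involving the Spark
// operator. Label keys other than Spark's own "spark-role" are assumptions.
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 0)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			pod, ok := newObj.(*corev1.Pod)
			if !ok {
				return
			}
			// spark-submit labels driver pods with spark-role=driver, so this
			// detection works whether or not the Spark operator is in use.
			if pod.Labels["spark-role"] != "driver" || pod.Status.Phase != corev1.PodFailed {
				return
			}
			// "applicationId" is the label used here to group an application's
			// pods; treated as an assumption for this sketch.
			appID := pod.Labels["applicationId"]
			if appID == "" {
				return
			}
			// Delete every placeholder pod of the failed application. The
			// "placeholder" label key is hypothetical, for illustration only.
			selector := fmt.Sprintf("placeholder=true,applicationId=%s", appID)
			err := client.CoreV1().Pods(pod.Namespace).DeleteCollection(
				context.TODO(),
				metav1.DeleteOptions{},
				metav1.ListOptions{LabelSelector: selector},
			)
			if err != nil {
				log.Printf("placeholder cleanup for %s failed: %v", appID, err)
			}
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop // run until the process is stopped
}
{code}

With something along these lines in place, the placeholders would be reclaimed as soon as the driver enters the Failed phase, instead of lingering until the gang-scheduling timeout expires.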


People

    Assignee: Unassigned
    Reporter: Chaoran Yu (yuchaoran2011)
    Votes: 0
    Watchers: 3
