[FLINK-34318] AdaptiveScheduler resource stabilisation should happen before the job is cancelled - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: Runtime / Coordination
Labels: None

Description

When a new resource requirement that increases the resource upper bound (max TaskManagers) is submitted to the AdaptiveScheduler, the job is cancelled immediately as soon as the first TaskManager comes up.

Once the job is cancelled, if the scheduler cannot acquire all requested resources, it waits for the entire stabilisation period to pass before restarting the job with the lower-than-requested parallelism.

The problem is that the wait for resource stabilisation happens after the job is cancelled, which introduces unnecessary downtime for the job when the stabilisation period is long.

We should change the logic to wait for the stabilisation period first, so that all possible resources are acquired before the job is cancelled.
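The proposed ordering can be illustrated with a minimal sketch. This is not Flink's actual AdaptiveScheduler code: the class `RescaleDecision`, the `Action` enum, and the `decide` method with its parameters are hypothetical names made up for illustration. The sketch assumes the scheduler can compare the current, currently available, and desired parallelism, and that it tracks how long ago the first new resource arrived. The job keeps running while resources are still arriving, and is only restarted once either everything requested is available or the stabilisation period has elapsed.

```java
import java.time.Duration;

/** Hypothetical sketch of "stabilise first, cancel later"; not Flink code. */
public final class RescaleDecision {

    /** Illustrative outcomes: leave the job alone, or restart it with what is available. */
    enum Action { KEEP_RUNNING, RESTART_WITH_AVAILABLE }

    static Action decide(
            int currentParallelism,
            int availableParallelism,
            int desiredParallelism,
            Duration sinceFirstNewResource,
            Duration stabilizationTimeout) {
        // No additional slots have arrived, so a restart would gain nothing.
        if (availableParallelism <= currentParallelism) {
            return Action.KEEP_RUNNING;
        }
        // Everything requested is available: restart right away.
        if (availableParallelism >= desiredParallelism) {
            return Action.RESTART_WITH_AVAILABLE;
        }
        // Only part of the request has arrived. The job keeps running while
        // the stabilisation period is still open, so the wait costs no downtime.
        boolean stabilised = sinceFirstNewResource.compareTo(stabilizationTimeout) >= 0;
        return stabilised ? Action.RESTART_WITH_AVAILABLE : Action.KEEP_RUNNING;
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofSeconds(60);
        // Running at 2, 3 available, 4 desired, 5s into the stabilisation period.
        System.out.println(decide(2, 3, 4, Duration.ofSeconds(5), timeout));
        // Same resources, but the stabilisation period has elapsed.
        System.out.println(decide(2, 3, 4, Duration.ofSeconds(60), timeout));
        // All desired resources arrived early.
        System.out.println(decide(2, 4, 4, Duration.ofSeconds(5), timeout));
    }
}
```

With this ordering, the downtime of a rescale is limited to the restart itself, while the stabilisation wait overlaps with normal processing at the old parallelism.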

Issue Links

duplicates

FLINK-33092 Improve the resource-stabilization-timeout mechanism when rescale a job for Adaptive Scheduler

Open

People

Assignee: Unassigned

Reporter: Gyula Fora

Votes: 0

Watchers: 3

Dates

Created: 31/Jan/24 09:48

Updated: 31/Jan/24 10:06

Resolved: 31/Jan/24 10:06