Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Fixed
Description
When a new resource requirement that raises the resource upper bound (max TaskManagers) is submitted to the AdaptiveScheduler, the job is cancelled as soon as the first TaskManager comes up.
Once the job is cancelled, the scheduler waits for the entire stabilisation period to pass if it cannot acquire all resources, and only then starts with the lower-than-requested parallelism.
The problem is that the wait for resource stabilisation happens after the job is cancelled, which introduces unnecessary downtime for the job when the stabilisation period is long.
We should change the logic to wait for the stabilisation period first, acquiring all possible resources before cancelling the job.
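
To make the difference concrete, here is a rough sketch of the two orderings as a toy downtime model. It is not Flink code: the class `RescaleOrderingSketch`, its method names, and the `restartCost` parameter are all hypothetical, and it assumes the scheduler stops waiting as soon as either all resources arrive or the stabilisation timeout expires. Times are measured from the moment the first TaskManager comes up.

```java
import java.time.Duration;

/** Toy model of job downtime during a scale-up. Hypothetical, not part of Flink. */
final class RescaleOrderingSketch {

    private RescaleOrderingSketch() {}

    /**
     * Current behaviour: the job is cancelled immediately, then the scheduler
     * waits for resources while the job is down.
     *
     * @param stabilisationTimeout  configured resource stabilisation period
     * @param timeUntilAllResources time until every requested resource is available
     *                              (pass a value above the timeout if they never all arrive)
     * @param restartCost           assumed fixed cost of cancelling and redeploying the job
     */
    static Duration downtimeCancelFirst(
            Duration stabilisationTimeout,
            Duration timeUntilAllResources,
            Duration restartCost) {
        // The wait ends when all resources are there or the timeout expires,
        // whichever happens first. The job is not running during this wait.
        Duration wait = timeUntilAllResources.compareTo(stabilisationTimeout) < 0
                ? timeUntilAllResources
                : stabilisationTimeout;
        return wait.plus(restartCost);
    }

    /**
     * Proposed behaviour: the job keeps running while the scheduler waits,
     * so only the restart itself counts as downtime.
     */
    static Duration downtimeWaitFirst(Duration restartCost) {
        return restartCost;
    }
}
```

Under these assumptions, with a 5 minute stabilisation period, a 10 second restart, and resources that never fully arrive, the cancel-first ordering costs 5 minutes 10 seconds of downtime, while the wait-first ordering costs only the 10 second restart.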
Issue Links
- duplicates FLINK-33092 Improve the resource-stabilization-timeout mechanism when rescale a job for Adaptive Scheduler (Open)