Flink / FLINK-34318

AdaptiveScheduler resource stabilisation should happen before the job is cancelled

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Runtime / Coordination
    • Labels: None

    Description

      When a new resource requirement that increases the resource upper bound (max TaskManagers) is submitted to the AdaptiveScheduler, the job is cancelled as soon as the first TaskManager comes up.

      Once the job is cancelled, the scheduler waits for the entire stabilisation period to pass if it cannot acquire all resources, and only then restarts the job with the lower-than-requested parallelism.

      The problem is that the wait for resource stabilisation happens after the job is cancelled, which introduces unnecessary downtime for the job if the stabilisation period is long.

      We should change the logic so that the scheduler waits for the stabilisation period first, acquiring all possible resources, and only then cancels the job.
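
      To make the difference between the two orderings concrete, here is a minimal sketch of a toy timing model. It is not Flink's AdaptiveScheduler code: the class, the method names, and the fixed "redeploy" duration are all hypothetical, and the model assumes the job can keep running at its old parallelism while the scheduler waits.

```java
import java.time.Duration;

/** Toy model of the two orderings; all names here are hypothetical, not Flink APIs. */
public class StabilisationSketch {

    /** The scheduler stops waiting once all resources arrived or the stabilisation period ended. */
    public static Duration resourceWait(Duration allResourcesArriveAfter, Duration stabilisationPeriod) {
        return allResourcesArriveAfter.compareTo(stabilisationPeriod) < 0
                ? allResourcesArriveAfter
                : stabilisationPeriod;
    }

    /** Ordering described in the issue: cancel first, then wait for resources, then redeploy. */
    public static Duration downtimeCancelFirst(
            Duration allResourcesArriveAfter, Duration stabilisationPeriod, Duration redeploy) {
        // The job is already cancelled, so the whole wait counts as downtime.
        return resourceWait(allResourcesArriveAfter, stabilisationPeriod).plus(redeploy);
    }

    /** Proposed ordering: wait for resources while the job keeps running, then cancel and redeploy. */
    public static Duration downtimeWaitFirst(
            Duration allResourcesArriveAfter, Duration stabilisationPeriod, Duration redeploy) {
        // The wait overlaps with the running job, so only the redeploy counts as downtime.
        return redeploy;
    }

    public static void main(String[] args) {
        Duration stabilisation = Duration.ofMinutes(5);
        Duration redeploy = Duration.ofSeconds(20);
        // Worst case from the issue: the remaining resources never all arrive in time.
        Duration arrival = Duration.ofMinutes(10);

        System.out.println("cancel first: "
                + downtimeCancelFirst(arrival, stabilisation, redeploy).getSeconds() + " s of downtime");
        System.out.println("wait first:   "
                + downtimeWaitFirst(arrival, stabilisation, redeploy).getSeconds() + " s of downtime");
    }
}
```

      With a 5 minute stabilisation period and an assumed 20 second redeploy, the model gives 320 seconds of downtime when the job is cancelled first and 20 seconds when the wait happens first. The gap grows linearly with the stabilisation period, which is the case the description calls out.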

            People

              Assignee: Unassigned
              Reporter: Gyula Fora (gyfora)
              Votes: 0
              Watchers: 3
