Flink / FLINK-34009

Apache Flink: Checkpoint restoration issue in Application Mode deployment


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.18.0
    • None
    • None
    • Environment:

      Flink version: 1.18

      Zookeeper version: 3.7.2

      Env: Custom Flink Docker image (with embedded application class) deployed on Kubernetes (v1.26.11).

    Description

      Hi Team,

      Good Day. Wish you all a happy new year 2024.

      We are running Flink 1.18 on our cluster. The Job Manager is deployed in "Application mode" with HA disabled (high-availability.type: NONE). With this configuration we are able to start multiple jobs of a single application (using env.executeAsync()).

      Note: We have also set up checkpointing to an S3 instance with RETAIN_ON_CANCELLATION mode (plus other required settings).
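
      For reference, the setup described above corresponds roughly to a configuration fragment like the one below. This is only a sketch: the option names are the Flink 1.18 configuration keys, while the S3 path and the checkpoint interval are placeholders and are not taken from the actual deployment.

      ```yaml
      # Sketch only; values marked "placeholder" are assumptions, not the real settings.
      high-availability.type: NONE
      state.checkpoints.dir: s3://<bucket>/<checkpoint-path>   # placeholder
      execution.checkpointing.interval: 60s                    # placeholder
      execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
      ```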

      Let's say we start two jobs of the same application (e.g., Jobidxxx1, jobidxxx2) and they are running in the k8s environment. When we need to perform a Flink minor upgrade, or upgrade our application with minor changes, we stop the Job Manager and Task Manager instances, perform the upgrade, and then start the Job Manager and Task Manager instances again. On startup we expect the jobs to be restored from the last checkpoint, but no job restoration happens when the Job Manager starts. Please let us know whether this is a bug or the expected behavior of Flink in Application mode.

      Additional information: If we enable HA (using Zookeeper) in Application mode, we are able to start only one job (i.e., per-job behavior). In that case, when we perform a Flink minor upgrade or upgrade our application with minor changes, checkpoint restoration works properly when the Job Manager and Task Managers restart.

      It seems that checkpoint restoration and HA are interrelated, but why doesn't checkpoint restoration work when HA is disabled?

      Please let us know if anyone has experienced similar issues or has any suggestions; it would be highly appreciated. Thanks in advance for your assistance.

          People

            Assignee: Unassigned
            Reporter: Vijay (vrangana@in.ibm.com)
            Votes: 0
            Watchers: 2