FLINK-35178

Checkpoint CLAIM mode does not fully control snapshot ownership

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.18.0
    • None
    • None

    Description

      With incremental checkpointing enabled, when the job fails or is canceled for some reason and I restart it from -s checkpoint_path with restoreMode CLAIM, the Flink job recovers from the last checkpoint, but afterwards it only discards the previous checkpoint itself.

      I found that this leads to the following two cases:

      1. If the new checkpoint_x meta file does not reference files in the shared directory under the previous jobID:

      the shared and taskowned directories of the previous job are left behind as empty directories, and Flink never deletes them.

      2. If the new checkpoint_x meta file references files in the shared directory under the previous jobID:

      the chk-(x-1) of the previous job is discarded, but state data remains in the shared directory under that job and may persist for a relatively long time. This raises a question: the previous job is no longer running, so it is unclear whether users should delete that state data. Deleting it could cause errors when the job is restarted, because the meta file might reference files that can no longer be found. This is confusing for users.
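
      To tell the two cases apart on disk, here is a minimal sketch in Python. It assumes the previous job's checkpoint directory (checkpoint root plus previous jobID) is reachable as a local path; leftover_state and classify are hypothetical helper names, not part of Flink. It only reports what is left in shared and taskowned. It does not read the meta file, so it cannot say whether a remaining file is still referenced.

```python
from pathlib import Path


def leftover_state(prev_job_dir):
    """List files remaining under shared/ and taskowned/ of a previous job.

    Assumption: prev_job_dir is <checkpoint-dir>/<previous-jobID> on a
    locally mounted filesystem. Hypothetical helper, not a Flink API.
    """
    result = {}
    for name in ("shared", "taskowned"):
        sub = Path(prev_job_dir) / name
        if sub.is_dir():
            # Collect the relative path of every regular file still present.
            result[name] = sorted(
                str(p.relative_to(sub)) for p in sub.rglob("*") if p.is_file()
            )
    return result


def classify(prev_job_dir):
    """Map the directory contents to the two cases described above."""
    state = leftover_state(prev_job_dir)
    if not state:
        return "no leftover directories"
    if any(files for files in state.values()):
        # Case 2: state files remain and may still be referenced.
        return "case 2: state files remain"
    # Case 1: only empty shared/taskowned directories are left.
    return "case 1: empty directories only"
```

      A "case 2" result should not be read as "safe to delete": the new job's checkpoints may still depend on those files.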

       

      A potential solution might be to reuse the previous job's jobID when restoring from -s checkpoint_path, or to add a new parameter that lets users specify the jobID they want to recover from.

       

      Please correct me if there's anything I've misunderstood.

          People

            Assignee: Unassigned
            Reporter: elon_X
            Votes: 0
            Watchers: 3