Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-35217

Missing fsync in FileSystemCheckpointStorage

    XMLWordPrintableJSON

Details

    Description

      While running Flink on a system with unstable power supply checkpoints were regularly corrupted in the form of "_metadata" files with a file size of 0 bytes. In all cases the previous checkpoint data had already been deleted, causing progress to be lost completely.

      Further investigation revealed that the "FileSystemCheckpointStorage" doesn't perform "fsync" when writing a new checkpoint to disk. This means the old checkpoint gets removed without making sure that the new one is durably persisted on disk. "strace" on the jobmanager's process confirms this behavior:

      1. The checkpoint chk-60's in-progress metadata is written at "openat"
      2. The checkpoint chk-60's in-progress metadata is atomically renamed at "rename"
      3. The old checkpoint chk-59 is deleted at "unlink"

      For durable persistence an "fsync" call is missing before step 3.
      Full "strace" log:

      [pid 51618] 11:44:30 stat("/opt/flink/statestore/e1c541c4568515e77df32d82727e20dc/chk-60", 0x7fd2ad5fc970) = -1 ENOENT (No such file or directory)
      [pid 51618] 11:44:30 stat("/opt/flink/statestore/e1c541c4568515e77df32d82727e20dc/chk-60", 0x7fd2ad5fca00) = -1 ENOENT (No such file or directory)
      [pid 51618] 11:44:30 stat("/opt/flink/statestore/e1c541c4568515e77df32d82727e20dc", {st_mode=S_IFDIR|0755, st_size=42, ...}) = 0
      [pid 51618] 11:44:30 mkdir("/opt/flink/statestore/e1c541c4568515e77df32d82727e20dc/chk-60", 0777) = 0
      [pid 51618] 11:44:30 stat("/opt/flink/statestore/e1c541c4568515e77df32d82727e20dc/chk-60/_metadata", 0x7fd2ad5fc860) = -1 ENOENT (No such file or directory)
      [pid 51618] 11:44:30 stat("/opt/flink/statestore/e1c541c4568515e77df32d82727e20dc/chk-60/_metadata", 0x7fd2ad5fc740) = -1 ENOENT (No such file or directory)
      [pid 51618] 11:44:30 stat("/opt/flink/statestore/e1c541c4568515e77df32d82727e20dc/chk-60/._metadata.inprogress.bf9518dc-2100-4524-9e67-e42913c2b8e8", 0x7fd2ad5fc7d0) = -1 ENOENT (No such file or directory)
      [pid 51618] 11:44:30 stat("/opt/flink/statestore/e1c541c4568515e77df32d82727e20dc/chk-60", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
      [pid 51618] 11:44:30 stat("/opt/flink/statestore/e1c541c4568515e77df32d82727e20dc/chk-60", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
      [pid 51618] 11:44:30 openat(AT_FDCWD, "/opt/flink/statestore/e1c541c4568515e77df32d82727e20dc/chk-60/._metadata.inprogress.bf9518dc-2100-4524-9e67-e42913c2b8e8", O_WRONLY|O_CREAT|O_EXCL, 0666) = 168
      [pid 51618] 11:44:30 stat("/opt/flink/statestore/e1c541c4568515e77df32d82727e20dc/chk-60/._metadata.inprogress.bf9518dc-2100-4524-9e67-e42913c2b8e8", {st_mode=S_IFREG|0644, st_size=23378, ...}) = 0
      [pid 51618] 11:44:30 rename("/opt/flink/statestore/e1c541c4568515e77df32d82727e20dc/chk-60/._metadata.inprogress.bf9518dc-2100-4524-9e67-e42913c2b8e8", "/opt/flink/statestore/e1c541c4568515e77df32d82727e20dc/chk-60/_metadata") = 0
      [pid 51644] 11:44:30 stat("/opt/flink/statestore/e1c541c4568515e77df32d82727e20dc/chk-59/_metadata", {st_mode=S_IFREG|0644, st_size=23378, ...}) = 0
      [pid 51644] 11:44:30 unlink("/opt/flink/statestore/e1c541c4568515e77df32d82727e20dc/chk-59/_metadata") = 0
      [pid 51644] 11:44:30 stat("/opt/flink/statestore/e1c541c4568515e77df32d82727e20dc/chk-59", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
      [pid 51644] 11:44:30 stat("/opt/flink/statestore/e1c541c4568515e77df32d82727e20dc/chk-59", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
      [pid 51644] 11:44:30 openat(AT_FDCWD, "/opt/flink/statestore/e1c541c4568515e77df32d82727e20dc/chk-59", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 168
      [pid 51644] 11:44:30 newfstatat(168, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0
      [pid 51644] 11:44:30 stat("/opt/flink/statestore/e1c541c4568515e77df32d82727e20dc/chk-59", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
      [pid 51644] 11:44:30 openat(AT_FDCWD, "/opt/flink/statestore/e1c541c4568515e77df32d82727e20dc/chk-59", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 168
      [pid 51644] 11:44:30 newfstatat(168, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0
      [pid 51644] 11:44:30 unlink("/opt/flink/statestore/e1c541c4568515e77df32d82727e20dc/chk-59") = -1 EISDIR (Is a directory)
      [pid 51644] 11:44:30 rmdir("/opt/flink/statestore/e1c541c4568515e77df32d82727e20dc/chk-59") = 0 

      To fix this I'm currently testing the following commit: https://github.com/Planet-X/flink/commit/24196cc897533b654f44e2b612543ff023cdb123

      "strace" can confirm that "fsync" is now called before the previous checkpoint is removed at "unlink":

      [pid 40393] 11:30:17 stat("/opt/flink/statestore/28be342d7d6b7cfd8883799cab99576e/chk-50",  <unfinished ...>
      [pid 40393] 11:30:17 <... stat resumed>0x7fc887efc970) = -1 ENOENT (No such file or directory)
      [pid 40393] 11:30:17 stat("/opt/flink/statestore/28be342d7d6b7cfd8883799cab99576e/chk-50", 0x7fc887efca00) = -1 ENOENT (No such file or directory)
      [pid 40393] 11:30:17 stat("/opt/flink/statestore/28be342d7d6b7cfd8883799cab99576e", {st_mode=S_IFDIR|0755, st_size=42, ...}) = 0
      [pid 40393] 11:30:17 mkdir("/opt/flink/statestore/28be342d7d6b7cfd8883799cab99576e/chk-50", 0777) = 0
      [pid 40393] 11:30:17 stat("/opt/flink/statestore/28be342d7d6b7cfd8883799cab99576e/chk-50/_metadata", 0x7fc887efc870) = -1 ENOENT (No such file or directory)
      [pid 40393] 11:30:17 stat("/opt/flink/statestore/28be342d7d6b7cfd8883799cab99576e/chk-50/_metadata", 0x7fc887efc750) = -1 ENOENT (No such file or directory)
      [pid 40393] 11:30:17 stat("/opt/flink/statestore/28be342d7d6b7cfd8883799cab99576e/chk-50/._metadata.inprogress.24b0ea02-a05c-4297-89ff-08340e8cfa90", 0x7fc887efc7e0) = -1 ENOENT (No such file or directory)
      [pid 40393] 11:30:17 stat("/opt/flink/statestore/28be342d7d6b7cfd8883799cab99576e/chk-50", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
      [pid 40393] 11:30:17 stat("/opt/flink/statestore/28be342d7d6b7cfd8883799cab99576e/chk-50", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
      [pid 40393] 11:30:17 openat(AT_FDCWD, "/opt/flink/statestore/28be342d7d6b7cfd8883799cab99576e/chk-50/._metadata.inprogress.24b0ea02-a05c-4297-89ff-08340e8cfa90", O_WRONLY|O_CREAT|O_EXCL, 0666) = 194
      [pid 40393] 11:30:17 fsync(194)         = 0
      [pid 40393] 11:30:17 stat("/opt/flink/statestore/28be342d7d6b7cfd8883799cab99576e/chk-50/._metadata.inprogress.24b0ea02-a05c-4297-89ff-08340e8cfa90", {st_mode=S_IFREG|0644, st_size=23366, ...}) = 0
      [pid 40393] 11:30:17 rename("/opt/flink/statestore/28be342d7d6b7cfd8883799cab99576e/chk-50/._metadata.inprogress.24b0ea02-a05c-4297-89ff-08340e8cfa90", "/opt/flink/statestore/28be342d7d6b7cfd8883799cab99576e/chk-50/_metadata") = 0
      [pid 39230] 11:30:17 stat("/opt/flink/statestore/28be342d7d6b7cfd8883799cab99576e/chk-49/_metadata", {st_mode=S_IFREG|0644, st_size=23366, ...}) = 0
      [pid 39230] 11:30:17 unlink("/opt/flink/statestore/28be342d7d6b7cfd8883799cab99576e/chk-49/_metadata") = 0
      [pid 39230] 11:30:17 stat("/opt/flink/statestore/28be342d7d6b7cfd8883799cab99576e/chk-49", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
      [pid 39230] 11:30:17 stat("/opt/flink/statestore/28be342d7d6b7cfd8883799cab99576e/chk-49", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
      [pid 39230] 11:30:17 openat(AT_FDCWD, "/opt/flink/statestore/28be342d7d6b7cfd8883799cab99576e/chk-49", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 194
      [pid 39230] 11:30:17 newfstatat(194, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0
      [pid 39230] 11:30:17 stat("/opt/flink/statestore/28be342d7d6b7cfd8883799cab99576e/chk-49", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
      [pid 39230] 11:30:17 openat(AT_FDCWD, "/opt/flink/statestore/28be342d7d6b7cfd8883799cab99576e/chk-49", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 194
      [pid 39230] 11:30:17 newfstatat(194, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0
      [pid 39230] 11:30:17 unlink("/opt/flink/statestore/28be342d7d6b7cfd8883799cab99576e/chk-49") = -1 EISDIR (Is a directory)
      [pid 39230] 11:30:17 rmdir("/opt/flink/statestore/28be342d7d6b7cfd8883799cab99576e/chk-49") = 0 

      Attachments

        Activity

          People

            srichter Stefan Richter
            planet9 Marc Aurel Fritz
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: