HBase / HBASE-21387

Race condition surrounding in-progress snapshot handling in snapshot cache leads to loss of snapshot files

Details

    • Reviewed
      To prevent a race between an in-progress snapshot (performed by TakeSnapshotHandler) and HFileCleaner, which results in data loss, this JIRA introduced mutual exclusion between taking a snapshot and running HFileCleaner. That is, at any given moment, either snapshots can be taken or HFileCleaner can check for unreferenced hfiles, but the two cannot run at the same time.
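The mutual exclusion described above can be sketched with a read-write lock. This is a minimal illustration, not the committed patch: the class `SnapshotCleanerExclusion` and its methods are hypothetical names, and the choice to let the cleaner skip its pass (rather than block) while a snapshot is running is an assumption.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch; class and method names are illustrative, not HBase API. */
final class SnapshotCleanerExclusion {
  // Snapshots share the read lock; the cleaner needs the exclusive write lock.
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

  /** Runs snapshot work under the shared lock, so several snapshots may overlap. */
  void takeSnapshot(Runnable snapshotWork) {
    lock.readLock().lock();
    try {
      snapshotWork.run();
    } finally {
      lock.readLock().unlock();
    }
  }

  /** Runs one cleaner pass only if no snapshot holds the lock; returns false if skipped. */
  boolean tryCleanerPass(Runnable cleanerWork) {
    if (!lock.writeLock().tryLock()) {
      return false; // a snapshot is in progress; assume the cleaner retries later
    }
    try {
      cleanerWork.run();
      return true;
    } finally {
      lock.writeLock().unlock();
    }
  }
}
```

With this shape, a cleaner pass that starts while any snapshot is running does nothing, and a snapshot that starts during a cleaner pass waits until the pass finishes.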

    Description

      A customer recently reported that ExportSnapshot failed with the following error:

      2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] snapshot.SnapshotReferenceUtil: Can't find hfile: 44f6c3c646e84de6a63fe30da4fcb3aa in the real (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) or archive (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) directory for the primary table. 
      

      We found the following in the log:

      2018-10-09 18:54:23,675 DEBUG [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] cleaner.HFileCleaner: Removing: hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa from archive
      

      The root cause is a race condition in the handling of in-progress snapshots between refreshCache() and getUnreferencedFiles().
      refreshCache has two callers: RefreshCacheTask#run and SnapshotHFileCleaner.

      Let's look at the code of refreshCache:

            if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
      

      The intention of this check is to exclude in-progress snapshots.
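As a rough illustration of what that check does, the sketch below filters a list of snapshot directory names. It is not the HBase implementation: `CompletedSnapshotFilter` is a hypothetical class, and the value `".tmp"` used for the constant is an assumption standing in for SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the directory-name filter; not HBase code. */
final class CompletedSnapshotFilter {
  // Assumed value for illustration only; the real constant lives in SnapshotDescriptionUtils.
  static final String SNAPSHOT_TMP_DIR_NAME = ".tmp";

  /** Returns every directory name except the working directory of in-progress snapshots. */
  static List<String> completedSnapshotDirs(List<String> dirNames) {
    List<String> completed = new ArrayList<>();
    for (String name : dirNames) {
      if (!name.equals(SNAPSHOT_TMP_DIR_NAME)) {
        completed.add(name);
      }
    }
    return completed;
  }
}
```

Anything still under the excluded directory is invisible to the cache until the snapshot completes, which is why the in-progress snapshots need a separate check.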
      Suppose that when RefreshCacheTask runs refreshCache, a snapshot is in progress and about to finish.

      When SnapshotHFileCleaner then calls getUnreferencedFiles(), it sees that lastModifiedTime is up to date, so the cleaner proceeds to check the in-progress snapshots. By that time, however, the snapshot has completed, so it is no longer among the in-progress snapshots, and the files it references are deemed unreferenced.

      Here is a timeline given by Josh illustrating the scenario:

      At time T0, we are checking if F1 is referenced.
      At time T1, there is a snapshot S1 in progress that is referencing a file F1. refreshCache() is called, but no completed snapshot references F1.
      At T2, the snapshot S1, which references F1, completes.
      At T3, we check in-progress snapshots and S1 is not included. Thus, F1 is marked as unreferenced even though S1 references it.
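The timeline can be replayed deterministically in a small single-threaded model. This is only a sketch of the interleaving, not HBase code: `SnapshotRaceTimeline` and all of its fields and methods are hypothetical, and the real cache tracks considerably more state.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical single-threaded model of the race; none of these names are HBase API. */
final class SnapshotRaceTimeline {
  // Snapshot name -> files it references, split by snapshot state.
  final Map<String, Set<String>> inProgress = new HashMap<>();
  final Map<String, Set<String>> completed = new HashMap<>();
  // Files referenced by completed snapshots as of the last refreshCache().
  final Set<String> cache = new HashSet<>();

  void startSnapshot(String name, String file) {
    inProgress.computeIfAbsent(name, k -> new HashSet<>()).add(file);
  }

  void completeSnapshot(String name) {
    completed.put(name, inProgress.remove(name));
  }

  void refreshCache() {
    cache.clear();
    completed.values().forEach(cache::addAll);
  }

  boolean referencedByInProgress(String file) {
    return inProgress.values().stream().anyMatch(files -> files.contains(file));
  }

  /** Replays T1..T3 and returns true if F1 is wrongly deemed unreferenced. */
  static boolean replay() {
    SnapshotRaceTimeline m = new SnapshotRaceTimeline();
    m.startSnapshot("S1", "F1");                  // S1 is in progress and references F1
    m.refreshCache();                             // T1: no completed snapshot references F1
    boolean inCache = m.cache.contains("F1");
    m.completeSnapshot("S1");                     // T2: S1 completes between the two checks
    boolean inProgressRef = m.referencedByInProgress("F1"); // T3: S1 is no longer listed
    return !inCache && !inProgressRef;
  }
}
```

In this model `replay()` returns true: the cache was built before S1 completed, and the in-progress check ran after it, so neither check sees the reference to F1.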

      Attachments

        1. 0001-UT.patch
          6 kB
          Zheng Hu
        2. 21387.addendum.txt
          0.9 kB
          Ted Yu
        3. 21387.dbg.txt
          2 kB
          Ted Yu
        4. 21387.v10.txt
          4 kB
          Ted Yu
        5. 21387.v11.txt
          4 kB
          Ted Yu
        6. 21387.v12.txt
          4 kB
          Ted Yu
        7. 21387.v2.txt
          1.0 kB
          Ted Yu
        8. 21387.v3.txt
          3 kB
          Ted Yu
        9. 21387.v6.txt
          8 kB
          Ted Yu
        10. 21387.v7.txt
          8 kB
          Ted Yu
        11. 21387.v8.txt
          9 kB
          Ted Yu
        12. 21387.v9.txt
          4 kB
          Ted Yu
        13. 21387-suggest.txt
          9 kB
          Ted Yu
        14. 21511.v2.txt
          12 kB
          Ted Yu
        15. HBASE-21387.branch-1.2.patch
          41 kB
          Zheng Hu
        16. HBASE-21387.branch-1.3.patch
          37 kB
          Zheng Hu
        17. HBASE-21387.branch-1.patch
          23 kB
          Zheng Hu
        18. HBASE-21387.v13.patch
          13 kB
          Zheng Hu
        19. HBASE-21387.v14.patch
          14 kB
          Zheng Hu
        20. HBASE-21387.v15.patch
          23 kB
          Zheng Hu
        21. HBASE-21387.v16.patch
          22 kB
          Zheng Hu
        22. HBASE-21387.v17.patch
          22 kB
          Zheng Hu
        23. two-pass-cleaner.v4.txt
          22 kB
          Ted Yu
        24. two-pass-cleaner.v6.txt
          32 kB
          Ted Yu
        25. two-pass-cleaner.v9.txt
          37 kB
          Ted Yu

            People

              yuzhihong@gmail.com Ted Yu