HBase / HBASE-27579

CatalogJanitor can cause data loss due to errors during cleanMergeRegion

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.6.0, 3.0.0-alpha-4, 2.4.16, 2.5.3
    • Component/s: None
    • Labels: None

    Description

      In CatalogJanitor.cleanMergeRegion, there is the following check:

      HRegionFileSystem regionFs = null;
      try {
        regionFs = HRegionFileSystem.openRegionFromFileSystem(this.services.getConfiguration(), fs,
          tabledir, mergedRegion, true);
      } catch (IOException e) {
        // Any IOException lands here, not only "region not found", and leaves regionFs null.
        LOG.warn("Merged region does not exist: " + mergedRegion.getEncodedName());
      }

      if (regionFs == null || !regionFs.hasReferences(htd)) {
        // ... do the cleanup ...
      }

       

      I think the assumption here is that an IOException is only thrown when the region doesn't exist. That assumption does not hold. We had a very poorly timed NameNode failover during a CatalogJanitor run, shortly after a merge. The failover caused the openRegionFromFileSystem call to fail, which was caught and logged as:

      WARN org.apache.hadoop.hbase.master.janitor.CatalogJanitor: Merged region does not exist: 32c71224852c5a4b94a3ba271b4fcb15 

      This region did in fact exist and had not fully compacted, so there were still some lingering reference files.

      The cleanup process moves the parent regions to the archive directory, but the default TTL for files in the archive directory is only 5 minutes. Once that TTL expires the files are deleted and the data becomes unrecoverable.

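      This resulted in FileNotFoundExceptions whenever anything tried to read or open the merged region, because its reference files pointed at parent files that no longer existed. Our only course of action was to move the lingering reference files aside, which means the data they referenced is lost.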
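
The distinction the check above misses can be sketched in plain Java. This is only an illustration of the idea, not the patch that resolved this issue: `MergeCleanupSketch`, `IoCheck`, `Decision`, and `decide` are hypothetical names, and the two checks stand in for "does the merged region directory exist?" and "does the merged region still hold reference files?". The sketch keeps the original behavior of cleaning up when the merged region is confirmed missing, but treats any IOException as "unknown" and defers cleanup to a later run.

```java
import java.io.IOException;

class MergeCleanupSketch {

  /** Hypothetical filesystem probe that may fail for transient reasons. */
  @FunctionalInterface
  interface IoCheck {
    boolean test() throws IOException;
  }

  /** Possible outcomes for one janitor pass over a merged region. */
  enum Decision { CLEAN_PARENTS, KEEP_PARENTS, RETRY_LATER }

  /**
   * Decide whether the parent regions may be archived.
   * An IOException never counts as proof that the merged region is gone.
   */
  static Decision decide(IoCheck mergedRegionDirExists, IoCheck mergedRegionHasReferences) {
    try {
      if (!mergedRegionDirExists.test()) {
        // Positively confirmed missing: same outcome as the original check.
        return Decision.CLEAN_PARENTS;
      }
      // Region exists: parents are only safe to remove once no references remain.
      return mergedRegionHasReferences.test() ? Decision.KEEP_PARENTS : Decision.CLEAN_PARENTS;
    } catch (IOException e) {
      // Transient failure (for example a NameNode failover): state is unknown, so do nothing.
      return Decision.RETRY_LATER;
    }
  }

  public static void main(String[] args) {
    // Simulates the failover case described in this report.
    Decision d = decide(() -> { throw new IOException("NameNode failover"); }, () -> true);
    System.out.println(d);
  }
}
```

The design choice is that a failed probe and a negative probe are different answers. Only the negative answer justifies an irreversible action such as archiving the parents, since the archive TTL leaves very little time to notice a mistake.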

            People

              Assignee: Bryan Beaudreault (bbeaudreault)
              Reporter: Bryan Beaudreault (bbeaudreault)