HBase / HBASE-28462

Incremental backup can fail if log gets archived while WALPlayer is starting up

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: backup&restore
    • Labels: None

    Description

      We had an incremental backup fail with a FileNotFoundException for a file in the WALs directory. Upon investigation, the log had been archived a few minutes earlier. WALInputFormat's record reader supports falling back to an archived path:

      } catch (IOException e) {
        Path archivedLog = AbstractFSWALProvider.findArchivedLog(logFile, conf);
        // archivedLog can be null if unable to locate in archiveDir.
        if (archivedLog != null) {
          openReader(archivedLog);
          // Try call again in recursion
          return nextKeyValue();
        } else {
          throw e;
        }
      } 

      But the getSplits method has different handling:

      try {
        List<FileStatus> files = getFiles(fs, inputPath, startTime, endTime);
        allFiles.addAll(files);
      } catch (FileNotFoundException e) {
        if (ignoreMissing) {
          LOG.warn("File " + inputPath + " is missing. Skipping it.");
          continue;
        }
        throw e;
      } 

      This ignoreMissing variable was added in HBASE-14141 and is enabled via wal.input.ignore.missing.files, which defaults to false and is never set. Looking at the comment and Review Board history of HBASE-14141, I think there might have been some confusion about where to handle these missing files, and this got lost in the shuffle.

      I would prefer not to ignore missing WALs. I think that could result in some weird behavior:

      • RegionServer has 10 archived and 30 not-yet-archived WALs needing to be backed up
      • The process starts, and while it's running, 1 of those 30 WALs gets archived. That WAL would be skipped due to the FileNotFoundException
      • But the remaining 29 would be backed up

      This scenario could cause data consistency issues if the incremental backup is restored: some edits would be missing from the middle of the edits applied from the other WALs.
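
The scenario can be modeled with a short sketch. It is an illustration only: the class and method names are hypothetical, WALs are represented as plain integers in write order, and nothing here is HBase API. It shows that skipping a missing file leaves a hole in the replayed history while later WALs are still applied.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical model, not HBase code: WALs are numbered 1..n in write order.
class SkippedWalGapDemo {

  // Returns the WALs a skip-on-missing backup would still copy, in order.
  static List<Integer> backedUp(int walCount, Set<Integer> archivedMidRun) {
    List<Integer> copied = new ArrayList<>();
    for (int wal = 1; wal <= walCount; wal++) {
      if (archivedMidRun.contains(wal)) {
        continue; // mirrors the ignoreMissing branch: warn and skip
      }
      copied.add(wal);
    }
    return copied;
  }

  // Returns skipped WALs that are followed by at least one copied WAL,
  // i.e. holes in the middle of the restored edit history.
  static List<Integer> holes(int walCount, Set<Integer> archivedMidRun) {
    List<Integer> copied = backedUp(walCount, archivedMidRun);
    int lastCopied = copied.isEmpty() ? 0 : copied.get(copied.size() - 1);
    List<Integer> holes = new ArrayList<>();
    for (int wal = 1; wal < lastCopied; wal++) {
      if (archivedMidRun.contains(wal)) {
        holes.add(wal);
      }
    }
    return holes;
  }

  public static void main(String[] args) {
    // 30 not-yet-archived WALs; WAL 17 is archived while the job starts up.
    System.out.println(backedUp(30, Set.of(17)).size()); // 29 WALs copied
    System.out.println(holes(30, Set.of(17)));           // [17]
  }
}
```

In this model WALs 18 through 30 are restored even though WAL 17 is absent, which is the inconsistency described above.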

      So I do think failing as we do today is necessary for consistency, but it is unrealistic in a live cluster. The solution is to try finding the missing file in the archive directory. The backup feature has a coprocessor that does not allow an archived file to be cleaned up until it has been backed up, so I think it's safe to say that a WAL is always in either WALs or oldWALs.
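
A minimal sketch of that fallback, written without any HBase dependency so it can run on its own. Everything in it is an assumption made for illustration: the class and method names are hypothetical, paths are plain strings, the `exists` predicate stands in for a filesystem check, and the archive is assumed to hold files flat under their original names. In HBase itself, the record reader snippet quoted earlier uses AbstractFSWALProvider.findArchivedLog(logFile, conf) for the lookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical model, not HBase code.
class ArchivedWalFallback {

  // Returns the live path if present, else the same file name under archiveDir.
  // Throws if the WAL is in neither location, so the backup still fails loudly
  // instead of silently skipping edits.
  static String resolve(String walPath, String archiveDir, Predicate<String> exists) {
    if (exists.test(walPath)) {
      return walPath;
    }
    String fileName = walPath.substring(walPath.lastIndexOf('/') + 1);
    String archived = archiveDir + "/" + fileName;
    if (exists.test(archived)) {
      return archived;
    }
    throw new IllegalStateException("WAL not found in WALs or archive: " + walPath);
  }

  // Resolves every input path; no input is ever dropped.
  static List<String> resolveAll(List<String> walPaths, String archiveDir,
      Predicate<String> exists) {
    List<String> resolved = new ArrayList<>();
    for (String walPath : walPaths) {
      resolved.add(resolve(walPath, archiveDir, exists));
    }
    return resolved;
  }

  public static void main(String[] args) {
    // wal.2 was archived after the input list was built.
    Set<String> fs = Set.of("/hbase/WALs/rs1/wal.1", "/hbase/oldWALs/wal.2");
    List<String> inputs = List.of("/hbase/WALs/rs1/wal.1", "/hbase/WALs/rs1/wal.2");
    System.out.println(resolveAll(inputs, "/hbase/oldWALs", fs::contains));
  }
}
```

The design point is that a missing file is either resolved to its archived location or treated as a hard failure. It is never skipped, which keeps the set of backed-up WALs complete.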

          People

            Assignee: Unassigned
            Reporter: Bryan Beaudreault (bbeaudreault)
            Votes: 0
            Watchers: 2
