HBASE-15227: HBase Backup Phase 3: Fault tolerance (client/server) support

Details

    • Task
    • Status: Closed
    • Major
    • Resolution: Fixed

    Description

      The system must tolerate faults:

      1. Backup operations MUST be atomic (no partial completion state in the backup system table)
      2. The process must detect any type of failure that can result in data loss (partial backup or partial restore)
      3. On failure, the system table state must be properly restored and cleaned up
      4. An additional utility must be implemented to repair the backup system table and clean up the corresponding file system data

      Backup

      General fault tolerance (FT) framework implementation

      Before the actual backup operation starts, a snapshot of the backup system table is taken and the system table is updated with the ACTIVE_SNAPSHOT flag. The flag is removed upon backup completion.

      In case of any server-side failure, the client catches the errors/exceptions and handles them as follows:

      1. Cleans up the backup destination (removes partial backup data)
      2. Cleans up any temporary data
      3. Deletes any active snapshots of the tables being backed up (during a full backup we snapshot the tables)
      4. Restores the backup system table from its snapshot
      5. Deletes the backup system table snapshot (the snapshot name is read from the backup system table beforehand)
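
      The failure-handling sequence above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: BackupEnv, RecordingEnv, and every method name are hypothetical stand-ins rather than HBase APIs, and the sketch assumes each cleanup step succeeds. The success path only clears the flag, because that is all the description states.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the client-side rollback flow; not HBase code. */
final class BackupRollbackSketch {

    /** Hypothetical operations the backup client relies on. */
    interface BackupEnv {
        String snapshotSystemTable();                 // returns the snapshot name
        void setActiveSnapshotFlag(String snapshotName);
        void clearActiveSnapshotFlag();
        void runBackup() throws Exception;            // may fail on the server side
        void cleanBackupDestination();
        void cleanTemporaryData();
        void deleteTableSnapshots();
        void restoreSystemTable(String snapshotName);
        void deleteSystemTableSnapshot(String snapshotName);
    }

    /** Runs a backup; on failure performs steps 1-5 and returns false. */
    static boolean backupWithRollback(BackupEnv env) {
        String snapshot = env.snapshotSystemTable();
        env.setActiveSnapshotFlag(snapshot);
        try {
            env.runBackup();
        } catch (Exception e) {
            env.cleanBackupDestination();             // 1. remove partial backup data
            env.cleanTemporaryData();                 // 2. remove temporary data
            env.deleteTableSnapshots();               // 3. drop snapshots of backed-up tables
            env.restoreSystemTable(snapshot);         // 4. roll the system table back
            env.deleteSystemTableSnapshot(snapshot);  // 5. drop the system table snapshot
            return false;
        }
        env.clearActiveSnapshotFlag();                // flag is removed on completion
        return true;
    }

    /** Test double that records the order of calls. */
    static final class RecordingEnv implements BackupEnv {
        final List<String> calls = new ArrayList<>();
        private final boolean failBackup;

        RecordingEnv(boolean failBackup) { this.failBackup = failBackup; }

        public String snapshotSystemTable() { calls.add("snapshotSystemTable"); return "sys-snapshot"; }
        public void setActiveSnapshotFlag(String s) { calls.add("setActiveSnapshotFlag"); }
        public void clearActiveSnapshotFlag() { calls.add("clearActiveSnapshotFlag"); }
        public void runBackup() throws Exception {
            calls.add("runBackup");
            if (failBackup) throw new Exception("simulated server-side failure");
        }
        public void cleanBackupDestination() { calls.add("cleanBackupDestination"); }
        public void cleanTemporaryData() { calls.add("cleanTemporaryData"); }
        public void deleteTableSnapshots() { calls.add("deleteTableSnapshots"); }
        public void restoreSystemTable(String s) { calls.add("restoreSystemTable"); }
        public void deleteSystemTableSnapshot(String s) { calls.add("deleteSystemTableSnapshot"); }
    }
}
```

      Because the system table snapshot is taken before the flag is written, restoring from it in step 4 also removes the ACTIVE_SNAPSHOT flag in this sketch.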

      In case of any client-side failures:

      Before any backup or restore operation runs, we check the backup system table for the ACTIVE_SNAPSHOT flag. If the flag is present, the operation aborts with a message that the backup repair tool (see below) must be run.
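
      A minimal sketch of such a pre-flight check is shown here. The SystemTable interface and its activeSnapshot() method are hypothetical names introduced for illustration; the real check reads the flag from the backup system table.

```java
import java.util.Optional;

/** Hypothetical sketch of the ACTIVE_SNAPSHOT pre-flight check; not HBase code. */
final class ActiveSnapshotGuardSketch {

    /** Hypothetical view of the backup system table. */
    interface SystemTable {
        /** Snapshot name stored with the ACTIVE_SNAPSHOT flag, if the flag is set. */
        Optional<String> activeSnapshot();
    }

    /** Aborts the operation if a previous session left the flag behind. */
    static void checkNoActiveSnapshot(SystemTable table) {
        Optional<String> leftover = table.activeSnapshot();
        if (leftover.isPresent()) {
            throw new IllegalStateException(
                "Previous backup session did not complete (snapshot "
                    + leftover.get() + "); run the backup repair tool first");
        }
    }
}
```

      A leftover flag means an earlier client died before it could either finish or roll back, so refusing to start keeps a new operation from building on an inconsistent system table.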

      Backup repair tool

      The command-line tool backup repair executes the following steps:

      1. Reads information about the last failed backup session
      2. Cleans up the backup destination (removes partial backup data)
      3. Cleans up any temporary data
      4. Deletes any active snapshots of the tables being backed up (during a full backup we snapshot the tables)
      5. Restores the backup system table from its snapshot
      6. Deletes the backup system table snapshot (the snapshot name is read from the backup system table beforehand)

      Detection of a partial loss of data

      Full backup

      Export snapshot operation.

      We count files and check their sizes before and after the DistCp run.
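
      One way to read this check is as a comparison of the file listing (relative path and size) on the source side against the listing at the destination after the copy. The sketch below illustrates that idea on a local file system with java.nio.file; it is an assumption-laden stand-in, since the actual backup code works against Hadoop file systems, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

/** Hypothetical local-filesystem sketch of a count-and-size check; not HBase code. */
final class CopyVerificationSketch {

    /** Maps each regular file under root (by relative path) to its size in bytes. */
    static Map<String, Long> listFileSizes(Path root) throws IOException {
        Map<String, Long> sizes = new TreeMap<>();
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile).forEach(p -> {
                try {
                    sizes.put(root.relativize(p).toString(), Files.size(p));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
        return sizes;
    }

    /** True when both trees hold the same files with the same sizes. */
    static boolean copyIsComplete(Path source, Path destination) throws IOException {
        // Map equality covers both the file count and the per-file sizes.
        return listFileSizes(source).equals(listFileSizes(destination));
    }
}
```

      Comparing sizes catches missing and truncated files but not corrupted content of the same length; a checksum comparison would be needed for that.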

      Incremental backup

      During conversion of WALs to HFiles, when a WAL file is moved from the active directory to the archive directory. Code is in place to handle this situation.

      During the DistCp run (same as above).

      Restore

      This operation does not modify the backup system table and is idempotent. No special fault tolerance handling is required.

      Attachments

        1. HBASE-15227-v3.patch
          2 kB
          Vladimir Rodionov
        2. HBASE-15277-v1.patch
          2 kB
          Vladimir Rodionov

            People

              vrodionov Vladimir Rodionov
              vrodionov Vladimir Rodionov