HBase / HBASE-27707

Region replica replication sometimes orphans WAL queue entries during recovery

Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 2.5.0
    • Fix Version/s: None
    • Component/s: read replicas, Replication
    • Labels: None

    Description

      Running with timeline-consistent read replicas and hbase.region.replica.replication.enabled=true, we're seeing some region servers with WAL queue entries that never clear. This appears to correlate with ServerCrashProcedure (SCP) and the recovery of replication queues. The result is that WALs build up, consuming dangerous amounts of space on HDFS. Remediation requires disabling and removing the region_replica_replication peer, which forces an impacted region server to abort with the message "Failed to operate on replication queue". We then delete the ZooKeeper entry, which unlocks the WALs so that the cleaner chore can sweep them.
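
Before deleting anything from ZooKeeper, it may help to list which queues belong to the region_replica_replication peer on each region server. The following is a minimal sketch, not a diagnostic taken from this report. It assumes the default ZooKeeper-backed queue layout of /hbase/replication/rs/<region server>/<queue id>/<WAL name>, and that a queue claimed from a dead server has an id of the form <peer id>-<dead server>. The function name and the sample paths are hypothetical, and the input is a plain list of znode paths gathered by whatever means is convenient.

```python
from collections import defaultdict

# Assumed default layout (not stated in the report):
#   /hbase/replication/rs/<region server>/<queue id>/<WAL name>
RS_ROOT = "/hbase/replication/rs"
PEER_ID = "region_replica_replication"


def candidate_queues(znode_paths, peer_id=PEER_ID, rs_root=RS_ROOT):
    """Group WAL znodes by (region server, queue id) for one peer.

    A queue id matches if it is the peer id itself or, for a queue
    claimed from a dead server, the peer id followed by "-<dead server>".
    """
    prefix = rs_root.rstrip("/") + "/"
    queues = defaultdict(list)
    for path in znode_paths:
        if not path.startswith(prefix):
            continue
        parts = path[len(prefix):].split("/")
        if len(parts) != 3:
            continue  # skip znodes that are not at the WAL level
        server, queue_id, wal = parts
        if queue_id == peer_id or queue_id.startswith(peer_id + "-"):
            queues[(server, queue_id)].append(wal)
    return dict(queues)


if __name__ == "__main__":
    # Hypothetical sample paths, for illustration only.
    sample = [
        "/hbase/replication/rs/rs1,16020,1/region_replica_replication/wal.1",
        "/hbase/replication/rs/rs1,16020,1/region_replica_replication-rs2,16020,2/wal.7",
        "/hbase/replication/rs/rs1,16020,1/other_peer/wal.1",
    ]
    for (server, queue_id), wals in sorted(candidate_queues(sample).items()):
        print(server, queue_id, len(wals))
```

A queue that keeps the same WAL list across repeated listings is a candidate for the stuck entries described above. The sketch only reads and groups paths and does not modify ZooKeeper.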

          People

            Assignee: Unassigned
            Reporter: Nick Dimiduk (ndimiduk)
            Votes: 0
            Watchers: 3
