Details
Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
2.5.0
None
None
Description
Running with timeline-consistent read replicas and hbase.region.replica.replication.enabled=true, we're seeing some region servers with WAL queue entries that never clear. This appears to correlate with ServerCrashProcedure (SCP) and the recovery of replication queues. The result is a buildup of WALs that consumes dangerous amounts of space on HDFS. Remediation requires disabling and removing the region_replica_replication peer, which forces an impacted region server to abort with the message "Failed to operate on replication queue". We then delete the ZooKeeper entry, which releases the WALs so the cleaner chore can sweep them.
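To help spot affected servers before HDFS fills up, here is a minimal Python sketch of how a dump of the replication queues could be triaged. It is not part of HBase. It assumes the queue data has already been exported into a plain dictionary (region server name, then queue id, then list of WAL names), and it assumes that a queue claimed from a dead server carries an id of the form `<peerId>-<deadServer>`. The function name `find_suspect_queues`, the `QueueReport` type, and the sample server names are hypothetical.

```python
from typing import Dict, List, NamedTuple

# Peer id named in this report.
PEER_ID = "region_replica_replication"


class QueueReport(NamedTuple):
    server: str       # region server that currently owns the queue
    queue_id: str     # queue id as exported from the queue storage
    recovered: bool   # True if the queue looks like one claimed from a dead server
    wal_count: int    # number of WALs still referenced by the queue


def find_suspect_queues(
    queues: Dict[str, Dict[str, List[str]]],
    min_wals: int = 1,
) -> List[QueueReport]:
    """Return queues of PEER_ID holding at least min_wals WALs, largest first.

    Assumption: a recovered queue id is the peer id followed by '-' and the
    dead server name. A different peer whose id happens to start with
    PEER_ID + '-' would be misclassified by this simple prefix check.
    """
    reports: List[QueueReport] = []
    for server, by_queue in queues.items():
        for queue_id, wals in by_queue.items():
            if queue_id == PEER_ID:
                recovered = False
            elif queue_id.startswith(PEER_ID + "-"):
                recovered = True
            else:
                continue  # queue belongs to some other peer
            if len(wals) >= min_wals:
                reports.append(
                    QueueReport(server, queue_id, recovered, len(wals))
                )
    # Largest backlog first, so the worst offenders are listed at the top.
    reports.sort(key=lambda r: r.wal_count, reverse=True)
    return reports


if __name__ == "__main__":
    # Hypothetical snapshot; server names and WAL names are made up.
    snapshot = {
        "rs1,16020,1": {
            PEER_ID: ["wal-1"],
            PEER_ID + "-rs9,16020,5": ["wal-a", "wal-b", "wal-c"],
            "1": ["wal-x"],
        },
        "rs2,16020,2": {PEER_ID: []},
    }
    for report in find_suspect_queues(snapshot):
        print(report)
```

Running the check periodically and comparing WAL counts between runs would show which queues never drain. Recovered queues that keep a non-zero count are the ones that match the symptom described above.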