[HBASE-27707] Region replica replication sometimes orphans WAL queue entries during recovery - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 2.5.0
Fix Version/s: None
Component/s: read replicas, Replication
Labels: None

Description

Running with timeline-consistent read replicas and hbase.region.replica.replication.enabled=true, we're seeing some region servers hold WAL queue entries that never clear. This appears to correlate with ServerCrashProcedure (SCP) and the recovery of replication queues. The result is WALs that build up, consuming dangerous amounts of space on HDFS. Remediation requires disabling and removing the region_replica_replication peer, which forces an impacted region server to abort with the message "Failed to operate on replication queue". We then delete the ZooKeeper entry, which releases the WALs so the cleaner chore can sweep them.
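
One way to look for candidate orphaned queues is to list the queue znodes under each region server and flag those that were claimed from a dead server. The following Python sketch is illustrative only and is not part of HBase. It assumes (this is not stated in the report) that queue IDs follow the 2.x ZooKeeper layout, where a recovered queue is named `<peerId>-<deadServer>...` and peer IDs contain no hyphen. The input mapping and the names `parse_queue_id` and `recovered_queues` are hypothetical; the listing itself would come from whatever ZooKeeper client is at hand.

```python
from typing import Dict, List, NamedTuple, Tuple


class QueueInfo(NamedTuple):
    peer_id: str
    dead_servers: str  # raw suffix after the peer id; empty for a server's own queue


def parse_queue_id(queue_id: str) -> QueueInfo:
    # ASSUMPTION: peer ids contain no '-', so the first '-' separates the
    # peer id from the dead-server suffix of a recovered queue.
    peer_id, _, dead = queue_id.partition("-")
    return QueueInfo(peer_id, dead)


def recovered_queues(
    queues_by_server: Dict[str, List[str]],
    peer_id: str = "region_replica_replication",
) -> List[Tuple[str, str]]:
    """Return (server, queue_id) pairs for recovered queues of the given peer.

    queues_by_server is a hypothetical mapping from a region server name to
    the queue ids listed under its replication znode.
    """
    hits = []
    for server, queue_ids in sorted(queues_by_server.items()):
        for queue_id in queue_ids:
            info = parse_queue_id(queue_id)
            if info.peer_id == peer_id and info.dead_servers:
                hits.append((server, queue_id))
    return hits
```

A flagged queue is only a candidate: a recovered queue that is still draining is healthy, so a queue would need to persist across repeated checks before it is treated as orphaned.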

People

Assignee: Unassigned
Reporter: Nick Dimiduk
Votes: 0
Watchers: 3

Dates

Created: 13/Mar/23 12:36
Updated: 24/Mar/23 15:05