There appears to be a race condition between close and split which when combined with a side effect of
HBASE-20704, leads to the parent region store files getting archived and cleared while daughter regions still have references to those parent region store files.
Here is the timeline of events observed for an affected region:
- RS1 faces ZooKeeper connectivity issue for master node and starts shutting itself down. As part of this it starts to close the store and clean up the compacted files (File A)
- Master starts bulk assigning regions and assign parent region to RS2
- Region opens on RS2 and ends up opening compacted store file(s) (suspect this is due to
- Now split happens and daughter regions open on RS2 and try to run a compaction as part of post open
- Split request at this point is complete. However now archiving proceeds on RS1 and ends up archiving the store file that is referenced by the daughter.
Compaction fails due to FileNotFoundException and all subsequent attempts to open the region will fail until manual resolution.
We think having
HBASE-20724 would help in such situations since we won't end up loading compacted store files in the first place.