Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Fix Version: 1.42.0
Description
There is a scenario in which a segment store can become corrupted, leading to SegmentNotFoundExceptions for very "young" SegmentIds, i.e. with ages in the one- to two-digit millisecond range (e.g. SegmentId age=2ms).
The scenario I observed looks as follows:
- a blob is "lost" from the external blob store (presumably due to incorrect cloning of the instance; this most likely only happens with unfortunate timing)
- a tail revision GC run is performed (not sure if it matters that this was a tail compaction)
- the missing blob is encountered during compaction
- an exception other than an IOException (IIRC it was an IllegalArgumentException) is thrown because of the missing blob
- revision GC fails WITHOUT being properly aborted, and thus the partially written revision of the compaction run is not removed
- more data is written on the instance
- a full revision GC run is performed
- a referenced segment is removed due to the incorrect/confused revision data
- the SegmentNotFoundException is first observed either during the remainder of the compaction run or the next time the respective node is accessed, usually during a traversal
The root cause is in AbstractCompactionStrategy, where only IOExceptions are caught.
To make the code more robust, I think we need to catch all Throwables; otherwise we cannot guarantee that compaction is correctly aborted.
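For illustration, below is a minimal, self-contained Java sketch of the control flow described above. The names used here (GcSketch, runCompaction, abortAndCleanUp) are hypothetical and simplified; they are not the actual AbstractCompactionStrategy API. The sketch only contrasts catching IOException (current behaviour, where an unexpected exception escapes without any cleanup) with catching Throwable (proposed behaviour, where compaction is always aborted).

{code:java}
import java.io.IOException;

// Hypothetical, simplified sketch; not the actual oak-segment-tar API.
public class GcSketch {

    // Simulates the compaction run hitting the missing blob: an exception
    // that is not an IOException is thrown (as observed, an IllegalArgumentException).
    static String runCompaction() throws IOException {
        throw new IllegalArgumentException("missing blob");
    }

    // Simulates aborting the run: the partially written revision would be
    // discarded and the failure reported.
    static String abortAndCleanUp(Throwable cause) {
        return "compaction aborted: " + cause;
    }

    // Current behaviour: only IOExceptions trigger the abort/cleanup path, so
    // the IllegalArgumentException escapes and the partial revision stays behind.
    static String compactCurrent() {
        try {
            return runCompaction();
        } catch (IOException e) {
            return abortAndCleanUp(e);
        }
    }

    // Proposed behaviour: catch all Throwables so compaction is always aborted
    // and cleaned up, whatever goes wrong during the run.
    static String compactProposed() {
        try {
            return runCompaction();
        } catch (Throwable t) {
            return abortAndCleanUp(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(compactProposed()); // prints "compaction aborted: ..."
        System.out.println(compactCurrent());  // IllegalArgumentException escapes
    }
}
{code}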