Jackrabbit Oak / OAK-9785

Tar SegmentStore can be corrupted during compaction

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.42.0
    • Fix Version/s: 1.46.0, 1.22.14
    • Component/s: segment-tar
    • Labels: None

    Description

      There is a scenario in which a segment store can become corrupted, leading to SegmentNotFoundExceptions for very "young" SegmentIds, i.e. with an age in the 1-2 digit millisecond range (e.g. SegmentId age=2ms).

      The scenario I observed looks as follows:

      • a blob is "lost" from the external blob store (presumably due to incorrect cloning of the instance; this most likely only happens with unfortunate timing)
      • a tail revision GC run is performed (not sure if it matters that this was a tail compaction)
        • the missing blob is encountered during compaction
        • an exception other than an IOException (IIRC it was an IllegalArgumentException) is thrown due to the missing blob
        • revision GC fails WITHOUT being properly aborted, and thus the partially written revision of the compaction run is not removed
      • more data is written on the instance
      • a full revision GC run is performed
        • a referenced segment is removed due to the incorrect/confused revision data
      • the SegmentNotFoundException is first observed either during the remainder of the compaction run or the next time the affected node is requested, usually during a traversal

      The root cause is in AbstractCompactionStrategy, where only IOExceptions are caught.

      In order to improve the robustness of the code, I think we need to catch all Throwables. Otherwise we cannot guarantee that compaction is correctly aborted.
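
      To illustrate the difference between the two catch clauses, here is a minimal, self-contained sketch. It is NOT the code of AbstractCompactionStrategy; ToyStore, Compaction, and all method names are hypothetical stand-ins that only model "partially written data must be removed when compaction fails".

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class CompactionAbortSketch {

    /** Hypothetical stand-in for a compaction run; not an Oak API. */
    interface Compaction {
        void run() throws IOException;
    }

    /** Hypothetical toy store that only tracks partially written segments. */
    static class ToyStore {
        final List<String> partialSegments = new ArrayList<>();
        boolean aborted = false;

        /** Models a proper abort: discard everything the failed run wrote. */
        void abort() {
            partialSegments.clear();
            aborted = true;
        }
    }

    /** Narrow handler: only an IOException triggers the abort. */
    static boolean compactCatchingIOException(ToyStore store, Compaction c) {
        try {
            c.run();
            return true;
        } catch (IOException e) {
            store.abort();
            return false;
        }
    }

    /** Broad handler: any Throwable triggers the abort. */
    static boolean compactCatchingThrowable(ToyStore store, Compaction c) {
        try {
            c.run();
            return true;
        } catch (Throwable t) {
            store.abort();
            return false;
        }
    }

    /** Writes one partial segment, then fails with an unchecked exception. */
    static Compaction failingOnMissingBlob(ToyStore store) {
        return () -> {
            store.partialSegments.add("segment-1");
            throw new IllegalArgumentException("missing blob");
        };
    }

    public static void main(String[] args) {
        ToyStore narrow = new ToyStore();
        try {
            compactCatchingIOException(narrow, failingOnMissingBlob(narrow));
        } catch (IllegalArgumentException e) {
            // The unchecked exception bypasses the IOException handler.
        }
        System.out.println("narrow: aborted=" + narrow.aborted
                + ", partial=" + narrow.partialSegments);

        ToyStore broad = new ToyStore();
        boolean ok = compactCatchingThrowable(broad, failingOnMissingBlob(broad));
        System.out.println("broad:  success=" + ok + ", aborted=" + broad.aborted
                + ", partial=" + broad.partialSegments);
    }
}
```

      In this toy model the narrow handler leaves "segment-1" behind because the IllegalArgumentException never reaches the abort call, while the broad handler cleans up. Whether an actual fix should rethrow after aborting (for example for Errors such as OutOfMemoryError) is a design decision this sketch does not address.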

      Attachments

        1. error.log.2022-06-09
          47 kB
          Julian Sedding

          People

            Assignee: Julian Sedding (jsedding)
            Reporter: Julian Sedding (jsedding)