Spark / SPARK-26425

Add more constraint checks in file streaming source to avoid checkpoint corruption

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.0
    • Component/s: Structured Streaming
    • Labels: None

    Description

      Two issues were observed in production.

      • HDFSMetadataLog.getLatest() falls back to reading older versions when it cannot read the latest listed version file. It is unclear why this fallback was added, but it should be removed. If the latest listed file is not readable, something is seriously wrong, and we should fail rather than report an older version, since that can completely corrupt the checkpoint directory.
      • FileStreamSource should check whether adding a new batch to the FileStreamSourceLog succeeded (similar to how StreamExecution checks the OffsetSeqLog).
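
      To make the two proposed checks concrete, here is a minimal, self-contained sketch in Python. It is a toy model, not Spark's code: ToyMetadataLog, commit_batch, and the one-file-per-batch layout are hypothetical stand-ins chosen for illustration. The only assumptions carried over from Spark are that the log's add operation reports whether the write succeeded and that getLatest returns the highest listed batch.

```python
import os
import tempfile


class ToyMetadataLog:
    """Hypothetical stand-in for a metadata log that stores one file per batch id."""

    def __init__(self, path):
        self.path = path
        os.makedirs(path, exist_ok=True)

    def add(self, batch_id, metadata):
        """Write the batch file; return False if that batch already exists."""
        target = os.path.join(self.path, str(batch_id))
        try:
            # Mode "x" refuses to overwrite, so an existing batch is left untouched.
            with open(target, "x") as f:
                f.write(metadata)
            return True
        except FileExistsError:
            return False

    def get(self, batch_id):
        """Return the stored metadata, or None if the batch file cannot be found."""
        target = os.path.join(self.path, str(batch_id))
        try:
            with open(target) as f:
                return f.read()
        except FileNotFoundError:
            return None

    def get_latest(self):
        """Return (batch_id, metadata) for the highest listed batch, or None if empty.

        Check 1: if the latest listed batch is unreadable, fail instead of
        silently falling back to an older batch.
        """
        batch_ids = sorted(
            (int(name) for name in os.listdir(self.path) if name.isdigit()),
            reverse=True,
        )
        if not batch_ids:
            return None
        latest = batch_ids[0]
        metadata = self.get(latest)
        if metadata is None:
            raise RuntimeError(
                f"Batch {latest} is listed but could not be read; refusing to "
                "fall back to an older batch"
            )
        return latest, metadata


def commit_batch(log, batch_id, metadata):
    """Check 2: treat a failed add as an error rather than ignoring the result."""
    if not log.add(batch_id, metadata):
        raise RuntimeError(
            f"Concurrent update to the log: batch {batch_id} already exists"
        )


if __name__ == "__main__":
    log = ToyMetadataLog(tempfile.mkdtemp())
    commit_batch(log, 0, "files for batch 0")
    print(log.get_latest())
    try:
        commit_batch(log, 0, "conflicting write")
    except RuntimeError as err:
        print("rejected:", err)
```

      In this sketch both failure modes surface as exceptions, so a caller cannot keep running on top of a stale or conflicting log entry.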

            People

              Assignee: kabhwan Jungtaek Lim
              Reporter: tdas Tathagata Das
              Votes: 0
              Watchers: 2
