Spark / SPARK-30294

Read-only state store unnecessarily creates and deletes the temp file for delta file every batch


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.1.0
    • Component/s: Structured Streaming
    • Labels: None

    Description

      https://github.com/apache/spark/blob/d38f8167483d4d79e8360f24a8c0bffd51460659/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala#L143-L155

          /** Abort all the updates made on this store. This store will not be usable any more. */
          override def abort(): Unit = {
            // This if statement is to ensure that files are deleted only if there are changes to the
            // StateStore. We have two StateStores for each task, one which is used only for reading, and
            // the other used for read+write. We don't want the read-only to delete state files.
            if (state == UPDATING) {
              state = ABORTED
              cancelDeltaFile(compressedStream, deltaFileStream)
            } else {
              state = ABORTED
            }
            logInfo(s"Aborted version $newVersion for $this")
          } 

      Despite the comment, the read-only state store still performs the same write preparation on every batch: it creates the temporary file, opens output streams for it, closes those streams, and then deletes the temporary file. This work is unnecessary, and it is also confusing, because the resulting log messages make it look as if two different instances are writing to the same delta file.
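      One way to avoid this (a minimal sketch of the lazy-initialization idea, not the actual Spark patch; the class and method names here are hypothetical) is to defer creating the temp file and its output streams until the first write, so a store that is only ever read from never touches the filesystem and has nothing to clean up on abort:

```scala
import java.io.OutputStream

// Hypothetical sketch: a writer that opens its temp-file stream lazily.
// `openTempFile` stands in for creating the delta temp file and wrapping
// it in (compressed) output streams, as HDFSBackedStateStoreProvider does.
class DeltaFileWriter(openTempFile: () => OutputStream) {
  // No stream is created at construction time, so a read-only store
  // never creates (or has to delete) a temp file.
  private var stream: Option[OutputStream] = None

  def put(bytes: Array[Byte]): Unit = {
    val out = stream.getOrElse {
      val s = openTempFile() // first write: create temp file + streams now
      stream = Some(s)
      s
    }
    out.write(bytes)
  }

  /** Abort the writer; returns true only if a write actually happened,
    * so the caller deletes the temp file only in that case. */
  def abort(): Boolean = {
    val hadWrites = stream.isDefined
    stream.foreach(_.close())
    stream = None
    hadWrites
  }
}
```

With this shape, `abort()` on a read-only store is a pure state transition, and the "cancel delta file" path runs only for stores that were actually updated.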

       


          People

            Assignee: Jungtaek Lim (kabhwan)
            Reporter: Jungtaek Lim (kabhwan)
            Votes: 0
            Watchers: 4
