  1. Spark
  2. SPARK-29314

ProgressReporter.extractStateOperatorMetrics should not overwrite updated as 0 when it actually runs a batch even with no data


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.4, 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: Structured Streaming
    • Labels: None

    Description

      SPARK-24156 introduced the ability to run a batch without actual data, which enables fast state cleanup and lets evicted outputs be emitted without waiting for new data to arrive.

      This breaks an assumption in `ProgressReporter.extractStateOperatorMetrics`. See the comment in the source code:

      // lastExecution could belong to one of the previous triggers if `!hasNewData`.
      // Walking the plan again should be inexpensive.
      

      and `newNumRowsUpdated` is replaced with 0 if `hasNewData` is false. That makes sense when the progress is copied from the previous execution (meaning no batch was run this time), but after SPARK-24156 the precondition no longer holds: a batch can run even when `hasNewData` is false.

      Spark should still replace the value of `newNumRowsUpdated` with 0 when no batch is run and the old value has to be copied from the previous execution, but it should leave the value untouched when it runs a batch with no data.
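
The intended behavior can be illustrated with a minimal Scala sketch. This is not Spark's actual code: `StateOpProgress`, `ProgressSketch`, and the `batchRan` flag are hypothetical names introduced only for illustration, and the real `ProgressReporter` collects its metrics from the executed plan rather than from a plain sequence. The sketch assumes the caller knows whether a batch was actually executed in the current trigger, independently of whether new data arrived.

```scala
// Hypothetical, simplified stand-in for a state operator's progress record.
// It is NOT Spark's StateOperatorProgress class.
final case class StateOpProgress(numRowsTotal: Long, numRowsUpdated: Long)

object ProgressSketch {
  /**
   * Returns the state operator metrics to report for the current trigger.
   *
   * @param lastExecution metrics taken from the most recent execution, which
   *                      may belong to an earlier trigger
   * @param batchRan      true if this trigger actually executed a batch,
   *                      whether or not it had input data (assumed flag)
   */
  def extractStateOperatorMetrics(
      lastExecution: Seq[StateOpProgress],
      batchRan: Boolean): Seq[StateOpProgress] = {
    if (batchRan) {
      // A batch ran (possibly with no data): report its metrics unchanged.
      lastExecution
    } else {
      // No batch ran: the metrics are copied from an earlier execution,
      // so the "updated" count would be stale and is reset to 0.
      lastExecution.map(_.copy(numRowsUpdated = 0L))
    }
  }
}
```

With these definitions, calling `ProgressSketch.extractStateOperatorMetrics(metrics, batchRan = true)` returns `metrics` as-is, while `batchRan = false` returns the same entries with `numRowsUpdated` set to 0. The point of the design is that the decision is keyed on whether a batch was executed, not on whether new data was available.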

            People

              Assignee: Jungtaek Lim (kabhwan)
              Reporter: Jungtaek Lim (kabhwan)