SPARK-29314

ProgressReporter.extractStateOperatorMetrics should not overwrite updated as 0 when it actually runs a batch even with no data

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.4, 3.0.0
    • Fix Version/s: None
    • Component/s: Structured Streaming
    • Labels: None

      Description

      SPARK-24156 added the ability to run a batch without actual data, which enables fast state cleanup and lets evicted outputs be emitted without waiting for new data to arrive.

      This breaks an assumption in `ProgressReporter.extractStateOperatorMetrics`. See the comment in the source code:

      // lastExecution could belong to one of the previous triggers if `!hasNewData`.
      // Walking the plan again should be inexpensive.
      

      Following that comment, `newNumRowsUpdated` is replaced with 0 when `hasNewData` is false. That makes sense when progress is copied from the previous execution (meaning no batch ran for this trigger), but since SPARK-24156 that precondition no longer holds: a batch can run even when `hasNewData` is false.

      Spark should still replace `newNumRowsUpdated` with 0 when no batch ran and the old value has to be copied from the previous execution, but it should leave the value untouched when it actually ran a batch with no data.
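
      The distinction can be illustrated with a minimal, self-contained sketch. It is not Spark's actual code: `OperatorProgress` is a simplified stand-in for the real progress class, and the `batchExecuted` flag is a hypothetical name for "a batch actually ran in this trigger". How that condition would be determined inside `ProgressReporter` is left open here.

```scala
// Simplified stand-in for Spark's state operator progress; NOT the real class.
final case class OperatorProgress(numRowsTotal: Long, numRowsUpdated: Long)

object StateMetricsSketch {
  // Behavior described in this issue: the updated-row count is zeroed
  // whenever the trigger had no new data, even if a batch actually ran.
  def current(last: Seq[OperatorProgress], hasNewData: Boolean): Seq[OperatorProgress] =
    if (hasNewData) last else last.map(_.copy(numRowsUpdated = 0L))

  // Proposed behavior (assumption: a hypothetical `batchExecuted` flag is available):
  // zero the count only when no batch ran and the metrics are stale copies
  // taken from a previous execution.
  def proposed(last: Seq[OperatorProgress], batchExecuted: Boolean): Seq[OperatorProgress] =
    if (batchExecuted) last else last.map(_.copy(numRowsUpdated = 0L))
}
```

      With this sketch, a no-data batch that reports a non-zero updated count keeps that count under `proposed(..., batchExecuted = true)`, while `current(..., hasNewData = false)` resets it to 0.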

              People

              • Assignee: Unassigned
              • Reporter: Jungtaek Lim (kabhwan)