Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-23827

StreamingJoinExec should ensure that input data is partitioned into specific number of partitions

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • 2.3.0
    • 2.3.1, 2.4.0
    • Structured Streaming
    • None

    Description

      Currently, the requiredChildDistribution does not specify the partitions. This can cause the weird corner cases where the child's distribution is `SinglePartition` which satisfies the required distribution of `ClusterDistribution(no-num-partition-requirement)`, thus eliminating the shuffle needed to repartition input data into the required number of partitions (i.e. same as state stores). That can lead to "file not found" errors on the state store delta files as the micro-batch-with-no-shuffle will not run certain tasks and therefore not generate the expected state store delta files.

      Attachments

        Activity

          People

            tdas Tathagata Das
            tdas Tathagata Das
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: