Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-41162

Anti-join must not be pushed below aggregation with ambiguous predicates

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 3.0.3, 3.1.3, 3.3.1, 3.2.3, 3.4.0
    • 3.2.4, 3.3.2, 3.4.0
    • SQL

    Description

      The following query should return a single row as all values for id except for the largest will be eliminated by the anti-join:

      val ids = Seq(1, 2, 3).toDF("id").distinct()
      val result = ids.withColumn("id", $"id" + 1).join(ids, Seq("id"), "left_anti").collect()
      assert(result.length == 1)
      

      Without the distinct(), the assertion is true. With distinct(), the assertion should still hold but is false.

      Rule PushDownLeftSemiAntiJoin pushes the Join below the left Aggregate with join condition (id#750 + 1) = id#750, which can never be true.

      === Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin ===
      !Join LeftAnti, (id#752 = id#750)                  'Aggregate [id#750], [(id#750 + 1) AS id#752]
      !:- Aggregate [id#750], [(id#750 + 1) AS id#752]   +- 'Join LeftAnti, ((id#750 + 1) = id#750)
      !:  +- LocalRelation [id#750]                         :- LocalRelation [id#750]
      !+- Aggregate [id#750], [id#750]                      +- Aggregate [id#750], [id#750]
      !   +- LocalRelation [id#750]                            +- LocalRelation [id#750]
      

      The optimizer then rightly removes the left-anti join altogether, returning the left child only.

      Rule PushDownLeftSemiAntiJoin should not push down predicates that reference left and right child.

      Attachments

        Activity

          People

            enricomi Enrico Minack
            enricomi Enrico Minack
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: