Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-21479

Outer join filter pushdown in null supplying table when condition is on one of the joined columns

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.1.0, 2.1.1, 2.2.0
    • 2.4.0
    • Optimizer, SQL
    • None

    Description

      Here are two different query plans -

      df1 = spark.createDataFrame([{ "a": 1, "b" : 2}, { "a": 3, "b" : 4}])
      df2 = spark.createDataFrame([{ "a": 1, "c" : 5}, { "a": 3, "c" : 6}, { "a": 5, "c" : 8}])
      
      df1.join(df2, ['a'], 'right_outer').where("b = 2").explain()
      
      == Physical Plan ==
      *Project [a#16299L, b#16295L, c#16300L]
      +- *SortMergeJoin [a#16294L], [a#16299L], Inner
         :- *Sort [a#16294L ASC NULLS FIRST], false, 0
         :  +- Exchange hashpartitioning(a#16294L, 4)
         :     +- *Filter ((isnotnull(b#16295L) && (b#16295L = 2)) && isnotnull(a#16294L))
         :        +- Scan ExistingRDD[a#16294L,b#16295L]
         +- *Sort [a#16299L ASC NULLS FIRST], false, 0
            +- Exchange hashpartitioning(a#16299L, 4)
               +- *Filter isnotnull(a#16299L)
                  +- Scan ExistingRDD[a#16299L,c#16300L]
      
      
      df1 = spark.createDataFrame([{ "a": 1, "b" : 2}, { "a": 3, "b" : 4}])
      df2 = spark.createDataFrame([{ "a": 1, "c" : 5}, { "a": 3, "c" : 6}, { "a": 5, "c" : 8}])
      
      df1.join(df2, ['a'], 'right_outer').where("a = 1").explain()
      
      == Physical Plan ==
      *Project [a#16314L, b#16310L, c#16315L]
      +- SortMergeJoin [a#16309L], [a#16314L], RightOuter
         :- *Sort [a#16309L ASC NULLS FIRST], false, 0
         :  +- Exchange hashpartitioning(a#16309L, 4)
         :     +- Scan ExistingRDD[a#16309L,b#16310L]
         +- *Sort [a#16314L ASC NULLS FIRST], false, 0
            +- Exchange hashpartitioning(a#16314L, 4)
               +- *Filter (isnotnull(a#16314L) && (a#16314L = 1))
                  +- Scan ExistingRDD[a#16314L,c#16315L]
      

      If condition on b can be pushed down on df1 then why not condition on a?

      Attachments

        Issue Links

          Activity

            People

              maryannxue Wei Xue
              anbhole Abhijit Bhole
              Votes:
              2 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: