Spark / SPARK-34794

Nested higher-order functions broken in DSL


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.1, 3.2.0
    • Fix Version/s: 3.0.3, 3.1.2, 3.2.0
    • Component/s: SQL

    Description

      In Spark 3, if I have:

      import spark.implicits._ // provides toDF and the $"..." column syntax

      val df = Seq(
          (Seq(1,2,3), Seq("a", "b", "c"))
      ).toDF("numbers", "letters")
      

      and I want to take the cross product of these two arrays, I can do the following in SQL:

      df.selectExpr("""
          FLATTEN(
              TRANSFORM(
                  numbers,
                  number -> TRANSFORM(
                      letters,
                      letter -> (number AS number, letter AS letter)
                  )
              )
          ) AS zipped
      """).show(false)
      +------------------------------------------------------------------------+
      |zipped                                                                  |
      +------------------------------------------------------------------------+
      |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
      +------------------------------------------------------------------------+
      

      This works fine. But when I try the equivalent using the Scala DSL, the result is wrong:

      import org.apache.spark.sql.Column
      import org.apache.spark.sql.{functions => f}

      df.select(
          f.flatten(
              f.transform(
                  $"numbers",
                  (number: Column) => { f.transform(
                      $"letters",
                      (letter: Column) => { f.struct(
                          number.as("number"),
                          letter.as("letter")
                      ) }
                  ) }
              )
          ).as("zipped")
      ).show(10, false)
      +------------------------------------------------------------------------+
      |zipped                                                                  |
      +------------------------------------------------------------------------+
      |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
      +------------------------------------------------------------------------+
      

      Note that the numbers are not included in the output; both struct fields repeat the inner letter. The explain for this second version is:

      == Parsed Logical Plan ==
      'Project [flatten(transform('numbers, lambdafunction(transform('letters, lambdafunction(struct(NamePlaceholder, lambda 'x AS number#442, NamePlaceholder, lambda 'x AS letter#443), lambda 'x, false)), lambda 'x, false))) AS zipped#444]
      +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
         +- LocalRelation [_1#303, _2#304]
      
      == Analyzed Logical Plan ==
      zipped: array<struct<number:string,letter:string>>
      Project [flatten(transform(numbers#308, lambdafunction(transform(letters#309, lambdafunction(struct(number, lambda x#446, letter, lambda x#446), lambda x#446, false)), lambda x#445, false))) AS zipped#444]
      +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
         +- LocalRelation [_1#303, _2#304]
      
      == Optimized Logical Plan ==
      LocalRelation [zipped#444]
      
      == Physical Plan ==
      LocalTableScan [zipped#444]
      

      It seems the lambda variable name x is hardcoded. And sure enough: https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3647
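      For intuition, here is a Spark-free sketch of why a hardcoded lambda variable name breaks nesting (the object and environment map below are invented for illustration and are not Spark internals): when every nested lambda binds the same name, the inner binding shadows the outer one, so every reference resolves to the innermost value, just as both struct fields above end up bound to the inner lambda's x.

      ```scala
      // Toy name-resolution analogy, no Spark required.
      object ShadowingDemo {
        // A toy "environment": variable name -> bound value.
        type Env = Map[String, String]

        // Resolving a variable just looks the name up in the current environment.
        def lookup(env: Env, name: String): String = env(name)

        def main(args: Array[String]): Unit = {
          // Outer lambda binds "x" to a number...
          val outer: Env = Map("x" -> "1")
          // ...and the inner lambda re-binds the SAME name "x" to a letter,
          // shadowing the outer binding.
          val inner: Env = outer + ("x" -> "a")

          // Both struct fields reference "x", so both see the inner binding:
          val struct = (lookup(inner, "x"), lookup(inner, "x"))
          println(struct) // both fields resolve to the letter, as in the {a, a} rows
        }
      }
      ```

      With distinct names per lambda (which is what the fix introduces), the outer binding stays reachable from the inner scope.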

      People

        Assignee: Daniel Solow (dsolow)
        Reporter: Daniel Solow (dsolow1)
        Votes: 0
        Watchers: 3