  1. Spark
  2. SPARK-47927

Nullability after join not respected in UDF

Details

    Description

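      // spark-shell already provides these imports; a standalone app needs them (assumes a SparkSession named spark).
      import org.apache.spark.sql.functions.{struct, udf}
      import spark.implicits._

      val ds1 = Seq(1).toDS()
      val ds2 = Seq[Int]().toDS()
      val f = udf[(Int, Option[Int]), (Int, Option[Int])](identity)
      // Same outer join and struct: first passed through the identity UDF, then shown directly.
      ds1.join(ds2, ds1("value") === ds2("value"), "outer").select(f(struct(ds1("value"), ds2("value")))).show()
      ds1.join(ds2, ds1("value") === ds2("value"), "outer").select(struct(ds1("value"), ds2("value"))).show()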

      outputs:

      +---------------------------------------+
      |UDF(struct(value, value, value, value))|
      +---------------------------------------+
      |                                 {1, 0}|
      +---------------------------------------+
      
      +--------------------+
      |struct(value, value)|
      +--------------------+
      |           {1, NULL}|
      +--------------------+ 

      So when the join result is passed to the UDF, the nullability introduced by the outer join is not respected, and we incorrectly end up with a 0 instead of a null/None value.
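
      As a rough illustration of why the wrong value is 0 rather than something else, the plain-Scala sketch below shows that unboxing a null as a primitive Int yields the default 0, while a null check yields None. It does not use Spark, the object and helper names are hypothetical, and it is only an assumption that a similar non-nullable read is what happens inside the UDF input conversion.

```scala
object NullUnboxingSketch {
  // Hypothetical helper: treats the value as non-nullable and unboxes it directly.
  def readAsNonNullable(boxed: java.lang.Integer): Option[Int] =
    Some(boxed.asInstanceOf[Int]) // a null unboxes to the primitive default, 0

  // Hypothetical helper: treats the value as nullable and checks for null first.
  def readAsNullable(boxed: java.lang.Integer): Option[Int] =
    Option(boxed).map(_.intValue) // a null becomes None

  def main(args: Array[String]): Unit = {
    val missing: java.lang.Integer = null
    println(readAsNonNullable(missing)) // prints Some(0)
    println(readAsNullable(missing))    // prints None
  }
}
```

      The first helper mirrors the {1, 0} row in the UDF output above, and the second mirrors the expected {1, NULL} row.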

            People

              eejbyfeldt Emil Ejbyfeldt