Spark / SPARK-28480

Types of input parameters of a UDF affect the ability to cache the result

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.3.1
    • Fix Version/s: 2.4.3
    • Component/s: Spark Core
    • Labels:
      None

      Description

      When I define a UDF parameter as Boolean or Int, the resulting DataFrame can't be cached:

      import org.apache.spark.sql.Column
      import org.apache.spark.sql.expressions.UserDefinedFunction
      import org.apache.spark.sql.functions.{lit, udf}
      val empty = sparkSession.emptyDataFrame
      val table = "table"
      
      def test(customUDF: UserDefinedFunction, col: Column): Unit = {
        val df = empty.select(customUDF(col))
        df.cache()
        df.createOrReplaceTempView(table)
        println(sparkSession.catalog.isCached(table))
      }
      
      test(udf { _: String => 42 }, lit("")) // true
      test(udf { _: Any => 42 }, lit("")) // true
      test(udf { _: Int => 42 }, lit(42)) // false
      test(udf { _: Boolean => 42 }, lit(false)) // false
      

      Either that, or sparkSession.catalog.isCached reports incorrect information.
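
To check whether this behaviour appears on a particular Spark version, the snippet above can be wrapped in a standalone program. The following is only a minimal sketch under a few assumptions: the object name CacheRepro, the local[1] master, and the per-case view names are made up for illustration. It also prints Dataset.storageLevel as a second indicator next to catalog.isCached, which may help tell an uncached plan apart from a cache lookup that fails to match.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{lit, udf}

// Hypothetical wrapper around the reporter's snippet; names are illustrative.
object CacheRepro {
  // Each case pairs a label with a UDF and a literal argument of the matching type.
  val cases: Seq[(String, UserDefinedFunction, Column)] = Seq(
    ("String", udf { _: String => 42 }, lit("")),
    ("Any", udf { _: Any => 42 }, lit("")),
    ("Int", udf { _: Int => 42 }, lit(42)),
    ("Boolean", udf { _: Boolean => 42 }, lit(false))
  )

  def main(args: Array[String]): Unit = {
    // Assumption: a local single-threaded session is enough to observe the caching state.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-28480-repro")
      .getOrCreate()
    try {
      println(s"Spark version: ${spark.version}")
      cases.foreach { case (label, customUDF, arg) =>
        val view = s"udf_${label.toLowerCase}"
        val df = spark.emptyDataFrame.select(customUDF(arg))
        df.cache()
        df.createOrReplaceTempView(view)
        // Report both indicators side by side for each parameter type.
        println(s"$label: isCached=${spark.catalog.isCached(view)}, storageLevel=${df.storageLevel}")
        // Clean up so one case cannot influence the next.
        spark.catalog.dropTempView(view)
        df.unpersist()
      }
    } finally {
      spark.stop()
    }
  }
}
```

Running the same program against 2.3.1 and a later release would show whether the two indicators disagree for the Int and Boolean cases.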

        Attachments

        1. image-2019-07-23-10-58-45-768.png
          128 kB
          Shivu Sondur

            People

            • Assignee:
              Unassigned
            • Reporter:
              Ivan Tsukanov (itsukanov)
            • Votes:
              0
            • Watchers:
              2
