  1. Spark
  2. SPARK-28480

Types of input parameters of a UDF affect the ability to cache the result

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • 2.3.1
    • 2.4.3
    • Component/s: Spark Core
    • None

    Description

      When I declare a UDF parameter as Boolean or Int, the resulting DataFrame can't be cached:

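      import org.apache.spark.sql.Column
      import org.apache.spark.sql.expressions.UserDefinedFunction
      import org.apache.spark.sql.functions.{lit, udf}

      // `sparkSession` is an already-created SparkSession.
      val empty = sparkSession.emptyDataFrame
      val table = "table"

      // Builds a DataFrame from the UDF, caches it, registers it as a temp view,
      // and prints whether the catalog reports that view as cached.
      def test(customUDF: UserDefinedFunction, col: Column): Unit = {
        val df = empty.select(customUDF(col))
        df.cache()
        df.createOrReplaceTempView(table)
        println(sparkSession.catalog.isCached(table))
      }

      // Trailing comments show the value printed for each call.
      test(udf { _: String => 42 }, lit("")) // true
      test(udf { _: Any => 42 }, lit("")) // true
      test(udf { _: Int => 42 }, lit(42)) // false
      test(udf { _: Boolean => 42 }, lit(false)) // false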
      

      or else sparkSession.catalog.isCached reports misleading information.
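
      To tell the two possibilities above apart, one option is to compare what the catalog reports with the storage level recorded on the Dataset itself. The following is only a minimal sketch under stated assumptions: it starts its own local SparkSession, the object name CacheCheck and the helper isMarkedCached are hypothetical names introduced here, and no particular output is claimed, since the issue was resolved as Cannot Reproduce.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{lit, udf}
import org.apache.spark.storage.StorageLevel

object CacheCheck {
  // Hypothetical helper (not part of Spark): true when the Dataset
  // reports a storage level other than NONE, i.e. it is marked as persisted.
  def isMarkedCached(df: DataFrame): Boolean =
    df.storageLevel != StorageLevel.NONE

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session is acceptable for the check.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("cache-check")
      .getOrCreate()

    // Same shape as one of the cases in the report: a UDF taking an Int.
    val intUdf = udf { (_: Int) => 42 }
    val df = spark.emptyDataFrame.select(intUdf(lit(42)))

    df.cache()
    df.createOrReplaceTempView("table")

    // Print both views of the cache state side by side for comparison.
    println(s"Dataset.storageLevel:    ${df.storageLevel}")
    println(s"isMarkedCached(df):      ${isMarkedCached(df)}")
    println(s"catalog.isCached(table): ${spark.catalog.isCached("table")}")

    spark.stop()
  }
}
```

      If the two checks disagree for the Int or Boolean cases, that would point at the catalog lookup rather than at caching itself; if both report the DataFrame as not cached, the cache call is the more likely suspect.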

      Attachments

        1. image-2019-07-23-10-58-45-768.png
          128 kB
          Shivu Sondur

          People

            Assignee: Unassigned
            Reporter: Ivan Tsukanov (itsukanov)
            Votes: 0
            Watchers: 2
