SPARK-23352 (Spark; sub-task of SPARK-22216 Improving PySpark/Pandas interoperability)

Explicitly specify supported types in Pandas UDFs

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.0, 2.4.0
    • Component/s: PySpark
    • Labels: None

    Description

      Currently, we don't support BinaryType in Pandas UDFs:

      >>> from pyspark.sql.functions import pandas_udf
      >>> pudf = pandas_udf(lambda x: x, "binary")
      >>> df = spark.createDataFrame([[bytearray(b"a")]])
      >>> df.select(pudf("_1")).show()
      ...
      TypeError: Unsupported type in conversion to Arrow: BinaryType
      

      Also, grouped aggregate Pandas UDFs fail fast on ArrayType, but it seems we can support this case.

      We should document the supported types for Pandas UDFs more clearly, and fail fast by checking types up front rather than at execution time.

      Please consider this case:

      pandas_udf(lambda x: x, BinaryType())  # we can fail fast at this stage because we know the schema ahead
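
      The fail-fast idea above could look roughly like the following. This is only a sketch under stated assumptions, not the change made in Spark: the type classes are stand-ins that mimic the names in pyspark.sql.types so the example runs without Spark, checked_pandas_udf and check_supported are hypothetical helpers, the unsupported set lists only the BinaryType case named in this issue, and rejecting a BinaryType nested in an array is an assumption.

```python
from dataclasses import dataclass


# Stand-in type classes so the sketch runs without a Spark installation.
# They only mimic names from pyspark.sql.types; they are NOT the real classes.
@dataclass(frozen=True)
class IntegerType:
    pass


@dataclass(frozen=True)
class BinaryType:
    pass


@dataclass(frozen=True)
class ArrayType:
    elementType: object


# Illustrative only: BinaryType is the one type this issue names as unsupported.
UNSUPPORTED = (BinaryType,)


def check_supported(data_type):
    """Raise TypeError if data_type, or its element type, is unsupported."""
    if isinstance(data_type, UNSUPPORTED):
        raise TypeError(
            "Unsupported type in conversion to Arrow: %s" % type(data_type).__name__
        )
    # Assumption: an unsupported type nested inside an array is rejected too.
    if isinstance(data_type, ArrayType):
        check_supported(data_type.elementType)


def checked_pandas_udf(func, return_type):
    """Hypothetical wrapper: validate the return type when the UDF is defined."""
    check_supported(return_type)  # raises here, before any job is submitted
    return func
```

      With this shape, checked_pandas_udf(lambda x: x, BinaryType()) raises TypeError at definition time, while checked_pandas_udf(lambda x: x, IntegerType()) returns the function unchanged.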
      

          People

            Assignee: gurwls223 Hyukjin Kwon
            Reporter: gurwls223 Hyukjin Kwon