Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-22216 Improving PySpark/Pandas interoperability
  3. SPARK-23334

Fix pandas_udf with return type StringType() to handle str type properly in Python 2.

    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • 2.3.0
    • 2.3.0
    • PySpark, SQL
    • None

    Description

      In Python 2, when pandas_udf tries to return string type value created in the udf with "..", the execution fails. E.g.,

      from pyspark.sql.functions import pandas_udf, col
      import pandas as pd
      
      df = spark.range(10)
      str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string")
      df.select(str_f(col('id'))).show()
      

      raises the following exception:

      ...
      
      java.lang.AssertionError: assertion failed: Invalid schema from pandas_udf: expected StringType, got BinaryType
      	at scala.Predef$.assert(Predef.scala:170)
      	at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:93)
      
      ...
      

      Seems like pyarrow ignores type parameter for pa.Array.from_pandas() and consider it as binary type when the type is string type and the string values are str instead of unicode in Python 2.

      Attachments

        Activity

          People

            ueshin Takuya Ueshin
            ueshin Takuya Ueshin
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: