[SPARK-23334] Fix pandas_udf with return type StringType() to handle str type properly in Python 2. - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Sub-task
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 2.3.0
Fix Version/s: 2.3.0
Component/s: PySpark, SQL
Labels:
None

Target Version/s:

2.3.0

Description

In Python 2, when pandas_udf tries to return string type value created in the udf with "..", the execution fails. E.g.,

from pyspark.sql.functions import pandas_udf, col
import pandas as pd

df = spark.range(10)
str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string")
df.select(str_f(col('id'))).show()

raises the following exception:

...

java.lang.AssertionError: assertion failed: Invalid schema from pandas_udf: expected StringType, got BinaryType
	at scala.Predef$.assert(Predef.scala:170)
	at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:93)

...

Seems like pyarrow ignores type parameter for pa.Array.from_pandas() and consider it as binary type when the type is string type and the string values are str instead of unicode in Python 2.

Attachments

Issue Links

links to

[Github] Pull Request #20507 (ueshin)

Activity

People

Assignee:: Takuya Ueshin

Reporter:: Takuya Ueshin

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 05/Feb/18 09:50

Updated:: 12/Dec/22 18:10

Resolved:: 06/Feb/18 09:31