Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 2.3.1
Description
Category types need to be handled when calling createDataFrame with Arrow enabled, or when a pandas_udf returns categorical data. Without Arrow, Spark converts each element to its category value. For example:
In [1]: import pandas as pd

In [2]: pdf = pd.DataFrame({"A": [u"a", u"b", u"c", u"a"]})

In [3]: pdf["B"] = pdf["A"].astype('category')

In [4]: pdf
Out[4]:
   A  B
0  a  a
1  b  b
2  c  c
3  a  a

In [5]: pdf.dtypes
Out[5]:
A      object
B    category
dtype: object

In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)

In [8]: df = spark.createDataFrame(pdf)

In [9]: df.show()
+---+---+
|  A|  B|
+---+---+
|  a|  a|
|  b|  b|
|  c|  c|
|  a|  a|
+---+---+

In [10]: df.printSchema()
root
 |-- A: string (nullable = true)
 |-- B: string (nullable = true)

In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)

In [19]: df = spark.createDataFrame(pdf)
   1667         spark_type = ArrayType(from_arrow_type(at.value_type))
   1668     else:
-> 1669         raise TypeError("Unsupported type in conversion from Arrow: " + str(at))
   1670     return spark_type
   1671

TypeError: Unsupported type in conversion from Arrow: dictionary<values=string, indices=int8, ordered=0>
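Arrow serializes a category column as a dictionary-encoded array, which from_arrow_type cannot map to a Spark type, hence the error above. Until that is handled, one pandas-side workaround (a sketch, not the eventual Spark fix; the helper name decategorize is hypothetical) is to cast category columns back to their underlying value dtype before calling createDataFrame, matching the non-Arrow behavior of converting each element to its category value:

```python
import pandas as pd

def decategorize(pdf: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with category columns cast back to their values' dtype."""
    out = pdf.copy()
    for col in out.columns:
        if isinstance(out[col].dtype, pd.CategoricalDtype):
            # Casting to the categories' dtype materializes the values,
            # dropping the dictionary encoding that trips Arrow conversion.
            out[col] = out[col].astype(out[col].cat.categories.dtype)
    return out

pdf = pd.DataFrame({"A": [u"a", u"b", u"c", u"a"]})
pdf["B"] = pdf["A"].astype('category')
converted = decategorize(pdf)
print(converted.dtypes)  # both A and B are now plain object columns
```

With this, spark.createDataFrame(decategorize(pdf)) would succeed with arrow.enabled set to True, at the cost of losing the categorical encoding on the driver side.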
Attachments
Issue Links
- causes
  - SPARK-31963 Support both pandas 0.23 and 1.0 (Resolved)
- relates to
  - SPARK-31964 Avoid Pandas import for CategoricalDtype with Arrow conversion (Resolved)
- links to