Details
- Type: Sub-task
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 2.3.1
Description
Category types need to be handled when calling createDataFrame with Arrow enabled, or when a pandas_udf returns categorical data. Without Arrow, Spark converts each element to its category value. For example:
In [1]: import pandas as pd

In [2]: pdf = pd.DataFrame({"A": [u"a", u"b", u"c", u"a"]})

In [3]: pdf["B"] = pdf["A"].astype('category')

In [4]: pdf
Out[4]:
   A  B
0  a  a
1  b  b
2  c  c
3  a  a

In [5]: pdf.dtypes
Out[5]:
A      object
B    category
dtype: object

In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False)

In [8]: df = spark.createDataFrame(pdf)

In [9]: df.show()
+---+---+
|  A|  B|
+---+---+
|  a|  a|
|  b|  b|
|  c|  c|
|  a|  a|
+---+---+

In [10]: df.printSchema()
root
 |-- A: string (nullable = true)
 |-- B: string (nullable = true)

In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True)

In [19]: df = spark.createDataFrame(pdf)
   1667         spark_type = ArrayType(from_arrow_type(at.value_type))
   1668     else:
-> 1669         raise TypeError("Unsupported type in conversion from Arrow: " + str(at))
   1670     return spark_type
   1671

TypeError: Unsupported type in conversion from Arrow: dictionary<values=string, indices=int8, ordered=0>
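Arrow serializes a category column as a dictionary-encoded array, which from_arrow_type cannot map to a Spark type, hence the error above. Until that is handled, one pandas-side workaround (a sketch, not the eventual Spark fix; the helper name decategorize is hypothetical) is to cast category columns back to their underlying value dtype before calling createDataFrame, matching the non-Arrow behavior of converting each element to its category value:

```python
import pandas as pd

def decategorize(pdf: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with category columns cast back to their values' dtype."""
    out = pdf.copy()
    for col in out.columns:
        if isinstance(out[col].dtype, pd.CategoricalDtype):
            # Casting to the categories' dtype materializes the values,
            # dropping the dictionary encoding that trips Arrow conversion.
            out[col] = out[col].astype(out[col].cat.categories.dtype)
    return out

pdf = pd.DataFrame({"A": [u"a", u"b", u"c", u"a"]})
pdf["B"] = pdf["A"].astype('category')
converted = decategorize(pdf)
print(converted.dtypes)  # both A and B are now plain object columns
```

With this, spark.createDataFrame(decategorize(pdf)) would succeed with arrow.enabled set to True, at the cost of losing the categorical encoding on the driver side.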
Attachments
Issue Links
- causes
  - SPARK-31963 Support both pandas 0.23 and 1.0 (Resolved)
- relates to
  - SPARK-31964 Avoid Pandas import for CategoricalDtype with Arrow conversion (Resolved)
- links to