[SPARK-29188] toPandas gets wrong dtypes when applied on empty DF - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.0, 2.4.4
Fix Version/s: 3.0.0
Component/s: PySpark, SQL
Labels:
None
Environment:

Hide

>> uname -a

Linux XXXXXXXXXXXXXXXX 4.14.104-95.84.amzn2.x86_64 #1 SMP Sat Mar 2 00:40:20 UTC 2019 x86_64 GNU/Linux

>> python

Python 3.6.7 | packaged by conda-forge | (default, Jul 2 2019, 02:18:42)
[GCC 7.3.0] on linux

>> conda list
...
openjdk 8.0.192 h1de35cc_1003 conda-forge
pandas 0.25.1 py36h86efe34_0 conda-forge
py4j 0.10.7 py_1 conda-forge
pyspark 2.4.4 py_0 conda-forge
....

Show
>> uname -a Linux XXXXXXXXXXXXXXXX 4.14.104-95.84.amzn2.x86_64 #1 SMP Sat Mar 2 00:40:20 UTC 2019 x86_64 GNU/Linux >> python Python 3.6.7 | packaged by conda-forge | (default, Jul 2 2019, 02:18:42) [GCC 7.3.0] on linux >> conda list ... openjdk 8.0.192 h1de35cc_1003 conda-forge pandas 0.25.1 py36h86efe34_0 conda-forge py4j 0.10.7 py_1 conda-forge pyspark 2.4.4 py_0 conda-forge ....

Description

When calling toPandas from an empty dataframe, all dtypes are set to `object`.

spark_df = spark.createDataFrame([(10, "Emy", datetime.today() ), (11, "Bob", datetime.today())], ["age", "name", "date"])

spark.createDataFrame(spark.sparkContext.emptyRDD(), schema=spark_df.schema).toPandas().dtypes

Result:

age     object
name    object
date    object
dtype: object

While it gets the correct types when converting the entire dataframe (or at least with 1 line of data) to pandas:

spark_df = spark.createDataFrame([(10, "Emy", datetime.today() ), (11, "Bob", datetime.today())], ["age", "name", "date"]) 

spark_df.limit(1).toPandas().dtypes

Result:

age              int64
name            object
date    datetime64[ns]
dtype: object

Is this intended ? Why toPandas does not rely on the Spark DataFrame Schema ?

Attachments

Issue Links

is related to

SPARK-30537 toPandas gets wrong dtypes when applied on empty DF when Arrow enabled

Resolved

links to

GitHub Pull Request #26747

GitHub Pull Request #27247

GitHub Pull Request #27250

Activity

People

Assignee:: David Lindelöf

Reporter:: Radhwane Chebaane

Votes:: 1 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 20/Sep/19 09:36

Updated:: 12/Dec/22 18:10

Resolved:: 12/Dec/19 11:51