Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 2.0.0, 2.4.4
- None
>> uname -a
Linux XXXXXXXXXXXXXXXX 4.14.104-95.84.amzn2.x86_64 #1 SMP Sat Mar 2 00:40:20 UTC 2019 x86_64 GNU/Linux
>> python
Python 3.6.7 | packaged by conda-forge | (default, Jul 2 2019, 02:18:42)
[GCC 7.3.0] on linux
>> conda list
...
openjdk 8.0.192 h1de35cc_1003 conda-forge
pandas 0.25.1 py36h86efe34_0 conda-forge
py4j 0.10.7 py_1 conda-forge
pyspark 2.4.4 py_0 conda-forge
....
Description
When calling toPandas on an empty DataFrame, all dtypes are set to `object`.
from datetime import datetime

spark_df = spark.createDataFrame(
    [(10, "Emy", datetime.today()), (11, "Bob", datetime.today())],
    ["age", "name", "date"],
)
spark.createDataFrame(spark.sparkContext.emptyRDD(), schema=spark_df.schema).toPandas().dtypes
Result:
age     object
name    object
date    object
dtype: object
The correct dtypes are returned when converting a non-empty DataFrame (or at least one with a single row) to pandas:
from datetime import datetime

spark_df = spark.createDataFrame(
    [(10, "Emy", datetime.today()), (11, "Bob", datetime.today())],
    ["age", "name", "date"],
)
spark_df.limit(1).toPandas().dtypes
Result:
age              int64
name            object
date    datetime64[ns]
dtype: object
Is this intended? Why does toPandas not rely on the Spark DataFrame schema?
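Until this is fixed, one pragmatic workaround is to re-cast the empty pandas result to the dtypes implied by the Spark schema. The sketch below uses plain pandas to illustrate the cast step; the `SPARK_TO_PANDAS` mapping, the `schema` dict, and the column names are illustrative assumptions standing in for `spark_df.schema`, not part of any Spark API.

```python
import pandas as pd

# Hypothetical mapping from Spark SQL type names to pandas dtypes
# (illustrative; only the types appearing in this report are covered).
SPARK_TO_PANDAS = {
    "LongType": "int64",
    "StringType": "object",
    "TimestampType": "datetime64[ns]",
}

# Stand-in for what toPandas() currently returns on an empty DataFrame:
# every column comes back as object.
empty = pd.DataFrame({"age": [], "name": [], "date": []}, dtype="object")

# Stand-in for the type information available in spark_df.schema.
schema = {"age": "LongType", "name": "StringType", "date": "TimestampType"}

# Re-cast each column to the dtype implied by the Spark schema.
fixed = empty.astype({col: SPARK_TO_PANDAS[t] for col, t in schema.items()})

print(fixed.dtypes)
```

Since the cast runs on an empty frame, it is cheap and cannot fail on bad values; the same idea generalizes by walking `spark_df.schema.fields` instead of a hand-written dict.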
Issue Links
- is related to: SPARK-30537 "toPandas gets wrong dtypes when applied on empty DF when Arrow enabled" (Resolved)