Spark / SPARK-29188

toPandas gets wrong dtypes when applied on empty DF

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0, 2.4.4
    • Fix Version/s: 3.0.0
    • Component/s: PySpark, SQL
    • Labels: None

    Description

      When toPandas is called on an empty DataFrame, every column's dtype is set to `object`, regardless of the Spark schema.

      # Assumes an active SparkSession named `spark`.
      from datetime import datetime

      spark_df = spark.createDataFrame([(10, "Emy", datetime.today()), (11, "Bob", datetime.today())], ["age", "name", "date"])

      spark.createDataFrame(spark.sparkContext.emptyRDD(), schema=spark_df.schema).toPandas().dtypes
      

      Result: 

      age     object
      name    object
      date    object
      dtype: object
      

       

      By contrast, the dtypes are correct when the entire DataFrame (or at least one row of it) is converted to pandas:

      spark_df = spark.createDataFrame([(10, "Emy", datetime.today()), (11, "Bob", datetime.today())], ["age", "name", "date"])

      spark_df.limit(1).toPandas().dtypes
      

       Result:

      age              int64
      name            object
      date    datetime64[ns]
      dtype: object
      

       

      Is this intended? Why does toPandas not rely on the Spark DataFrame schema?
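      To illustrate what relying on the schema could look like, here is a minimal workaround sketch. It is not the fix that was applied in Spark. The helper names `pandas_dtypes_for` and `empty_pandas_frame` and the mapping table are hypothetical, and the table only covers a few common types as an assumption; anything else falls back to `object`. The input is the list of (column, type name) pairs that `spark_df.dtypes` returns.

```python
# Hypothetical helpers, not part of PySpark. The mapping below is an
# assumption covering a few common Spark SQL type names (as reported by
# DataFrame.dtypes); it is not the table Spark itself uses.
SPARK_TO_PANDAS_DTYPE = {
    "tinyint": "int8",
    "smallint": "int16",
    "int": "int32",
    "bigint": "int64",
    "float": "float32",
    "double": "float64",
    "boolean": "bool",
    "timestamp": "datetime64[ns]",
}


def pandas_dtypes_for(spark_dtypes):
    """Return {column: pandas dtype string} for (name, spark_type) pairs.

    Spark types without an entry in the table fall back to "object".
    """
    return {
        name: SPARK_TO_PANDAS_DTYPE.get(spark_type, "object")
        for name, spark_type in spark_dtypes
    }


def empty_pandas_frame(spark_dtypes):
    """Build an empty pandas DataFrame whose dtypes follow the Spark schema."""
    import pandas as pd  # imported lazily so the mapping works without pandas

    dtypes = pandas_dtypes_for(spark_dtypes)
    return pd.DataFrame(
        {name: pd.Series(dtype=dtype) for name, dtype in dtypes.items()}
    )


# Same columns as in the report; spark_df.dtypes would yield these pairs.
example = [("age", "bigint"), ("name", "string"), ("date", "timestamp")]
print(pandas_dtypes_for(example))
```

      With a live session, `empty_pandas_frame(spark_df.dtypes)` could stand in for `toPandas()` when the Spark DataFrame is known to be empty, so downstream pandas code sees the same dtypes as in the non-empty case.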

            People

              Assignee: dlindelof David Lindelöf
              Reporter: radcheb Radhwane Chebaane