Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-22216 Improving PySpark/Pandas interoperability
  3. SPARK-23380

Adds a conf for Arrow fallback in toPandas/createDataFrame with Pandas DataFrame

    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.3.0
    • 2.4.0
    • PySpark
    • None

    Description

      Seems we can check the schema ahead and fall back in toPandas.

      Please see this case below:

      df = spark.createDataFrame([[{'a': 1}]])
      
      spark.conf.set("spark.sql.execution.arrow.enabled", "false")
      df.toPandas()
      spark.conf.set("spark.sql.execution.arrow.enabled", "true")
      df.toPandas()
      
      ...
      py4j.protocol.Py4JJavaError: An error occurred while calling o42.collectAsArrowToPython.
      ...
      java.lang.UnsupportedOperationException: Unsupported data type: map<string,bigint>
      

      In case of createDataFrame, we fall back to make this at least working even though the optimisation is disabled.

      df = spark.createDataFrame([[{'a': 1}]])
      spark.conf.set("spark.sql.execution.arrow.enabled", "false")
      pdf = df.toPandas()
      spark.createDataFrame(pdf).show()
      spark.conf.set("spark.sql.execution.arrow.enabled", "true")
      spark.createDataFrame(pdf).show()
      
      ...
      ... UserWarning: Arrow will not be used in createDataFrame: Error inferring Arrow type ...
      +--------+
      |      _1|
      +--------+
      |[a -> 1]|
      +--------+
      

      We need to match the behaviours and add a configuration to control the behaviour.

      Attachments

        Issue Links

          Activity

            People

              gurwls223 Hyukjin Kwon
              gurwls223 Hyukjin Kwon
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: