  SPARK-28881

toPandas with Arrow should not return a DataFrame when the result size exceeds `spark.driver.maxResultSize`


Details

    • Type: Test
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0, 2.4.1, 2.4.2, 2.4.3
    • Fix Version/s: 2.4.4, 3.0.0
    • Component/s: PySpark, SQL
    • Labels: None

    Description

      ./bin/pyspark --conf spark.driver.maxResultSize=1m
      spark.conf.set("spark.sql.execution.arrow.enabled", True)
      spark.range(10000000).toPandas()
      

      The code above returns an empty DataFrame in Spark 2.4, but it should throw an exception such as the one below:

      py4j.protocol.Py4JJavaError: An error occurred while calling o31.collectToPython.
      : org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 1 tasks (3.0 MB) is bigger than spark.driver.maxResultSize (1024.0 KB)
      

      This is a regression between Spark 2.3 and Spark 2.4.
      This JIRA aims to add a regression test.
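
      A minimal sketch of what such a regression test could look like follows. The session options (including disabling the Arrow fallback so the failure surfaces instead of being retried without Arrow) and the broad exception match are assumptions, not the actual test added to PySpark:

      import unittest

      from pyspark.sql import SparkSession


      class ArrowMaxResultSizeRegressionTest(unittest.TestCase):
          @classmethod
          def setUpClass(cls):
              # spark.driver.maxResultSize must be set before the driver starts,
              # so it is configured on the builder rather than at runtime.
              cls.spark = (
                  SparkSession.builder
                  .master("local[2]")
                  .config("spark.driver.maxResultSize", "1m")
                  .config("spark.sql.execution.arrow.enabled", "true")
                  # Assumption: disable the silent fallback to the non-Arrow
                  # path so the Arrow failure is not masked by a retry.
                  .config("spark.sql.execution.arrow.fallback.enabled", "false")
                  .getOrCreate()
              )

          @classmethod
          def tearDownClass(cls):
              cls.spark.stop()

          def test_to_pandas_exceeding_max_result_size(self):
              # The collected Arrow batches exceed 1m, so toPandas() should
              # fail loudly rather than return an empty or truncated DataFrame.
              with self.assertRaisesRegex(Exception, "is bigger than"):
                  self.spark.range(10000000).toPandas()


      if __name__ == "__main__":
          unittest.main()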

      In Spark 2.4:

      ./bin/pyspark --conf spark.driver.maxResultSize=1m
      spark.conf.set("spark.sql.execution.arrow.enabled", True)
      spark.range(10000000).toPandas()
      
      Empty DataFrame
      Columns: [id]
      Index: []
      

      It can also return partial results:

      ./bin/pyspark --conf spark.driver.maxResultSize=1m
      spark.conf.set("spark.sql.execution.arrow.enabled", True)
      spark.range(0, 330000, 1, 100).toPandas()
      
      ...
      75897  75897
      75898  75898
      75899  75899
      
      [75900 rows x 1 columns]
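
      Until a release with the fix is in use, a caller can at least detect the silent truncation. The check below is not part of this issue or its fix; it is only a hedged, user-side guard that compares the collected row count against Spark's own count:

      # Hedged user-side guard on affected 2.4.x releases: toPandas() with
      # Arrow can silently drop rows, so compare the number of collected
      # rows against df.count() to detect truncation.
      df = spark.range(0, 330000, 1, 100)
      pdf = df.toPandas()

      expected = df.count()
      if len(pdf) != expected:
          raise RuntimeError(
              "toPandas() returned %d of %d rows; the Arrow collect likely "
              "exceeded spark.driver.maxResultSize" % (len(pdf), expected))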
      

            People

              Assignee: Hyukjin Kwon (gurwls223)
              Reporter: Hyukjin Kwon (gurwls223)
