  SPARK-28881

toPandas with Arrow should not return a DataFrame when the result size exceeds `spark.driver.maxResultSize`


Details

    • Type: Test
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0, 2.4.1, 2.4.2, 2.4.3
    • Fix Version/s: 2.4.4, 3.0.0
    • Component/s: PySpark, SQL
    • Labels: None

    Description

      ./bin/pyspark --conf spark.driver.maxResultSize=1m
      spark.conf.set("spark.sql.execution.arrow.enabled", True)
      spark.range(10000000).toPandas()
      

      The code above returns an empty DataFrame in Spark 2.4, but it should throw an exception such as the one below:

      py4j.protocol.Py4JJavaError: An error occurred while calling o31.collectToPython.
      : org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 1 tasks (3.0 MB) is bigger than spark.driver.maxResultSize (1024.0 KB)
      

      This is a regression between Spark 2.3 and Spark 2.4.
      This JIRA aims to add a regression test.
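
      A minimal sketch of what such a regression test could look like follows. The session options (including disabling the Arrow fallback so the failure surfaces instead of being retried without Arrow) and the broad exception match are assumptions, not the actual test added to PySpark:

      import unittest

      from pyspark.sql import SparkSession


      class ArrowMaxResultSizeRegressionTest(unittest.TestCase):
          @classmethod
          def setUpClass(cls):
              # spark.driver.maxResultSize must be set before the driver starts,
              # so it is configured on the builder rather than at runtime.
              cls.spark = (
                  SparkSession.builder
                  .master("local[2]")
                  .config("spark.driver.maxResultSize", "1m")
                  .config("spark.sql.execution.arrow.enabled", "true")
                  # Assumption: disable the silent fallback to the non-Arrow
                  # path so the Arrow failure is not masked by a retry.
                  .config("spark.sql.execution.arrow.fallback.enabled", "false")
                  .getOrCreate()
              )

          @classmethod
          def tearDownClass(cls):
              cls.spark.stop()

          def test_to_pandas_exceeding_max_result_size(self):
              # The collected Arrow batches exceed 1m, so toPandas() should
              # fail loudly rather than return an empty or truncated DataFrame.
              with self.assertRaisesRegex(Exception, "is bigger than"):
                  self.spark.range(10000000).toPandas()


      if __name__ == "__main__":
          unittest.main()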

      In Spark 2.4:

      ./bin/pyspark --conf spark.driver.maxResultSize=1m
      spark.conf.set("spark.sql.execution.arrow.enabled", True)
      spark.range(10000000).toPandas()
      
      Empty DataFrame
      Columns: [id]
      Index: []
      

      It can also return partial results:

      ./bin/pyspark --conf spark.driver.maxResultSize=1m
      spark.conf.set("spark.sql.execution.arrow.enabled", True)
      spark.range(0, 330000, 1, 100).toPandas()
      
      ...
      75897  75897
      75898  75898
      75899  75899
      
      [75900 rows x 1 columns]
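
      Until a release with the fix is in use, a caller can at least detect the silent truncation. The check below is not part of this issue or its fix; it is only a hedged, user-side guard that compares the collected row count against Spark's own count:

      # Hedged user-side guard on affected 2.4.x releases: toPandas() with
      # Arrow can silently drop rows, so compare the number of collected
      # rows against df.count() to detect truncation.
      df = spark.range(0, 330000, 1, 100)
      pdf = df.toPandas()

      expected = df.count()
      if len(pdf) != expected:
          raise RuntimeError(
              "toPandas() returned %d of %d rows; the Arrow collect likely "
              "exceeded spark.driver.maxResultSize" % (len(pdf), expected))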
      

            People

              Assignee: Hyukjin Kwon (gurwls223)
              Reporter: Hyukjin Kwon (gurwls223)
