Description
./bin/pyspark --conf spark.driver.maxResultSize=1m
spark.conf.set("spark.sql.execution.arrow.enabled", True)
spark.range(10000000).toPandas()
The code above returns an empty DataFrame in Spark 2.4, but it should throw an exception such as:
py4j.protocol.Py4JJavaError: An error occurred while calling o31.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 1 tasks (3.0 MB) is bigger than spark.driver.maxResultSize (1024.0 KB)
This is a regression between Spark 2.3 and 2.4.
This JIRA adds a regression test.
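A minimal sketch of what such a regression test can look like. This is not the actual PySpark test: `collect_with_limit` is a hypothetical stand-in for `toPandas()` running under a `spark.driver.maxResultSize` cap, and the point is only that the test must assert the size error surfaces rather than being swallowed into an empty result.

```python
import unittest


def collect_with_limit(serialized_sizes, max_result_size):
    """Hypothetical stand-in for collecting task results under a
    driver-side size cap: fail loudly once the running total of
    serialized bytes exceeds max_result_size."""
    total, collected = 0, []
    for size in serialized_sizes:
        total += size
        if total > max_result_size:
            raise RuntimeError(
                "Total size of serialized results (%d bytes) is bigger than "
                "spark.driver.maxResultSize (%d bytes)" % (total, max_result_size)
            )
        collected.append(size)
    return collected


class MaxResultSizeRegressionTest(unittest.TestCase):
    def test_error_is_not_swallowed(self):
        # The regression: instead of an empty/partial result, exceeding
        # the cap must raise, and the message must mention maxResultSize.
        with self.assertRaisesRegex(RuntimeError, "maxResultSize"):
            collect_with_limit([3 * 1024 ** 2], max_result_size=1024 ** 2)

    def test_small_results_still_collect(self):
        self.assertEqual(collect_with_limit([100, 200], 1024), [100, 200])
```

The real test would run `spark.range(...).toPandas()` with `spark.driver.maxResultSize` set low and assert the `SparkException` propagates to Python.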
In Spark 2.4:
./bin/pyspark --conf spark.driver.maxResultSize=1m
spark.conf.set("spark.sql.execution.arrow.enabled", True)
spark.range(10000000).toPandas()
Empty DataFrame
Columns: [id]
Index: []
It can also return partial results:
./bin/pyspark --conf spark.driver.maxResultSize=1m
spark.conf.set("spark.sql.execution.arrow.enabled", True)
spark.range(0, 330000, 1, 100).toPandas()
... 75897 75897 75898 75898 75899 75899 [75900 rows x 1 columns]
Issue Links
- is duplicated by: SPARK-27039 "toPandas with Arrow swallows maxResultSize errors" (Resolved)
- relates to: SPARK-27992 "PySpark socket server should sync with JVM connection thread future" (Resolved)