Description
When whole-stage codegen is enabled, SparkSession.range() behaves inconsistently in the face of Long overflow: its result differs from the result with codegen turned off, while the codegen-off behavior matches SparkContext.range().
The following Spark Shell session shows the inconsistency:
scala> sc.range
def range(start: Long, end: Long, step: Long, numSlices: Int): org.apache.spark.rdd.RDD[Long]

scala> spark.range
def range(start: Long, end: Long, step: Long, numPartitions: Int): org.apache.spark.sql.Dataset[Long]
def range(start: Long, end: Long, step: Long): org.apache.spark.sql.Dataset[Long]
def range(start: Long, end: Long): org.apache.spark.sql.Dataset[Long]
def range(end: Long): org.apache.spark.sql.Dataset[Long]

scala> sc.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect
res1: Array[Long] = Array()

scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect
res2: Array[Long] = Array(9223372036854775804, 9223372036854775805, 9223372036854775806)

scala> spark.conf.set("spark.sql.codegen.wholeStage", false)

scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect
res5: Array[Long] = Array()
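For context, the extra rows are consistent with unchecked 64-bit arithmetic (a hedged sketch of the likely cause, not the actual generated code): with step > 0 and end < start the range is logically empty, which is what sc.range and the non-codegen path return, but naive Long subtraction and addition wrap around silently.

```java
public class OverflowDemo {
    public static void main(String[] args) {
        long start = Long.MAX_VALUE - 3;
        long end   = Long.MIN_VALUE + 2;

        // step > 0 and end < start: the range should be empty.
        System.out.println(end < start);        // true

        // But unchecked subtraction wraps modulo 2^64, yielding a
        // small positive "element count" instead of a huge negative one:
        System.out.println(end - start);        // prints 6

        // And incrementing past Long.MAX_VALUE wraps to Long.MIN_VALUE:
        System.out.println(Long.MAX_VALUE + 1 == Long.MIN_VALUE);  // true
    }
}
```

This illustrates why a codegen loop that derives its bound from `end - start` without an overflow check could emit rows for a range that SparkContext.range() correctly treats as empty.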
Issue Links
- is related to SPARK-21044 Add `RemoveInvalidRange` optimizer (Closed)