Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 3.1.1, 3.2.1
- Fix Version/s: None
- Component/s: None
Description
Hi,
Using a non-deterministic function (like rand) as a repartition expression leads to incorrect results when tasks are retried.
Reproduce: (correct)
import scala.sys.process._
import org.apache.spark.TaskContext
import org.apache.spark.sql.functions.rand

val res = spark.range(0, 100 * 100, 1)
  .repartition(200)
  .map { x => x }
  .repartition(200)
  .map { x =>
    // Kill the executor JVM on the first attempt of the first two tasks
    // to force a task retry
    if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
      throw new Exception("pkill -f java".!!)
    }
    x
  }

res.distinct().count()
The correct result is 10000.
Reproduce: (bad)
import scala.sys.process._
import org.apache.spark.TaskContext
import org.apache.spark.sql.functions.rand

val res = spark.range(0, 100 * 100, 1)
  .repartition(200)
  .map { x => x }
  .repartition(10, Array(rand): _*) // partition by a non-deterministic expression
  .map { x =>
    // Kill the executor JVM on the first attempt of the first two tasks
    // to force a task retry
    if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
      throw new Exception("pkill -f java".!!)
    }
    x
  }

res.distinct().count()
The bad result is 9396: rows are silently lost (and others duplicated) because the retried tasks draw different random values and route rows to different partitions than the original attempt did.
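The loss mechanism can be illustrated without Spark. The following is a hypothetical plain-Scala simulation (not the actual Spark shuffle code): rows are assigned to partitions by a random draw, as repartition(rand) effectively does, and a "retry" of two partitions redraws the random values. Rows that the retry routes elsewhere are lost, while some others are counted twice. The seeds, partition counts, and helper names here are illustrative assumptions.

```scala
import scala.util.Random

object RandRepartitionSim {
  val rows = 0 until 10000
  val numPartitions = 10

  // Assign each row to a partition using a random value, mimicking
  // partitioning by a non-deterministic expression like rand.
  def assign(seed: Long): Map[Int, Seq[Int]] = {
    val rng = new Random(seed)
    rows.groupBy(_ => rng.nextInt(numPartitions))
  }

  def main(args: Array[String]): Unit = {
    val firstAttempt = assign(seed = 1L)
    // The retried attempt draws fresh random values, so the same rows
    // land in different partitions than before.
    val retry = assign(seed = 2L)

    // Keep partitions 2..9 from the first attempt, but take partitions
    // 0 and 1 from the retry, mimicking a partial task re-execution.
    val survivors =
      (2 until numPartitions).flatMap(p => firstAttempt.getOrElse(p, Nil)) ++
      (0 until 2).flatMap(p => retry.getOrElse(p, Nil))

    // Distinct count is almost surely below 10000: rows that were in
    // partitions 0-1 of the first attempt but routed to 2-9 on retry
    // are gone for good.
    println(survivors.distinct.size)
  }
}
```

With a deterministic partitioning expression, both attempts would route every row identically, so stitching together output from different attempts could not drop or duplicate rows.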