SPARK-28699

Caching an indeterminate RDD could lead to incorrect results when a stage is rerun


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.3.3, 2.4.3, 3.0.0
    • Fix Version/s: 2.3.4, 2.4.4, 3.0.0
    • Component/s: Spark Core

      Description

      This is another case of incorrect results caused by recomputing an indeterminate stage/RDD when a stage rerun happens: if some cached partitions of the indeterminate RDD are lost and must be recomputed, the recomputed data may not match the partitions that remain cached, so downstream stages consume an inconsistent mix of old and new data.

      This can be reproduced with the following code; thanks to Tyson for reporting it!

      import scala.sys.process._
      import org.apache.spark.TaskContext

      val res = spark.range(0, 10000 * 10000, 1).map { x => (x % 1000, x) }

      // Kill an executor during the stage that performs repartition(239): the
      // first attempt of the first task in the first stage attempt shells out
      // to pkill, which kills the newest java process (an executor) and
      // thereby triggers a stage rerun.
      val df = res.repartition(113).cache.repartition(239).map { x =>
        if (TaskContext.get.attemptNumber == 0 &&
            TaskContext.get.partitionId < 1 &&
            TaskContext.get.stageAttemptNumber == 0) {
          throw new Exception("pkill -f -n java".!!)
        }
        x
      }

      val r2 = df.distinct.count()

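      As a sanity check (a sketch, not part of the original report): every x in the range maps to a unique (x % 1000, x) pair, so on a correct run the distinct count equals the full range size, 10000 * 10000 = 100,000,000. An affected run silently returns a different count.

      // Illustrative check, assuming the snippet above ran in the same session:
      // x is unique per row, so all (x % 1000, x) pairs are distinct.
      val expected = 10000L * 10000  // 100,000,000
      assert(r2 == expected, s"incorrect result after stage rerun: got $r2, expected $expected")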

      People

      • Assignee: Yuanjian Li (XuanYuan)
      • Reporter: Yuanjian Li (XuanYuan)
      • Votes: 0
      • Watchers: 8
