SPARK-28699

Caching an indeterminate RDD could lead to incorrect results when a stage is rerun


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.3.3, 2.4.3, 3.0.0
    • Fix Version/s: 2.3.4, 2.4.4, 3.0.0
    • Component/s: Spark Core

      Description

      This is another case of incorrect results caused by recomputing an indeterminate stage/RDD when a stage rerun happens: if some cached partitions of the indeterminate RDD are lost and must be recomputed, the recomputed data may not match the partitions that remain cached, so downstream stages consume an inconsistent mix of old and new data.

      This can be reproduced with the following code; thanks to Tyson for reporting it!

      import scala.sys.process._
      import org.apache.spark.TaskContext

      val res = spark.range(0, 10000 * 10000, 1).map { x => (x % 1000, x) }

      // Kill an executor during the stage that performs repartition(239): the
      // first attempt of the first task in the first stage attempt shells out
      // to pkill, which kills the newest java process (an executor) and
      // thereby triggers a stage rerun.
      val df = res.repartition(113).cache.repartition(239).map { x =>
        if (TaskContext.get.attemptNumber == 0 &&
            TaskContext.get.partitionId < 1 &&
            TaskContext.get.stageAttemptNumber == 0) {
          throw new Exception("pkill -f -n java".!!)
        }
        x
      }

      val r2 = df.distinct.count()

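      As a sanity check (a sketch, not part of the original report): every x in the range maps to a unique (x % 1000, x) pair, so on a correct run the distinct count equals the full range size, 10000 * 10000 = 100,000,000. An affected run silently returns a different count.

      // Illustrative check, assuming the snippet above ran in the same session:
      // x is unique per row, so all (x % 1000, x) pairs are distinct.
      val expected = 10000L * 10000  // 100,000,000
      assert(r2 == expected, s"incorrect result after stage rerun: got $r2, expected $expected")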

      People

      • Assignee: Yuanjian Li (XuanYuan)
      • Reporter: Yuanjian Li (XuanYuan)
      • Votes: 0
      • Watchers: 8
