Spark / SPARK-32553

Spark application failed due to stage fetch failure without retry


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Abandoned
    • Affects Version/s: 2.3.4, 3.0.0
    • Fix Version/s: None
    • Component/s: Spark Core

    Description

      We got an exception when running a Spark application under Spark 2.3.4 and Spark 3.0 with spark.shuffle.useOldFetchProtocol=true: the application failed due to a stage fetch failure, and the failed stage was not retried.
      The code looks like the following:

      import org.apache.spark.{SparkConf, SparkContext}

      val Array(input) = args

      val sparkConf = new SparkConf().setAppName("Spark Fetch Failed Test")
      // reproduce with the old fetch protocol, as described above
      sparkConf.set("spark.shuffle.useOldFetchProtocol", "true")
      // for running directly in IDE
      sparkConf.setIfMissing("spark.master", "local[2]")
      val sc = new SparkContext(sparkConf)

      val lines = sc.textFile(input)
        .repartition(1)           // first shuffle
        .map(data => data.trim)
        .repartition(1)           // second shuffle; its input order comes from the first shuffle

      val doc = lines.map(data => (data, 1)).reduceByKey(_ + _).collect()

      The application DAG looks like the following:
       

      If stage 3 fails due to a fetch failure, the application will not retry stage 2 and stage 3 and will fail the job, because Spark considers stage 2 and stage 3 non-retryable: the RDDs in stage 2 and stage 3 are INDETERMINATE.
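
      The reason Spark treats these RDDs as INDETERMINATE appears to be the way repartition() distributes records: it round-robins them to target partitions starting from a position derived from the input partition index, so the partition a record lands on depends on the order in which records arrive. When the input itself comes from a shuffle (whose fetch order is not fixed), a retried map task can produce different shuffle output. Below is a simplified sketch of that distribution (paraphrased from coalesce with shuffle = true; the method name and signature are illustrative, not the exact Spark internals):

      import scala.util.Random
      import scala.util.hashing.byteswap32

      // Simplified sketch: how repartition() assigns records to target partitions.
      // The starting position is pseudo-random per input partition, and each record
      // advances the position by one, so a record's target partition is a function of
      // its position in the input iterator, not of its value.
      def distributePartition[T](index: Int, items: Iterator[T], numPartitions: Int): Iterator[(Int, T)] = {
        var position = new Random(byteswap32(index)).nextInt(numPartitions)
        items.map { t =>
          position = position + 1
          (position % numPartitions, t)
        }
      }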
       
      Actually, if the shuffle output belonging to stage 1 exists completely, stage 2 and stage 3 are retryable, because the RDDs in them are not order-sensitive. However, if we allow stage 2 and stage 3 to retry, we run into trouble handling DAGScheduler.getMissingParentStages, and I am not sure whether DAGScheduler.getMissingParentStages breaks the rule that INDETERMINATE RDDs are non-retryable.
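
      To illustrate that order-insensitivity point, here is a hedged sketch (hypothetical, not from the original job) of the same computation rewritten so that each record's target partition is a pure function of its key rather than of its arrival order; a retried map task then writes the same shuffle output even if its input arrives in a different order, so the downstream stages remain retryable:

      import org.apache.spark.HashPartitioner

      // Hypothetical rewrite: replace the second round-robin repartition with an
      // explicit key-based partitioning, so the shuffle output does not depend on
      // the order of the incoming records.
      val counts = sc.textFile(input)
        .repartition(1)                       // first shuffle, as in the original repro
        .map(data => (data.trim, 1))          // key the records before the next shuffle
        .partitionBy(new HashPartitioner(1))  // deterministic: target partition = hash(key) % numPartitions
        .reduceByKey(_ + _)                   // reuses the existing partitioner, no extra shuffle
        .collect()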
       
      I would appreciate it if someone would reply.
       

      Attachments


          People

            Assignee: Unassigned
            Reporter: wangshengjie
            Votes: 0
            Watchers: 2
