SPARK-29998: A corrupted hard disk causes the task to execute repeatedly on a machine until the job fails


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      Recently, I ran into the following situation:

      One NodeManager's disk was broken. When a task began to run, it fetched the jobConf via broadcast; the executor's BlockManager failed to create a local file and threw an IOException (a simplified sketch of this code path follows the stack trace below).

      19/11/22 15:14:36 INFO org.apache.spark.scheduler.DAGScheduler: "ShuffleMapStage 342 (run at AccessController.java:0) failed in 0.400 s due to Job aborted due to stage failure: Task 21 in stage 343.0 failed 4 times, most recent failure: Lost task 21.3 in stage 343.0 (TID 34968, hostname, executor 104): java.io.IOException: Failed to create local dir in /disk11/yarn/local/usercache/username/appcache/application_1573542949548_2889852/blockmgr-a70777d8-5159-48e7-a47e-848df01a831e/3b.
              at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70)
              at org.apache.spark.storage.DiskStore.contains(DiskStore.scala:129)
              at org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:605)
              at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:214)
              at scala.Option.getOrElse(Option.scala:121)
              at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
              at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)
              at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
              at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
              at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
              at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
              at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
              at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:144)
              at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:228)
              at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
              at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
              at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
              at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
              at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
              at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
              at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
              at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
              at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
              at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
              at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
              at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
              at org.apache.spark.scheduler.Task.run(Task.scala:121)
              at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
              at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
              at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
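
      To make the failure mode concrete, here is a minimal, self-contained sketch modelled on DiskBlockManager.getFile in Spark 2.4 (not the actual source; the directory names are illustrative). The block file name hashes to a fixed local dir, so every retry of the broadcast read on this executor goes back to the same broken disk and hits the same IOException:

      import java.io.{File, IOException}

      object LocalDirSketch {
        // Illustrative stand-ins for the executor's configured local dirs;
        // /disk11 is the broken disk from the stack trace above.
        val localDirs: Array[File] = Array(
          new File("/disk11/yarn/local"),
          new File("/disk12/yarn/local"))
        val subDirsPerLocalDir = 64

        def getFile(filename: String): File = {
          val hash = math.abs(filename.hashCode)
          // The same block name always hashes to the same local dir, so a retry
          // on this executor keeps selecting the broken disk.
          val dirId = hash % localDirs.length
          val subDirId = (hash / localDirs.length) % subDirsPerLocalDir
          val subDir = new File(localDirs(dirId), "%02x".format(subDirId))
          if (!subDir.exists() && !subDir.mkdirs()) {
            // This is the exception seen in the stack trace above.
            throw new IOException(s"Failed to create local dir in $subDir.")
          }
          new File(subDir, filename)
        }

        def main(args: Array[String]): Unit =
          println(getFile("broadcast_342_piece0"))
      }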
      
      

      Looking at TaskSetManager.handleFailedTask(): for this kind of failure reason the task is simply retried, and the new attempt can be scheduled on the same executor with the broken disk, until the failure count exceeds the maximum allowed task failures (`spark.task.maxFailures`). At that point the stage fails, and with it the whole job.
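
      For illustration, here is a minimal, self-contained sketch (assumed behaviour, not the actual TaskSetManager source) of that failure accounting: each IOException only bumps the per-task failure counter, nothing excludes the broken node, and once the counter reaches spark.task.maxFailures (default 4) the stage, and with it the job, is aborted:

      import java.io.IOException
      import scala.collection.mutable

      object RetryAccountingSketch {
        val maxTaskFailures = 4                     // spark.task.maxFailures (default 4)
        val numFailures = mutable.Map[Int, Int]().withDefaultValue(0)

        def handleFailedTask(taskIndex: Int, reason: Throwable): Unit = {
          numFailures(taskIndex) += 1
          if (numFailures(taskIndex) >= maxTaskFailures) {
            // The TaskSet is aborted -> stage failure -> "Job aborted due to stage failure".
            throw new RuntimeException(
              s"Task $taskIndex failed $maxTaskFailures times, most recent failure: $reason")
          }
          // Otherwise the task is simply re-queued; nothing marks the broken disk,
          // so the next attempt can land on the same executor again.
        }

        def main(args: Array[String]): Unit =
          (1 to maxTaskFailures).foreach { attempt =>
            handleFailedTask(21, new IOException(s"Failed to create local dir (attempt $attempt)"))
          }
      }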

    People

      Assignee: Unassigned
      Reporter: angerszhu