Spark / SPARK-26689

Single broken disk causing broadcast failure


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.1.0, 2.4.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Environment: Spark on Yarn, Multiple Disk

Description

      We encountered an application failure in our production cluster caused by a bad disk: a single broken disk was enough to make the whole application fail.

      Job aborted due to stage failure: Task serialization failed: java.io.IOException: Failed to create local dir in /home/work/hdd5/yarn/c3prc-hadoop/nodemanager/usercache/h_user_profile/appcache/application_1463372393999_144979/blockmgr-1f96b724-3e16-4c09-8601-1a2e3b758185/3b.
      org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:73)
      org.apache.spark.storage.DiskStore.contains(DiskStore.scala:173)
      org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$getCurrentBlockStatus(BlockManager.scala:391)
      org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:801)
      org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:629)
      org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:987)
      org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
      org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:85)
      org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
      org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
      org.apache.spark.SparkContext.broadcast(SparkContext.scala:1332)
      org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:863)
      org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1090)
      org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
      org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
      scala.Option.foreach(Option.scala:236)
      org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1086)
      org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1085)
      scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
      org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1085)
      org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1528)
      org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1493)
      org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1482)
      org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
      

      We have multiple disks on our cluster nodes; however, the application still fails. I think this is because Spark currently does not handle bad disks in `DiskBlockManager`: `getFile` maps every block to exactly one local directory, as sketched below.
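      For reference, `DiskBlockManager.getFile` hashes a block's filename to exactly one local directory and one sub-directory, with no fallback if the disk behind that directory is dead. The following is a simplified, self-contained sketch of that behavior; the directory paths and the hash helper are illustrative stand-ins, not the actual Spark source:

```scala
import java.io.{File, IOException}

// Simplified sketch of DiskBlockManager's deterministic block placement.
// `localDirs` stands in for the directories derived from spark.local.dir;
// 64 mirrors the default number of sub-directories per local dir.
object PlacementSketch {
  val localDirs: Array[File] = Array(
    new File("/home/work/hdd1/blockmgr"),
    new File("/home/work/hdd2/blockmgr"),
    new File("/home/work/hdd5/blockmgr"))
  val subDirsPerLocalDir = 64

  def getFile(filename: String): File = {
    // The filename hashes to exactly one (localDir, subDir) pair.
    val hash = filename.hashCode & Integer.MAX_VALUE
    val dirId = hash % localDirs.length
    val subDirId = (hash / localDirs.length) % subDirsPerLocalDir
    val subDir = new File(localDirs(dirId), "%02x".format(subDirId))
    // If the disk behind localDirs(dirId) is broken, mkdirs() fails here and
    // the IOException propagates; the remaining healthy disks are never tried.
    if (!subDir.exists() && !subDir.mkdirs()) {
      throw new IOException(s"Failed to create local dir in $subDir.")
    }
    new File(subDir, filename)
  }
}
```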

      In a multiple-disk environment, Spark could instead tolerate a bad disk and keep the application alive; one possible approach is sketched below.
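      A conceivable mitigation (a hedged sketch, not an actual Spark patch) is to probe the hashed directory first, fall back to the remaining local directories when it is unusable, and remember which directories have gone bad:

```scala
import java.io.{File, IOException}
import scala.collection.mutable

// Hypothetical fallback placement: try the hashed directory first, then the
// remaining ones, blacklisting any directory whose disk turns out to be bad.
class FallbackPlacement(localDirs: Array[File], subDirsPerLocalDir: Int = 64) {
  private val badDirs = mutable.Set.empty[Int]

  def getFile(filename: String): File = {
    val hash = filename.hashCode & Integer.MAX_VALUE
    val subDirId = (hash / localDirs.length) % subDirsPerLocalDir
    val preferred = hash % localDirs.length
    // Preferred dir first, then the others, skipping known-bad ones.
    val candidates = preferred +: localDirs.indices.filterNot(_ == preferred)
    for (dirId <- candidates if !badDirs.contains(dirId)) {
      val subDir = new File(localDirs(dirId), "%02x".format(subDirId))
      if (subDir.exists() || subDir.mkdirs()) {
        return new File(subDir, filename)
      }
      badDirs += dirId // mkdirs() failed: treat this disk as broken
    }
    throw new IOException(s"No healthy local dir available for $filename.")
  }
}
```

      The catch is that `getFile` must stay deterministic: later lookups such as `DiskStore.contains` in the trace above recompute a block's location rather than remembering it, so a real fix would also need to record where a block actually landed (or re-probe in the same order on read). That is why this is only a sketch.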


People

    Assignee: Unassigned
    Reporter: liupengcheng
    Votes: 0
    Watchers: 3
