Spark / SPARK-26689

Single broken disk causing broadcast failure


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.1.0, 2.4.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Environment: Spark on Yarn, Multiple Disk

Description

      We encountered an application failure in our production cluster caused by a bad disk: a single broken disk was enough to make the whole application fail.

      Job aborted due to stage failure: Task serialization failed: java.io.IOException: Failed to create local dir in /home/work/hdd5/yarn/c3prc-hadoop/nodemanager/usercache/h_user_profile/appcache/application_1463372393999_144979/blockmgr-1f96b724-3e16-4c09-8601-1a2e3b758185/3b.
      org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:73)
      org.apache.spark.storage.DiskStore.contains(DiskStore.scala:173)
      org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$getCurrentBlockStatus(BlockManager.scala:391)
      org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:801)
      org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:629)
      org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:987)
      org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
      org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:85)
      org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
      org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
      org.apache.spark.SparkContext.broadcast(SparkContext.scala:1332)
      org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:863)
      org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1090)
      org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
      org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
      scala.Option.foreach(Option.scala:236)
      org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1086)
      org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1085)
      scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
      org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1085)
      org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1528)
      org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1493)
      org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1482)
      org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
      

      We have multiple disks on our cluster nodes; however, the application still fails. I think this is because Spark currently does not handle bad disks in `DiskBlockManager`: `getFile` maps every block to exactly one local directory, as sketched below.
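      For reference, `DiskBlockManager.getFile` hashes a block's filename to exactly one local directory and one sub-directory, with no fallback if the disk behind that directory is dead. The following is a simplified, self-contained sketch of that behavior; the directory paths and the hash helper are illustrative stand-ins, not the actual Spark source:

```scala
import java.io.{File, IOException}

// Simplified sketch of DiskBlockManager's deterministic block placement.
// `localDirs` stands in for the directories derived from spark.local.dir;
// 64 mirrors the default number of sub-directories per local dir.
object PlacementSketch {
  val localDirs: Array[File] = Array(
    new File("/home/work/hdd1/blockmgr"),
    new File("/home/work/hdd2/blockmgr"),
    new File("/home/work/hdd5/blockmgr"))
  val subDirsPerLocalDir = 64

  def getFile(filename: String): File = {
    // The filename hashes to exactly one (localDir, subDir) pair.
    val hash = filename.hashCode & Integer.MAX_VALUE
    val dirId = hash % localDirs.length
    val subDirId = (hash / localDirs.length) % subDirsPerLocalDir
    val subDir = new File(localDirs(dirId), "%02x".format(subDirId))
    // If the disk behind localDirs(dirId) is broken, mkdirs() fails here and
    // the IOException propagates; the remaining healthy disks are never tried.
    if (!subDir.exists() && !subDir.mkdirs()) {
      throw new IOException(s"Failed to create local dir in $subDir.")
    }
    new File(subDir, filename)
  }
}
```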

      In a multiple-disk environment, Spark could instead tolerate a bad disk and keep the application alive; one possible approach is sketched below.
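      A conceivable mitigation (a hedged sketch, not an actual Spark patch) is to probe the hashed directory first, fall back to the remaining local directories when it is unusable, and remember which directories have gone bad:

```scala
import java.io.{File, IOException}
import scala.collection.mutable

// Hypothetical fallback placement: try the hashed directory first, then the
// remaining ones, blacklisting any directory whose disk turns out to be bad.
class FallbackPlacement(localDirs: Array[File], subDirsPerLocalDir: Int = 64) {
  private val badDirs = mutable.Set.empty[Int]

  def getFile(filename: String): File = {
    val hash = filename.hashCode & Integer.MAX_VALUE
    val subDirId = (hash / localDirs.length) % subDirsPerLocalDir
    val preferred = hash % localDirs.length
    // Preferred dir first, then the others, skipping known-bad ones.
    val candidates = preferred +: localDirs.indices.filterNot(_ == preferred)
    for (dirId <- candidates if !badDirs.contains(dirId)) {
      val subDir = new File(localDirs(dirId), "%02x".format(subDirId))
      if (subDir.exists() || subDir.mkdirs()) {
        return new File(subDir, filename)
      }
      badDirs += dirId // mkdirs() failed: treat this disk as broken
    }
    throw new IOException(s"No healthy local dir available for $filename.")
  }
}
```

      The catch is that `getFile` must stay deterministic: later lookups such as `DiskStore.contains` in the trace above recompute a block's location rather than remembering it, so a real fix would also need to record where a block actually landed (or re-probe in the same order on read). That is why this is only a sketch.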


People

    Assignee: Unassigned
    Reporter: liupengcheng
    Votes: 0
    Watchers: 3
