SPARK-36782: Deadlock between map-output-dispatcher and dispatcher-BlockManagerMaster upon migrating shuffle blocks


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0
    • Fix Version/s: 3.2.0, 3.1.3
    • Component/s: Block Manager
    • Labels: None

    Description

      I can observe a deadlock on the driver that can be triggered rather reliably in a job with a large number of tasks when using

      spark.decommission.enabled: true
      spark.storage.decommission.rddBlocks.enabled: true
      spark.storage.decommission.shuffleBlocks.enabled: true
      spark.storage.decommission.enabled: true
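
      For reference, a minimal sketch of enabling the same settings programmatically (assuming the job builds its own SparkSession; the report does not say how the configuration was supplied, and the app name below is made up):

      import org.apache.spark.sql.SparkSession

      // Decommissioning plus RDD and shuffle block migration, as listed above.
      val spark = SparkSession.builder()
        .appName("decommission-deadlock-repro") // hypothetical app name
        .config("spark.decommission.enabled", "true")
        .config("spark.storage.decommission.enabled", "true")
        .config("spark.storage.decommission.rddBlocks.enabled", "true")
        .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
        .getOrCreate()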

       

      It originates in the dispatcher-BlockManagerMaster making a call to updateBlockInfo when shuffle blocks are migrated. This is not performed by a thread from the pool but by the dispatcher-BlockManagerMaster thread itself. I suppose this was done under the assumption that it would be very fast. However, if the block being updated is a shuffle index block, it calls

      mapOutputTracker.updateMapOutput(shuffleId, mapId, blockManagerId)

      and waits there to acquire a write lock inside the MapOutputTracker.

      If the timing is bad, one of the map-output-dispatcher threads is holding this lock, e.g. as part of serializedMapStatus. That function calls MapOutputTracker.serializeOutputStatuses, and as part of that we do

      if (arrSize >= minBroadcastSize) {
        // Use broadcast instead.
        // Important arr(0) is the tag == DIRECT, ignore that while deserializing !
        // arr is a nested Array so that it can handle over 2GB serialized data
        val arr = chunkedByteBuf.getChunks().map(_.array())
        val bcast = broadcastManager.newBroadcast(arr, isLocal)

      which makes an RPC call to dispatcher-BlockManagerMaster. That endpoint, however, is unable to answer because it is blocked waiting on the aforementioned lock. Hence the deadlock. The ingredients of this deadlock are therefore: a serialized map-status array large enough to take the broadcast path, and an updateBlockInfo call arriving at the wrong moment, which happens regularly during decommissioning. Versions earlier than 3.1.0 are potentially affected as well, but I could not conclusively verify that.
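
      To make the interaction easier to follow, here is a self-contained sketch of the pattern described above using only JDK primitives. All names are hypothetical stand-ins for dispatcher-BlockManagerMaster, the MapOutputTracker etc., not actual Spark code; running it simply hangs, mirroring the observed deadlock. (In the real code the broadcast path is only taken once the serialized statuses exceed a minimum size threshold, spark.shuffle.mapOutput.minSizeForBroadcast if I recall correctly, which is why a larger number of tasks is needed.)

      import java.util.concurrent.Executors
      import java.util.concurrent.locks.ReentrantReadWriteLock

      object DeadlockSketch {

        // Stands in for the lock guarding the shuffle output statuses on the driver.
        private val lock = new ReentrantReadWriteLock()

        // Stands in for dispatcher-BlockManagerMaster: a single thread that has to
        // process every incoming message itself.
        private val dispatcher = Executors.newSingleThreadExecutor()

        // Models updateMapOutput(shuffleId, mapId, blockManagerId): needs the write lock.
        private def handleUpdateBlockInfo(): Unit = {
          lock.writeLock().lock() // blocks: the map-output-dispatcher thread holds the lock
          try { /* record the migrated shuffle index block */ }
          finally lock.writeLock().unlock()
        }

        // Models serializedMapStatus -> serializeOutputStatuses -> newBroadcast:
        // holds the lock while waiting for an answer from the dispatcher.
        private def serializedMapStatus(): Unit = {
          lock.writeLock().lock()
          try {
            Thread.sleep(200) // "serializing"; meanwhile the UpdateBlockInfo message arrives
            // newBroadcast needs a reply from the (already blocked) dispatcher,
            // so this get() never returns.
            dispatcher.submit(new Runnable { override def run(): Unit = () }).get()
          } finally lock.writeLock().unlock()
        }

        def main(args: Array[String]): Unit = {
          // The map-output-dispatcher thread takes the write lock first.
          new Thread(new Runnable { override def run(): Unit = serializedMapStatus() },
            "map-output-dispatcher-0").start()
          Thread.sleep(100)

          // A decommissioning executor reports a migrated shuffle index block; the
          // dispatcher handles it inline and blocks on the write lock. Neither thread
          // can make progress from here on, and the program hangs.
          dispatcher.submit(new Runnable { override def run(): Unit = handleUpdateBlockInfo() })
        }
      }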

      I have a stacktrace of all driver threads showing the deadlock: spark_stacktrace_deadlock.txt

      A coworker of mine also wrote a patch that reproduces the issue as a test case: 0001-Add-test-showing-that-decommission-might-deadlock.patch

    Attachments

        • spark_stacktrace_deadlock.txt
        • 0001-Add-test-showing-that-decommission-might-deadlock.patch


          People

            Assignee: Fabian Thiele (fthiele)
            Reporter: Fabian Thiele (fthiele)
            Votes: 0
            Watchers: 6
