SPARK-36782: Deadlock between map-output-dispatcher and dispatcher-BlockManagerMaster upon migrating shuffle blocks


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0
    • Fix Version/s: 3.2.0, 3.1.3
    • Component/s: Block Manager
    • Labels: None

    Description

      I can observe a deadlock on the driver that can be triggered rather reliably in a job with a large number of tasks when using

      spark.decommission.enabled: true
      spark.storage.decommission.rddBlocks.enabled: true
      spark.storage.decommission.shuffleBlocks.enabled: true
      spark.storage.decommission.enabled: true
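
      For reference, a minimal sketch of enabling the same settings programmatically (assuming the job builds its own SparkSession; the report does not say how the configuration was supplied, and the app name below is made up):

      import org.apache.spark.sql.SparkSession

      // Decommissioning plus RDD and shuffle block migration, as listed above.
      val spark = SparkSession.builder()
        .appName("decommission-deadlock-repro") // hypothetical app name
        .config("spark.decommission.enabled", "true")
        .config("spark.storage.decommission.enabled", "true")
        .config("spark.storage.decommission.rddBlocks.enabled", "true")
        .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
        .getOrCreate()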

       

      It originates in the dispatcher-BlockManagerMaster making a call to updateBlockInfo when shuffle blocks are migrated. This is not performed by a thread from the pool but by the dispatcher-BlockManagerMaster thread itself. I suppose this was done under the assumption that it would be very fast. However, if the block being updated is a shuffle index block, it calls

      mapOutputTracker.updateMapOutput(shuffleId, mapId, blockManagerId)

      and waits there to acquire a write lock inside the MapOutputTracker.

      If the timing is bad, one of the map-output-dispatcher threads is holding this lock, e.g. as part of serializedMapStatus. That function calls MapOutputTracker.serializeOutputStatuses, and as part of that we do

      if (arrSize >= minBroadcastSize) {
        // Use broadcast instead.
        // Important arr(0) is the tag == DIRECT, ignore that while deserializing !
        // arr is a nested Array so that it can handle over 2GB serialized data
        val arr = chunkedByteBuf.getChunks().map(_.array())
        val bcast = broadcastManager.newBroadcast(arr, isLocal)

      which makes an RPC call to dispatcher-BlockManagerMaster. That endpoint, however, is unable to answer because it is blocked waiting on the aforementioned lock. Hence the deadlock. The ingredients of this deadlock are therefore: a serialized map-status array large enough to take the broadcast path, and an updateBlockInfo call arriving at the wrong moment, which happens regularly during decommissioning. Versions earlier than 3.1.0 are potentially affected as well, but I could not conclusively verify that.
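
      To make the interaction easier to follow, here is a self-contained sketch of the pattern described above using only JDK primitives. All names are hypothetical stand-ins for dispatcher-BlockManagerMaster, the MapOutputTracker etc., not actual Spark code; running it simply hangs, mirroring the observed deadlock. (In the real code the broadcast path is only taken once the serialized statuses exceed a minimum size threshold, spark.shuffle.mapOutput.minSizeForBroadcast if I recall correctly, which is why a larger number of tasks is needed.)

      import java.util.concurrent.Executors
      import java.util.concurrent.locks.ReentrantReadWriteLock

      object DeadlockSketch {

        // Stands in for the lock guarding the shuffle output statuses on the driver.
        private val lock = new ReentrantReadWriteLock()

        // Stands in for dispatcher-BlockManagerMaster: a single thread that has to
        // process every incoming message itself.
        private val dispatcher = Executors.newSingleThreadExecutor()

        // Models updateMapOutput(shuffleId, mapId, blockManagerId): needs the write lock.
        private def handleUpdateBlockInfo(): Unit = {
          lock.writeLock().lock() // blocks: the map-output-dispatcher thread holds the lock
          try { /* record the migrated shuffle index block */ }
          finally lock.writeLock().unlock()
        }

        // Models serializedMapStatus -> serializeOutputStatuses -> newBroadcast:
        // holds the lock while waiting for an answer from the dispatcher.
        private def serializedMapStatus(): Unit = {
          lock.writeLock().lock()
          try {
            Thread.sleep(200) // "serializing"; meanwhile the UpdateBlockInfo message arrives
            // newBroadcast needs a reply from the (already blocked) dispatcher,
            // so this get() never returns.
            dispatcher.submit(new Runnable { override def run(): Unit = () }).get()
          } finally lock.writeLock().unlock()
        }

        def main(args: Array[String]): Unit = {
          // The map-output-dispatcher thread takes the write lock first.
          new Thread(new Runnable { override def run(): Unit = serializedMapStatus() },
            "map-output-dispatcher-0").start()
          Thread.sleep(100)

          // A decommissioning executor reports a migrated shuffle index block; the
          // dispatcher handles it inline and blocks on the write lock. Neither thread
          // can make progress from here on, and the program hangs.
          dispatcher.submit(new Runnable { override def run(): Unit = handleUpdateBlockInfo() })
        }
      }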

      I have a stacktrace of all driver threads showing the deadlock: spark_stacktrace_deadlock.txt

      A coworker of mine also wrote a patch that reproduces the issue as a test case: 0001-Add-test-showing-that-decommission-might-deadlock.patch

    Attachments

        • spark_stacktrace_deadlock.txt
        • 0001-Add-test-showing-that-decommission-might-deadlock.patch


          People

            Assignee: Fabian Thiele (fthiele)
            Reporter: Fabian Thiele (fthiele)
            Votes: 0
            Watchers: 6
