Spark / SPARK-44019

Unable to deserialize broadcasted map statuses when executor decommissioned

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 3.1.0, 3.2.0, 3.3.0, 3.4.0
    • Fix Version/s: None
    • Component/s: Shuffle
    • Environment: Spark version 3.3.1, running on Kubernetes

    Description

      While testing graceful executor decommissioning under a high rate of interruptions, jobs occasionally abort with the following exception (which is not reproduced when decommissioning is disabled):

      org.apache.spark.shuffle.MetadataFetchFailedException: Unable to deserialize broadcasted map statuses for shuffle xx: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_yyy_pieceZ of broadcast_yyy

      These exceptions are raised by reducer tasks while multiple decommissions are in progress or have just finished. Because the failure recurs so frequently, it usually causes the entire stage to fail.

      This seems to be related to the broadcast invalidation in `updateMapOutput`, which destroys and recreates those broadcast variables on each shuffle migration (reference).

      Example sequence of events from driver:

      2023-06-03 18:41:17:798 INFO TorrentBroadcast:61 Destroying Broadcast(171) (from updateMapOutput at BlockManagerMasterEndpoint.scala:639)
      ...
      2023-06-03 18:41:17:811 INFO BlockManagerInfo:61 Removed broadcast_171_piece0 on xxx_host_xxx:7079 in memory (size: 4.0 MiB, free: 10.4 GiB)
      ...
      2023-06-03 18:41:17:841 WARN TaskSetManager:73 Lost task 6.0 in stage 124.2 (TID xxxxx) (x.x.x.x executor xxxx): FetchFailed(null, shuffleId=xx, mapIndex=-1, mapId=-1, reduceId=-1, message=
      org.apache.spark.shuffle.MetadataFetchFailedException: Unable to deserialize broadcasted map statuses for shuffle xx: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_171_piece0 of broadcast_171
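
      The suspected race can be illustrated with a small toy model. This is only a sketch of the hypothesis described above, not Spark code: `BroadcastStore`, `ToyMapOutputTracker`, and `reducer_fetch` are hypothetical names invented for illustration, and the real root cause may differ. The model assumes a reducer receives only a reference to the broadcast, and that the driver destroys that broadcast when a shuffle block migrates before the reducer has fetched it.

```python
# Toy model only: none of these classes are Spark APIs (all names are hypothetical).


class BroadcastStore:
    """Stands in for the block managers that hold broadcast pieces."""

    def __init__(self):
        self._pieces = {}
        self._next_id = 0

    def publish(self, payload):
        broadcast_id = self._next_id
        self._next_id += 1
        self._pieces[broadcast_id] = payload
        return broadcast_id

    def destroy(self, broadcast_id):
        self._pieces.pop(broadcast_id, None)

    def fetch(self, broadcast_id):
        if broadcast_id not in self._pieces:
            raise IOError(
                f"Failed to get broadcast_{broadcast_id}_piece0 "
                f"of broadcast_{broadcast_id}"
            )
        return self._pieces[broadcast_id]


class ToyMapOutputTracker:
    """Driver-side view: map statuses are published as one broadcast."""

    def __init__(self, store, statuses):
        self.store = store
        self.statuses = dict(statuses)
        self.broadcast_id = store.publish(dict(self.statuses))

    def serialized_reference(self):
        # A reducer is handed only the broadcast id, not the statuses themselves.
        return self.broadcast_id

    def update_map_output(self, map_id, new_location):
        # Assumed behaviour: every migration destroys the old broadcast
        # and publishes a new one.
        self.statuses[map_id] = new_location
        self.store.destroy(self.broadcast_id)
        self.broadcast_id = self.store.publish(dict(self.statuses))


def reducer_fetch(store, reference):
    try:
        return store.fetch(reference)
    except IOError as exc:
        raise RuntimeError(
            f"Unable to deserialize broadcasted map statuses: {exc}"
        ) from exc


store = BroadcastStore()
tracker = ToyMapOutputTracker(store, {0: "executor-1", 1: "executor-2"})

stale_ref = tracker.serialized_reference()   # reducer asks for map statuses
tracker.update_map_output(0, "executor-3")   # a shuffle block migrates meanwhile

try:
    reducer_fetch(store, stale_ref)          # reducer resolves the stale reference
    outcome = "ok"
except RuntimeError as exc:
    outcome = str(exc)

# A reference obtained after the update resolves normally.
fresh = reducer_fetch(store, tracker.serialized_reference())
```

      In this model the failure occurs only when a migration lands between handing out the reference and resolving it, which would be consistent with the errors appearing only while decommissions are in progress.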

       

          People

            Assignee: Unassigned
            Reporter: David Klinberg (david.klinberg)
            Votes: 0
            Watchers: 2