Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-30602 SPIP: Support push-based shuffle to improve shuffle efficiency
  3. SPARK-34840

Fix cases of corruption in merged shuffle blocks that are pushed

Log workAgile BoardRank to TopRank to BottomAttach filesAttach ScreenshotBulk Copy AttachmentsBulk Move AttachmentsVotersWatch issueWatchersConvert to IssueMoveLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete CommentsDelete
    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 3.1.0
    • 3.1.2, 3.2.0
    • Shuffle
    • None

    Description

      The RemoteBlockPushResolver which handles the shuffle push blocks and merges them was introduced in #30062. We have identified 2 scenarios where the merged blocks get corrupted:

      1. StreamCallback.onFailure() is called more than once. Initially we assumed that the onFailure callback will be called just once per stream. However, we observed that this is called twice when a client connection is reset. When the client connection is reset then there are 2 events that get triggered in this order.
      • exceptionCaught. This event is propagated to StreamInterceptorStreamInterceptor.exceptionCaught() invokes callback.onFailure(streamId, cause). This is the first time StreamCallback.onFailure() will be invoked.
      • channelInactive. Since the channel closes, the channelInactive event gets triggered which again is propagated to StreamInterceptorStreamInterceptor.channelInactive() invokes callback.onFailure(streamId, new ClosedChannelException()). This is the second time StreamCallback.onFailure() will be invoked.
      1. The flag isWriting is set prematurely to true. This introduces an edge case where a stream that is trying to merge a duplicate block (created because of a speculative task) may interfere with an active stream if the duplicate stream fails.

      Also adding additional changes that improve the code.

      1. Using positional writes all the time because this simplifies the code and with microbenchmarking haven't seen any performance impact.
      2. Additional minor changes.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            csingh Chandni Singh Assign to me
            csingh Chandni Singh
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment