Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-46512

Optimize shuffle reading when both sort and combine are used.

Rank to TopRank to BottomAttach filesAttach ScreenshotBulk Copy AttachmentsBulk Move AttachmentsVotersWatch issueWatchersCreate sub-taskConvert to sub-taskLinkCloneLabelsUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    Description

      After the shuffle reader obtains the block, it will first perform a combine operation, and then perform a sort operation. It is known that both combine and sort may generate temporary files, so the performance may be poor when both sort and combine are used. In fact, combine operations can be performed during the sort process, and we can avoid the combine spill file.

       

      I did not find any direct api to construct the shuffle which both sort and combine is used. But I can do like following code, here is a wordcount, and the output words is sorted.

      sc.textFile(input).flatMap(_.split(" ")).map(w => (w, 1)).
      reduceByKey(_ + _, 1).
      asInstanceOf[ShuffledRDD[String, Int, Int]].setKeyOrdering(Ordering.String).
      collect().foreach(println) 

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            zhengchenyu Chenyu Zheng
            zhengchenyu Chenyu Zheng
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment