Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-46512

Optimize shuffle reading when both sort and combine are used.

    XMLWordPrintableJSON

Details

    Description

      After the shuffle reader obtains the block, it will first perform a combine operation, and then perform a sort operation. It is known that both combine and sort may generate temporary files, so the performance may be poor when both sort and combine are used. In fact, combine operations can be performed during the sort process, and we can avoid the combine spill file.

       

      I did not find any direct api to construct the shuffle which both sort and combine is used. But I can do like following code, here is a wordcount, and the output words is sorted.

      sc.textFile(input).flatMap(_.split(" ")).map(w => (w, 1)).
      reduceByKey(_ + _, 1).
      asInstanceOf[ShuffledRDD[String, Int, Int]].setKeyOrdering(Ordering.String).
      collect().foreach(println) 

      Attachments

        Issue Links

          Activity

            People

              zhengchenyu Chenyu Zheng
              zhengchenyu Chenyu Zheng
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: