[SPARK-46512] Optimize shuffle reading when both sort and combine are used. - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 4.0.0
Fix Version/s: 4.0.0
Component/s: Shuffle, Spark Core
Labels:
- pull-request-available

Description

After the shuffle reader fetches the blocks, it first performs the combine operation and then the sort operation. Both combine and sort may spill temporary files to disk, so performance can be poor when a shuffle uses both. The combine can instead be performed during the sort, which avoids the combine spill files.
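The idea can be illustrated outside Spark with a minimal plain-Scala sketch. It is not Spark's implementation: the object and method names below are hypothetical, and in-memory collections stand in for the spillable buffers. The first method combines into a hash map and then sorts the result in a second step; the second merges values as records are inserted into a sorted map, so only one buffer is involved.

```scala
import scala.collection.mutable

// Hypothetical illustration only; none of these names exist in Spark.
object CombineWhileSorting {

  // Two phases: combine into a hash map, then sort the combined records.
  // Each phase has its own buffer (in a real shuffle, each could spill).
  def combineThenSort[K: Ordering, V](
      records: Iterator[(K, V)], merge: (V, V) => V): Seq[(K, V)] = {
    val combined = mutable.HashMap.empty[K, V]
    records.foreach { case (k, v) =>
      combined.update(k, combined.get(k).map(merge(_, v)).getOrElse(v))
    }
    combined.toSeq.sortBy(_._1)
  }

  // One phase: the sorted map merges values for equal keys on insert,
  // so sorting and combining share a single buffer.
  def combineWhileSorting[K: Ordering, V](
      records: Iterator[(K, V)], merge: (V, V) => V): Seq[(K, V)] = {
    val sorted = mutable.TreeMap.empty[K, V]
    records.foreach { case (k, v) =>
      sorted.update(k, sorted.get(k).map(merge(_, v)).getOrElse(v))
    }
    sorted.toSeq
  }

  def main(args: Array[String]): Unit = {
    val words = Seq("b", "a", "c", "a", "b", "a").map(w => (w, 1))
    val twoPhase = combineThenSort[String, Int](words.iterator, _ + _)
    val onePhase = combineWhileSorting[String, Int](words.iterator, _ + _)
    println(onePhase.mkString(", ")) // prints (a,3), (b,2), (c,1)
    assert(twoPhase == onePhase)     // both approaches give the same result
  }
}
```

Both methods return the same sorted, combined records; the difference is only in how many intermediate buffers are needed, which is what determines how many kinds of spill files a disk-backed version would write.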

I did not find a direct API for constructing a shuffle that uses both sort and combine, but it can be done as in the following code, a word count whose output words are sorted.

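import org.apache.spark.rdd.ShuffledRDD

// Run in spark-shell, where `sc` is predefined; `input` is the path of a text file.
// reduceByKey sets the combine; setKeyOrdering adds the sort to the same shuffle.
sc.textFile(input).flatMap(_.split(" ")).map(w => (w, 1)).
  reduceByKey(_ + _, 1).
  asInstanceOf[ShuffledRDD[String, Int, Int]].setKeyOrdering(Ordering.String).
  collect().foreach(println)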