[SPARK-27321] SortMerge join on partition keys causes shuffle - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Incomplete
Affects Version/s: 2.3.2
Fix Version/s: None
Component/s: Spark Core
Labels:
- bulk-closed

Description

We got two datasets thats been persisted as follows:

Dataset A:
datasetA.repartition(5,datasetA.col("region"))
.write().mode(saveMode)
.format("parquet")
.partitionBy("region")
.bucketBy(5,"studentId")
.sortBy("studentId")
.option("path", parquetFilesDirectory)
.saveAsTable( database.tableA));

Dataset B:
datasetB.repartition(5,datasetB.col("region"))
.write().mode(saveMode)
.format("parquet")
.partitionBy("region")
.bucketBy(5,"studentId")
.sortBy("studentId")
.option("path", parquetFilesDirectory)
.saveAsTable( database.tableB));

If we do join just with the bucketed column "studentId", there is NO shuffle as expected.

When we join with region and studentId ,we see data shuffle.Below is the join query.

spark.sql("Select * from database.tableA").join(spark.sql("Select * from
database.tableB "), Seq("studentId","region")).show(10)

Note: We cannot use the partition key as a bucket column

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: LSN

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 28/Mar/19 23:21

Updated:: 25/May/21 01:52

Resolved:: 25/May/21 01:39