Details
Type: Bug
Status: Resolved
Priority: Minor
Resolution: Incomplete
Affects Version/s: 2.3.2
Fix Version/s: None
Description
We have two datasets that have been persisted as follows:
Dataset A:
datasetA.repartition(5, datasetA.col("region"))
    .write().mode(saveMode)
    .format("parquet")
    .partitionBy("region")
    .bucketBy(5, "studentId")
    .sortBy("studentId")
    .option("path", parquetFilesDirectory)
    .saveAsTable("database.tableA");
Dataset B:
datasetB.repartition(5, datasetB.col("region"))
    .write().mode(saveMode)
    .format("parquet")
    .partitionBy("region")
    .bucketBy(5, "studentId")
    .sortBy("studentId")
    .option("path", parquetFilesDirectory)
    .saveAsTable("database.tableB");
If we join only on the bucketed column "studentId", there is no shuffle, as expected.
When we join on both region and studentId, we see a data shuffle. Below is the join query.
spark.sql("Select * from database.tableA")
    .join(spark.sql("Select * from database.tableB"), Seq("studentId", "region"))
    .show(10)
Note: We cannot work around this by bucketing on region as well, because the partition key cannot be used as a bucket column.
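
To compare the two joins described above, it can help to count the shuffle operators in each physical plan instead of reading the plan by eye. The following is only a rough sketch: `ShuffleCheck` and `countShuffles` are hypothetical names, not part of Spark, and it assumes that a shuffle shows up in the plan text as an operator line beginning with `Exchange` (as opposed to `BroadcastExchange` or `ReusedExchange`). The plan text itself could be taken from `joined.queryExecution().executedPlan().toString()`, where `joined` stands for the joined Dataset.

```java
// Hypothetical helper, not part of Spark: counts shuffle operators in the
// text form of a physical plan.
final class ShuffleCheck {
    private ShuffleCheck() {}

    // Returns the number of plan lines whose operator is a plain "Exchange".
    // Assumption: shuffles are printed as "Exchange <partitioning>", while
    // "BroadcastExchange" and "ReusedExchange" are not shuffles themselves.
    static int countShuffles(String physicalPlan) {
        int count = 0;
        for (String line : physicalPlan.split("\n")) {
            // Strip the tree-drawing prefix (spaces, ':', '+', '-').
            String operator = line.replaceFirst("^[\\s:+\\-]*", "");
            if (operator.equals("Exchange") || operator.startsWith("Exchange ")) {
                count++;
            }
        }
        return count;
    }
}
```

With this, the join on studentId alone would be expected to report 0, and the join on studentId and region a value greater than 0, if the behaviour matches what is reported here.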