Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-27321

SortMerge join on partition keys causes shuffle

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Minor
    • Resolution: Incomplete
    • 2.3.2
    • None
    • Spark Core

    Description

      We got two datasets thats been persisted as follows: 

      Dataset A: 
      datasetA.repartition(5,datasetA.col("region")) 
                      .write().mode(saveMode) 
                      .format("parquet") 
                      .partitionBy("region") 
                      .bucketBy(5,"studentId") 
                      .sortBy("studentId") 
                      .option("path", parquetFilesDirectory) 
                      .saveAsTable( database.tableA)); 

      Dataset B: 
      datasetB.repartition(5,datasetB.col("region")) 
                      .write().mode(saveMode) 
                      .format("parquet") 
                      .partitionBy("region") 
                      .bucketBy(5,"studentId") 
                      .sortBy("studentId") 
                      .option("path", parquetFilesDirectory) 
                      .saveAsTable( database.tableB)); 

       If we do join just with the bucketed column "studentId", there is NO shuffle as expected. 

      When we join with region and studentId ,we see data shuffle.Below is the join query. 

      spark.sql("Select *  from  database.tableA").join(spark.sql("Select *  from   
      database.tableB "), Seq("studentId","region")).show(10) 

      Note: We cannot use the partition key as a bucket column

      Attachments

        Activity

          People

            Unassigned Unassigned
            lsn24 LSN
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: