Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-30101

spark.sql.shuffle.partitions is not in Configuration docs, but a very critical parameter

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Incomplete
    • 2.4.0, 2.4.4
    • None
    • Spark Core

    Description

      I'm creating a `SparkSession` like this:

      ```
      SparkSession
      .builder().appName("foo").master("local")
      .config("spark.default.parallelism", 2).getOrCreate()
      ```

      when I run

      ```
      ((1 to 10) ++ (1 to 10)).toDS().distinct().count()
      ```

      I get 200 partitions

      ```
      19/12/02 10:29:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 200 tasks
      ...
      19/12/02 10:29:34 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 46 ms on localhost (executor driver) (1/200)
      ```

      It is the `distinct` that is broken since `ds.rdd.getNumPartitions` gives `2`, while `ds.distinct().rdd.getNumPartitions` gives `200`. `ds.rdd.groupBy(identity).map(_._2.head)` and `ds.rdd.distinct()` work correctly.

      Finally I notice that the good old `RDD` interface has a `distinct` that accepts `numPartitions` partitions, while `Dataset` does not.

      .......................

      According to below comments, it uses spark.sql.shuffle.partitions, which needs documenting in configuration.

      > Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.

      in https://spark.apache.org/docs/latest/configuration.html should say

      > Default number of partitions in RDDs, but not DS/DF (see spark.sql.shuffle.partitions) returned by transformations like join, reduceByKey, and parallelize when not set by user.

      Attachments

        Activity

          People

            Unassigned Unassigned
            sams sam
            Votes:
            2 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: