Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-30101

spark.sql.shuffle.partitions is not in Configuration docs, but a very critical parameter

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Incomplete
    • 2.4.0, 2.4.4
    • None
    • Spark Core

    Description

      I'm creating a `SparkSession` like this:

      ```
      SparkSession
      .builder().appName("foo").master("local")
      .config("spark.default.parallelism", 2).getOrCreate()
      ```

      when I run

      ```
      ((1 to 10) ++ (1 to 10)).toDS().distinct().count()
      ```

      I get 200 partitions

      ```
      19/12/02 10:29:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 200 tasks
      ...
      19/12/02 10:29:34 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 46 ms on localhost (executor driver) (1/200)
      ```

      It is the `distinct` that is broken since `ds.rdd.getNumPartitions` gives `2`, while `ds.distinct().rdd.getNumPartitions` gives `200`. `ds.rdd.groupBy(identity).map(_._2.head)` and `ds.rdd.distinct()` work correctly.

      Finally I notice that the good old `RDD` interface has a `distinct` that accepts `numPartitions` partitions, while `Dataset` does not.

      .......................

      According to below comments, it uses spark.sql.shuffle.partitions, which needs documenting in configuration.

      > Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.

      in https://spark.apache.org/docs/latest/configuration.html should say

      > Default number of partitions in RDDs, but not DS/DF (see spark.sql.shuffle.partitions) returned by transformations like join, reduceByKey, and parallelize when not set by user.

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            Unassigned Unassigned
            sams sam
            Votes:
            2 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment