Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-38679

Expose the number partitions in a stage to TaskContext

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 3.2.1
    • 3.4.0
    • Spark Core
    • None

    Description

      Add a new api to expose total partition count in the stage belonging to the task in TaskContext, so that the task knows what fraction of the computation is doing.

      With this extra information, users can also generate 32bit unique int ids as below rather than using `monotonically_increasing_id` which generates 64bit long ids.

       

         rdd.mapPartitions { rowsIter =>
              val partitionId = TaskContext.get().partitionId()
              val numPartitions = TaskContext.get().numPartitions()
              var i = 0
              rowsIter.map { row =>
                val rowId = partitionId + i * numPartitions
                i += 1
                (rowId, row)
             }
        }

       

      Attachments

        Activity

          People

            vkorukanti Venki Korukanti
            vkorukanti Venki Korukanti
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: