  1. Hive
  2. HIVE-20108

Investigate alternatives to groupByKey

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: Spark

    Description

      We use groupByKey for aggregations, or repartitionAndSortWithinPartitions if hive.spark.use.groupby.shuffle is false.

      groupByKey has drawbacks: it can't spill records within a single key group, so all values for one key must fit in memory at once. It also seems to do some unnecessary work in Spark's Aggregator (not positive about this part).

      repartitionAndSortWithinPartitions is better, but the sorting within partitions isn't necessary for aggregations.
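
      The memory concern can be illustrated without Spark. The sketch below is plain Java and uses no Spark or Hive APIs; the class and method names are hypothetical. It contrasts a groupByKey-style aggregation, which buffers every value of a key before reducing, with a hash-based running aggregate that needs neither the buffered group nor a sort. It is one possible direction for the investigation, not something this issue proposes, and it assumes the aggregation can be folded one record at a time (a sum here).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupVsCombine {

    // groupByKey-style: collect all values of a key, then aggregate.
    // Memory per key grows with the number of records for that key.
    static Map<String, Long> sumByGrouping(List<Map.Entry<String, Long>> records) {
        Map<String, List<Long>> groups = new HashMap<>();
        for (Map.Entry<String, Long> r : records) {
            groups.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        Map<String, Long> sums = new HashMap<>();
        for (Map.Entry<String, List<Long>> g : groups.entrySet()) {
            long total = 0;
            for (long v : g.getValue()) {
                total += v;
            }
            sums.put(g.getKey(), total);
        }
        return sums;
    }

    // Combiner-style: fold each record into a running per-key aggregate.
    // Memory per key is constant, and no ordering of records is required.
    static Map<String, Long> sumByCombining(List<Map.Entry<String, Long>> records) {
        Map<String, Long> sums = new HashMap<>();
        for (Map.Entry<String, Long> r : records) {
            sums.merge(r.getKey(), r.getValue(), Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> records = List.of(
                Map.entry("a", 1L), Map.entry("b", 2L), Map.entry("a", 3L));
        // Both approaches yield the same per-key sums (a -> 4, b -> 2).
        System.out.println(sumByGrouping(records));
        System.out.println(sumByCombining(records));
    }
}
```

      Both methods return the same result; the difference is only in how much state is held per key while the records are being consumed.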

          People

            Assignee: Sahil Takiar (stakiar)
            Reporter: Sahil Takiar (stakiar)