[HIVE-20108] Investigate alternatives to groupByKey - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Spark
Labels: None
Target Version/s: 4.0.0

Description

We use groupByKey for aggregations; if hive.spark.use.groupby.shuffle is false, we use repartitionAndSortWithinPartitions instead.

groupByKey has a drawback: it can't spill records within a single key group, so all values for one key have to fit in memory. It also seems to do some unnecessary work in Spark's Aggregator (not positive about this part).

repartitionAndSortWithinPartitions is better, but the sorting within partitions isn't necessary for aggregations.
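
To make the trade-off concrete, here is a minimal plain-Java sketch. It is not Hive or Spark code. The class and method names (AggregationSketch, sumViaGrouping, sumViaPartitioning) are hypothetical, and SUM stands in for an arbitrary aggregation. The first method mimics the groupByKey shape by buffering every value of a key before aggregating. The second only routes records to a partition by key hash and folds each one into a running total, which suggests that an aggregation needs neither the full per-key value list nor sorted input.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: an in-memory stand-in for the shuffle strategies discussed above.
class AggregationSketch {

    // groupByKey-style: collect every value of a key into a list, then aggregate.
    // Memory per key grows with the number of records in that key's group.
    static Map<String, Long> sumViaGrouping(List<Map.Entry<String, Long>> records) {
        Map<String, List<Long>> groups = new HashMap<>();
        for (Map.Entry<String, Long> r : records) {
            groups.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        Map<String, Long> sums = new HashMap<>();
        for (Map.Entry<String, List<Long>> g : groups.entrySet()) {
            long total = 0;
            for (long v : g.getValue()) {
                total += v;
            }
            sums.put(g.getKey(), total);
        }
        return sums;
    }

    // Partition-only style: route each record to a partition by key hash and fold it
    // into a running total. No per-key value list is kept and nothing is sorted.
    static Map<String, Long> sumViaPartitioning(List<Map.Entry<String, Long>> records,
                                                int numPartitions) {
        List<Map<String, Long>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new HashMap<>());
        }
        for (Map.Entry<String, Long> r : records) {
            int p = Math.floorMod(r.getKey().hashCode(), numPartitions);
            partitions.get(p).merge(r.getKey(), r.getValue(), Long::sum);
        }
        // A key always hashes to the same partition, so partition results never overlap.
        Map<String, Long> sums = new HashMap<>();
        for (Map<String, Long> part : partitions) {
            sums.putAll(part);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> records = List.of(
            Map.entry("a", 1L), Map.entry("b", 10L), Map.entry("a", 2L), Map.entry("a", 3L));
        // Both strategies produce the same sums (a=6, b=10).
        System.out.println(sumViaGrouping(records).equals(sumViaPartitioning(records, 4)));
    }
}
```

Whether a partition-only shuffle is the right replacement inside Hive on Spark is exactly what this issue proposes to investigate; the sketch only shows that the result of the aggregation does not depend on grouping or sorting first.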

People

Assignee: Sahil Takiar

Reporter: Sahil Takiar

Votes: 0

Watchers: 3

Dates

Created: 06/Jul/18 20:55

Updated: 11/Jul/18 20:06