[HIVE-20141] Turn hive.spark.use.groupby.shuffle off by default - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Spark
Labels:
None

Target Version/s:

4.0.0

Description

xuefuz any thoughts on this? I think it would provide better out of the box behavior for Hive-on-Spark users, especially for users who are migrating from Hive-on-MR to HoS. Wondering what your experience with this config has been?

I've done a bunch of performance profiling with this config turned on vs. off, and for TPC-DS queries it doesn't make a significant difference. The main difference I can see is that when a Spark stage has to spill to disk, repartitionAndSortWithinPartitions spills more data to disk than groupByKey - my guess is that this happens because groupByKey stores everything in Spark's ExternalAppendOnlyMap (which only stores a single copy of the key for potentially multiple values) whereas repartitionAndSortWithinPartitions uses Spark's ExternalSorter which sorts all the K, V pairs (and thus doesn't de-duplicate keys, which results in more data being spilled to disk).

My understanding is that using repartitionAndSortWithinPartitions for Hive GROUP BYs is similar to what Hive-on-MR does. So disabling this config would provide a similar experience to HoMR. Furthermore, last I checked, groupByKey still can't spill within a row group.

Attachments

Activity

People

Assignee:: Sahil Takiar

Reporter:: Sahil Takiar

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 11/Jul/18 15:12

Updated:: 11/Jul/18 17:48