Description
Currently, orderBy (sortBy) and groupBy in Hive on Spark use unbounded memory. For orderBy, Hive accumulates each key group in an ArrayList (described in HIVE-15527). For groupBy, Hive uses Spark's groupByKey operator, which cannot spill to disk within a key group. Thus, for a large key group, memory usage is also unbounded.
Bounding the memory usage is likely to impact performance. We will profile and optimize afterwards. We could also make the change configurable.
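To illustrate why buffering a whole key group is the problem, here is a minimal sketch of the alternative: if rows reach the reducer already sorted by key, a group can be consumed as a stream, keeping only the current key and a running aggregate. This is not the code in Hive or Spark. The class `StreamingGroupSketch`, the method `sumPerKey`, and the sum aggregate are all hypothetical names chosen for the example, and it assumes the input is already sorted by key.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical illustration only; not Hive's or Spark's actual code.
final class StreamingGroupSketch {

    /**
     * Walks rows that are ASSUMED to be sorted by key and emits one (key, sum)
     * pair per key group. Only the current key and a running sum are held, so
     * memory use does not grow with the size of a key group.
     */
    static <K> void sumPerKey(Iterator<Map.Entry<K, Long>> sortedRows,
                              BiConsumer<K, Long> sink) {
        if (!sortedRows.hasNext()) {
            return; // no rows, no groups
        }
        Map.Entry<K, Long> first = sortedRows.next();
        K currentKey = first.getKey();
        long runningSum = first.getValue();

        while (sortedRows.hasNext()) {
            Map.Entry<K, Long> row = sortedRows.next();
            if (!row.getKey().equals(currentKey)) {
                // Key boundary: flush the finished group and start a new one.
                sink.accept(currentKey, runningSum);
                currentKey = row.getKey();
                runningSum = 0L;
            }
            runningSum += row.getValue();
        }
        sink.accept(currentKey, runningSum); // flush the last group
    }

    /** Small demo over an in-memory list that is already sorted by key. */
    static List<String> demo() {
        List<Map.Entry<String, Long>> rows = List.of(
                Map.entry("a", 1L), Map.entry("a", 2L),
                Map.entry("b", 5L),
                Map.entry("c", 3L), Map.entry("c", 4L));
        List<String> out = new ArrayList<>();
        sumPerKey(rows.iterator(), (key, sum) -> out.add(key + "=" + sum));
        return out;
    }
}
```

By contrast, collecting every value of a key into an ArrayList before processing it needs memory proportional to the largest key group. The streaming form trades that for the cost of sorting by key, which is one plausible source of the performance impact mentioned above.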
Attachments
Issue Links
- incorporates
  - HIVE-15527 Memory usage is unbound in SortByShuffler for Spark (Resolved)
- relates to
  - HIVE-15682 Eliminate per-row based dummy iterator creation (Resolved)
  - HIVE-15683 Make what's done in HIVE-15580 for group by configurable (Resolved)