[SPARK-2048] Optimizations to CPU usage of external spilling code - ASF JIRA

In the external spilling code in ExternalAppendOnlyMap and CoGroupedRDD, there are a few opportunities for optimization:

There are lots of uses of pattern-matching on Tuple2 (e.g. val (k, v) = pair), which we found to be much slower than accessing fields directly
Hash codes for each element are computed many times in StreamBuffer.minKeyHash, which will be expensive for some data types
Uses of buffer.remove() may be expensive if there are lots of hash collisions (better to swap the last element into that position and shrink the buffer by one)
More objects are allocated than is probably necessary, e.g. ArrayBuffers and pairs
Because ExternalAppendOnlyMap is only given one key-value pair at a time, it allocates a new update function for each pair, unlike Aggregator, which passes a single update function to AppendOnlyMap
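
To illustrate two of the points above, here is a minimal standalone sketch, not the actual Spark code. The object name SpillSketch and the helpers swapRemove and sumValues are hypothetical and exist only for illustration. swapRemove shows the "swap in the last element" alternative to buffer.remove(index), and sumValues reads pair._2 directly instead of destructuring with val (k, v) = pair.

```scala
import scala.collection.mutable.ArrayBuffer

object SpillSketch {
  // Hypothetical helper: remove the element at `index` without shifting the
  // tail of the buffer. The last element is moved into the vacated slot, so
  // the cost is constant, but element order is NOT preserved.
  def swapRemove[T](buffer: ArrayBuffer[T], index: Int): T = {
    val elem = buffer(index)
    val last = buffer.size - 1
    buffer(index) = buffer(last) // also correct when index == last
    buffer.remove(last)          // dropping the final slot shifts nothing
    elem
  }

  // Hypothetical helper: sum the values of key-value pairs by accessing the
  // tuple field directly rather than pattern-matching on each Tuple2.
  def sumValues(pairs: ArrayBuffer[(String, Int)]): Int = {
    var total = 0
    var i = 0
    while (i < pairs.size) {
      total += pairs(i)._2
      i += 1
    }
    total
  }
}
```

The swap only makes sense where the order of elements inside the buffer does not matter, which is assumed here.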

These changes should help because workloads that spill are presumably also under a lot of GC pressure in the new generation.
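
As a rough sketch of how per-pair allocation of the update function could be avoided, the example below creates one closure up front and lets it read the current pair through a mutable variable. Everything here is an assumption for illustration: UpdateFunctionSketch, CountingMap, and this insertAll are hypothetical stand-ins built on a plain HashMap, with summation as the merge logic. They are not the real AppendOnlyMap or ExternalAppendOnlyMap.

```scala
import scala.collection.mutable

object UpdateFunctionSketch {
  // Hypothetical stand-in for a map with a changeValue-style API: `update`
  // receives (hadValue, oldValue) and returns the value to store.
  final class CountingMap {
    private val data = new mutable.HashMap[String, Int]

    def changeValue(key: String, update: (Boolean, Int) => Int): Unit = {
      if (data.contains(key)) data(key) = update(true, data(key))
      else data(key) = update(false, 0) // oldValue is ignored when hadValue is false
    }

    def apply(key: String): Int = data(key)
    def size: Int = data.size
  }

  // Takes an iterator of pairs so that a single update function can be
  // reused for every pair instead of allocating a new one each time.
  def insertAll(entries: Iterator[(String, Int)]): CountingMap = {
    val map = new CountingMap
    var curEntry: (String, Int) = null
    // Allocated once; it sees the current pair through `curEntry`.
    val update: (Boolean, Int) => Int = (hadValue, oldValue) =>
      if (hadValue) oldValue + curEntry._2 else curEntry._2
    while (entries.hasNext) {
      curEntry = entries.next()
      map.changeValue(curEntry._1, update)
    }
    map
  }
}
```

For example, UpdateFunctionSketch.insertAll(Iterator(("a", 1), ("b", 2), ("a", 3))) builds a map in which "a" maps to 4 and "b" maps to 2, using a single closure for all three pairs.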

1.	Eliminate pattern-matching on Tuple2 in performance-critical aggregation code	Resolved	Sandy Ryza
2.	CoGroupedRDD unnecessarily allocates a Tuple2 per dep per key	Resolved	Sandy Ryza
3.	Avoid allocating new ArrayBuffer in groupByKey's mergeCombiner	Resolved	Matei Alexandru Zaharia
4.	Use more compact data structures than ArrayBuffer in groupBy and cogroup	Resolved	Matei Alexandru Zaharia
5.	Update ExternalAppendOnlyMap to take an iterator as input	Resolved	Matei Alexandru Zaharia
6.	Update ExternalAppendOnlyMap to avoid buffer.remove()	Resolved	Matei Alexandru Zaharia