[SPARK-22285] Change implementation of ApproxCountDistinctForIntervals to TypedImperativeAggregate - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.3.0
Fix Version/s: 2.3.0
Component/s: SQL
Labels: None

Description

`ApproxCountDistinctForIntervals` is currently implemented as an `ImperativeAggregate`. Its number of `aggBufferAttributes` equals the total number of words in the `hllppHelper` array, and each `hllppHelper` has 52 words with the default `relativeSD`.

Since this aggregate function is used in equi-height histogram generation, and a histogram usually has hundreds of buckets, the number of `aggBufferAttributes` can easily reach tens of thousands or more.
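To make the arithmetic concrete, here is a rough Python sketch of where the 52 words per helper and the "tens of thousands" figure come from. The constants (6-bit registers packed into 64-bit words, a default `relativeSD` of 0.05, and the formula for the precision `p`) are assumptions based on how Spark's HyperLogLog++ helper is commonly described, not values stated in this issue. The function names and the bucket count of 254 are purely illustrative.

```python
import math

# Assumed layout: 6-bit HLL++ registers packed into 64-bit words.
WORD_SIZE_BITS = 64
REGISTER_SIZE_BITS = 6
REGISTERS_PER_WORD = WORD_SIZE_BITS // REGISTER_SIZE_BITS  # 10 registers per word


def words_per_helper(relative_sd=0.05):
    """Words needed by one HLL++ helper (assumed formula, default relativeSD = 0.05)."""
    # Precision p: number of address bits, derived from the target relative error.
    p = math.ceil(2.0 * math.log2(1.106 / relative_sd))
    m = 1 << p  # number of registers
    return m // REGISTERS_PER_WORD + 1


def total_buffer_attributes(num_intervals, relative_sd=0.05):
    """One helper per interval, one buffer attribute per word."""
    return num_intervals * words_per_helper(relative_sd)


if __name__ == "__main__":
    print(words_per_helper())            # words in a single helper
    print(total_buffer_attributes(254))  # hypothetical histogram with 254 buckets
```

Under these assumptions a single helper needs 52 words, so a histogram with a few hundred buckets already requires well over ten thousand buffer attributes.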

This leads to a huge method in codegen and causes errors such as `org.codehaus.janino.JaninoRuntimeException: Code of method "apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB`.

In addition, such huge generated methods cause a performance regression.
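The change named in the title moves the aggregation state out of many individual buffer attributes and into a single typed buffer. The following Python sketch only illustrates that general idea (many 64-bit words packed into one binary value and unpacked again). It is not Spark's code, and the function names `serialize_words` and `deserialize_words` are hypothetical.

```python
import struct


def serialize_words(words):
    """Pack a list of signed 64-bit words into one big-endian binary blob."""
    return struct.pack(f">{len(words)}q", *words)


def deserialize_words(blob):
    """Recover the list of signed 64-bit words from the binary blob."""
    count = len(blob) // 8  # 8 bytes per 64-bit word
    return list(struct.unpack(f">{count}q", blob))


if __name__ == "__main__":
    # Hypothetical state: 3 intervals with 52 words each, held as ONE buffer value.
    words = list(range(3 * 52))
    blob = serialize_words(words)
    assert deserialize_words(blob) == words
    print(len(blob), "bytes in a single buffer field")
```

With this shape the aggregation buffer exposes one binary field regardless of the number of intervals, so the generated projection code no longer grows with the histogram size.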

Issue Links

links to

[Github] Pull Request #19506 (wzhfy)

People

Assignee: Zhenhua Wang

Reporter: Zhenhua Wang

Votes: 0

Watchers: 3

Dates

Created: 16/Oct/17 08:49

Updated: 23/Oct/17 22:12

Resolved: 23/Oct/17 22:03