Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Not A Problem
-
3.0.0
-
None
-
None
Description
For large datasets with tens of thousands of partitions, the current curve down-sampling method tends to generate many more bins than the value set by #numBins.
This happens because the current impl does the grouping within each partition, so every partition contributes at least one bin.
A more reasonable way is to bring the grouping op forward into the sort op: we can then directly use the #bins as the #partitions, and regard one partition as one bin.
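
To make the bin-count argument concrete, here is a small plain-Python sketch (no Spark involved). It is only a simplified model of the behaviour described above, not the actual implementation: the helper names `split_evenly`, `bins_grouped_within_partitions` and `bins_one_per_partition` are hypothetical, and partitions are assumed to be contiguous, near-equal ranges of the sorted scores.

```python
def split_evenly(sorted_scores, num_partitions):
    """Cut already-sorted scores into contiguous, near-equal chunks
    (a stand-in for range partitioning after a sort)."""
    n = len(sorted_scores)
    bounds = [i * n // num_partitions for i in range(num_partitions + 1)]
    return [sorted_scores[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]


def bins_grouped_within_partitions(scores, num_partitions, num_bins):
    """Model of the described behaviour: each partition is chunked on its own,
    so every non-empty partition yields at least one bin."""
    ordered = sorted(scores, reverse=True)
    grouping = max(1, len(ordered) // num_bins)  # target records per bin
    bins = []
    for part in split_evenly(ordered, num_partitions):
        for start in range(0, len(part), grouping):
            bins.append(part[start:start + grouping])
    return bins


def bins_one_per_partition(scores, num_bins):
    """Model of the proposal: partition the sorted scores into num_bins
    ranges and treat each non-empty range as exactly one bin."""
    ordered = sorted(scores, reverse=True)
    return [part for part in split_evenly(ordered, num_bins) if part]


# 100,000 scores spread over 10,000 partitions, with 1,000 bins requested.
scores = [i / 100_000 for i in range(100_000)]
print(len(bins_grouped_within_partitions(scores, 10_000, 1_000)))  # 10000
print(len(bins_one_per_partition(scores, 1_000)))                  # 1000
```

In this model the per-partition grouping returns one bin per partition (10,000) even though only 1,000 were requested, while the one-partition-per-bin variant returns exactly the requested number.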