[SPARK-27925] Better control numBins of curves in BinaryClassificationMetrics - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: 3.0.0
Fix Version/s: None
Component/s: ML
Labels: None

Description

For large datasets with tens of thousands of partitions, the current curve down-sampling method tends to generate far more bins than the value set by numBins.

This happens because, in the current implementation, grouping is done within each partition, so every non-empty partition produces at least one bin.
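
To make the effect concrete, here is a minimal pure-Python sketch of per-partition grouping. It does not call Spark. The function name `bins_per_partition_grouping` and the group-size rule (total point count divided by `num_bins`) are assumptions chosen to model the behavior described above, not a copy of the actual implementation.

```python
def bins_per_partition_grouping(partitions, num_bins):
    """Count the bins produced when grouping happens inside each partition.

    Hypothetical model: `partitions` is a list of lists of sorted scores,
    and the group size is assumed to be total_points // num_bins.
    """
    total = sum(len(p) for p in partitions)
    group_size = total // num_bins
    if group_size < 2:
        # Too few points to down-sample: every point stays its own bin.
        return total
    bins = 0
    for part in partitions:
        # Ceiling division: a non-empty partition always yields >= 1 bin,
        # even when it holds far fewer points than group_size.
        bins += -(-len(part) // group_size)
    return bins


# 1,000 sorted scores spread evenly over 100 partitions, 10 bins requested.
scores = [i / 1000 for i in range(1000)]
partitions = [scores[i:i + 10] for i in range(0, 1000, 10)]
result = bins_per_partition_grouping(partitions, num_bins=10)
print(result)  # 100: one bin per partition instead of the 10 requested
```

With a single partition holding the same 1,000 scores, the same function returns 10, so in this model the overshoot comes entirely from the partition count.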

A more reasonable approach is to move the grouping operation into the sort operation: the number of partitions can then be set directly to the number of bins, with each partition treated as one bin.
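
The proposal can be sketched in the same spirit with plain Python. This is only a model under stated assumptions: the function name `bins_one_per_partition` is hypothetical, and the exact equal-width split stands in for Spark's range partitioning, which samples the data and therefore produces only approximately equal partitions.

```python
def bins_one_per_partition(scores, num_bins):
    """Model of the proposal: sort into num_bins ranges, one bin per range.

    Returns a list of (first_score, point_count) pairs, one per non-empty
    partition. Hypothetical helper, not part of any Spark API.
    """
    ordered = sorted(scores, reverse=True)  # highest score first
    n = len(ordered)
    # Boundaries of num_bins contiguous, near-equal ranges.
    bounds = [n * k // num_bins for k in range(num_bins + 1)]
    partitions = [ordered[bounds[k]:bounds[k + 1]] for k in range(num_bins)]
    # Each non-empty partition is collapsed into exactly one bin.
    return [(p[0], len(p)) for p in partitions if p]


scores = [i / 1000 for i in range(1000)]
bins = bins_one_per_partition(scores, num_bins=10)
print(len(bins))  # 10: the bin count can never exceed num_bins
```

Because the number of bins is bounded by the number of partitions, the result no longer depends on how the input data happened to be partitioned.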

Issue Links

links to

GitHub Pull Request #24775

People

Assignee: Unassigned

Reporter: Ruifeng Zheng

Votes: 0

Watchers: 1

Dates

Created: 03/Jun/19 09:44

Updated: 14/Jun/19 02:41

Resolved: 14/Jun/19 02:41