[SPARK-25911] [spark-ml] Hypothesis testing module - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Won't Fix
Affects Version/s: 3.0.0
Fix Version/s: None
Component/s: ML, MLlib
Labels: None

Description

Why this ticket was created

To assess the feasibility of adding some subset of a hypothesis testing module, mainly in terms of its value proposition, and to get a preliminary opinion on how the idea sounds. I can work on a more comprehensive proposal if, say, it is generally agreed that including a DataFrame API for the t-test makes sense in the o.a.s.ml package.

Current state

There are some streaming implementations in the o.a.s.mllib module, but there are no DataFrame APIs for some standard tests (e.g. the t-test).

Test	Current state	Proposed state
t-test (Welch's, Student's)	streaming only	DataFrame API
chi-squared	streaming, DataFrame/RDD API present	-
ANOVA	-	DataFrame API
Mann-Whitney U test	-	RDD API (the RDD-based API is in maintenance mode, so it probably doesn't make sense to include this)

Rationale

Experimentation platforms are in widespread use, and most of those that operate at scale (a large portion of which use Spark for offline computation) require distributed implementations of hypothesis tests to calculate p-values for different metrics/features. These APIs would enable distributed computation of the relevant statistics and avoid the overhead of moving data (or some downstream view of it) to a framework where such computation is available (R, SciPy).
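
To illustrate why a t-test suits distributed computation, here is a minimal pure-Python sketch, not a Spark API and not the proposed implementation. It assumes each partition can be reduced to a small mergeable summary (count, mean, sum of squared deviations), from which Welch's t statistic and the Welch-Satterthwaite degrees of freedom follow. The names `GroupStats`, `summarize`, and `welch_t`, and the sample data, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class GroupStats:
    """Mergeable per-group summary: count, mean, sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def add(self, x: float) -> "GroupStats":
        # Welford's online update for a single observation.
        n = self.n + 1
        delta = x - self.mean
        mean = self.mean + delta / n
        return GroupStats(n, mean, self.m2 + delta * (x - mean))

    def merge(self, other: "GroupStats") -> "GroupStats":
        # Pairwise combination of two partial summaries (parallel variance).
        if self.n == 0:
            return other
        if other.n == 0:
            return self
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return GroupStats(n, mean, m2)

    @property
    def variance(self) -> float:
        # Unbiased sample variance; requires n >= 2.
        return self.m2 / (self.n - 1)


def summarize(partition) -> GroupStats:
    """Reduce one partition of raw values to a summary."""
    stats = GroupStats()
    for x in partition:
        stats = stats.add(x)
    return stats


def welch_t(a: GroupStats, b: GroupStats):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va, vb = a.variance / a.n, b.variance / b.n
    t = (a.mean - b.mean) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (a.n - 1) + vb ** 2 / (b.n - 1))
    return t, df


# Hypothetical data, pre-split into "partitions" to mimic distributed storage.
control_parts = [[1.0, 2.0, 3.0], [4.0, 5.0]]
treatment_parts = [[2.0, 4.0], [6.0, 8.0, 10.0]]

control = GroupStats()
for part in control_parts:
    control = control.merge(summarize(part))

treatment = GroupStats()
for part in treatment_parts:
    treatment = treatment.merge(summarize(part))

t, df = welch_t(control, treatment)
print(t, df)
```

Only the three-number summaries cross partition boundaries, so the raw data never has to leave the cluster. The p-value would then come from the CDF of a t-distribution with `df` degrees of freedom, which is omitted here.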

People

Assignee: Unassigned

Reporter: Uday Babbar

Votes: 0

Watchers: 1

Dates

Created: 01/Nov/18 18:08

Updated: 02/Mar/19 21:30

Resolved: 02/Mar/19 21:30