Description
Why this ticket was created
The goal is to gauge the feasibility of adding a subset of a hypothesis testing module, mainly from a value-proposition standpoint, and to get a preliminary opinion on how the idea sounds in general. If there is general agreement that a DataFrame API for the t-test makes sense in the o.a.s.ml package, I can work on a more comprehensive proposal.
Current state
There are some streaming implementations in the o.a.s.mllib module, but there are no DataFrame APIs for some standard tests (e.g. the t-test).
Test | Current state | Proposed state |
---|---|---|
t-test (Welch's, Student's) | only streaming | DataFrame API |
chi-squared | streaming, DataFrame/RDD API present | - |
ANOVA | - | DataFrame API |
Mann-Whitney U test | - | RDD API (the RDD-based API is in maintenance mode, so it probably doesn't make sense to include this) |
Rationale
Experimentation platforms are widely used, and most of those that operate at scale (a large portion of which use Spark for offline computation) require distributed implementations of hypothesis tests to calculate p-values for different metrics/features. These APIs would enable distributed computation of the relevant statistics and avoid the overhead of moving the data (or some downstream view of it) to a framework where such computations are available (R, SciPy).
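To make the "distributed computation" point concrete, here is a minimal sketch in plain Python (not Spark code, and not a proposed API) of why Welch's t-test distributes well: each partition only has to emit a count, a mean, and a sum of squared deviations per group, and these partial results can be merged pairwise. The names `GroupStats`, `stats_of`, `merge`, and `welch_t` are hypothetical and chosen only for illustration. The sketch stops at the t statistic and the Welch-Satterthwaite degrees of freedom; turning those into a p-value additionally needs the Student t CDF.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupStats:
    """Sufficient statistics for one group (hypothetical helper type)."""
    n: int        # number of observations
    mean: float   # sample mean
    m2: float     # sum of squared deviations from the mean


def stats_of(values):
    """Compute the statistics for a single partition of raw values."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return GroupStats(n, mean, m2)


def merge(a, b):
    """Combine statistics of two partitions without revisiting the data."""
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return GroupStats(n, mean, m2)


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    var_a = a.m2 / (a.n - 1)   # unbiased sample variance of group a
    var_b = b.m2 / (b.n - 1)   # unbiased sample variance of group b
    se_a = var_a / a.n
    se_b = var_b / b.n
    t = (a.mean - b.mean) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (a.n - 1) + se_b ** 2 / (b.n - 1))
    return t, df


# Control group split across two "partitions", treatment in one.
control = merge(stats_of([1.0, 2.0, 3.0]), stats_of([4.0, 5.0]))
treatment = stats_of([2.0, 4.0, 6.0, 8.0, 10.0])
t_stat, dof = welch_t(control, treatment)
```

Only the small `GroupStats` records cross partition boundaries, so the raw data never has to leave the cluster. This is the overhead the proposal aims to avoid.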