Spark / SPARK-25911

[spark-ml] Hypothesis testing module

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: ML, MLlib
    • Labels: None

    Description

      Why this ticket was created

      This ticket is meant to gauge the feasibility of adding some subset of a hypothesis testing module, mainly in terms of its value proposition, and to get a preliminary opinion on how the idea sounds in general. I can work on a more comprehensive proposal if it is generally agreed that including a DataFrame API for the t-test makes sense in the o.a.s.ml package.

      Current state

      There are some streaming implementations in the o.a.s.mllib module, but there are no DataFrame APIs for some standard tests (e.g. the t-test).

      Test                           Current state                            Proposed state
      t-test (Welch's, Student's)    Only streaming                           DataFrame API
      Chi-squared                    Streaming, DataFrame/RDD API present     -
      ANOVA                          -                                        DataFrame API
      Mann-Whitney U test            -                                        RDD API (in maintenance mode, so it probably doesn't make sense to include this)

      Rationale 

      Experimentation platforms are widely used, and most of those that operate at scale (a large portion of which use Spark for offline computation) require distributed implementations of hypothesis tests to calculate p-values for different metrics/features. These APIs would enable distributed computation of the relevant statistics and avoid the overhead of moving data (or some downstream view of it) to a framework where such computation is available (R, SciPy).
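
      To illustrate why a t-test lends itself to distributed computation, here is a minimal sketch in plain Python (not Spark code, and not a proposed API). It assumes each partition can be reduced to a small summary (count, mean, sum of squared deviations), that summaries can be merged pairwise, and that Welch's t statistic and its degrees of freedom are then computed from the two merged summaries. The names Summary, summarize, merge and welch_t are hypothetical and exist only for this example, as do the sample values.

```python
from dataclasses import dataclass
from functools import reduce
from math import sqrt


@dataclass(frozen=True)
class Summary:
    """Hypothetical per-partition summary: enough to recover mean and variance."""
    n: int
    mean: float
    m2: float  # sum of squared deviations from the mean

    @property
    def variance(self) -> float:
        # Unbiased sample variance; requires n >= 2.
        return self.m2 / (self.n - 1)


def summarize(values):
    """Reduce one partition of raw values to a Summary."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return Summary(n, mean, m2)


def merge(a, b):
    """Combine two summaries without revisiting the raw data
    (pairwise update for count, mean and squared deviations)."""
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Summary(n, mean, m2)


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = a.variance / a.n
    vb = b.variance / b.n
    t = (a.mean - b.mean) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (a.n - 1) + vb ** 2 / (b.n - 1))
    return t, df


# Made-up data: each inner list stands in for one partition of a group.
control_partitions = [[1.0, 2.0], [3.0, 4.0, 5.0]]
treatment_partitions = [[2.0, 4.0, 6.0], [8.0, 10.0]]

control = reduce(merge, map(summarize, control_partitions))
treatment = reduce(merge, map(summarize, treatment_partitions))
t_stat, dof = welch_t(control, treatment)
print(t_stat, dof)
```

      Only the small summaries need to leave the partitions, which is the overhead argument made above. Turning the statistic into a p-value additionally needs the CDF of Student's t distribution, which this sketch leaves out.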

       
       
       

       

          People

            Assignee: Unassigned
            Reporter: Uday Babbar (lambu_atta)
            Votes: 0
            Watchers: 1
