  1. Spark
  2. SPARK-30202

Implement QuantileTransform

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Not A Problem
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: ML, PySpark
    • Labels: None

    Description

      Recently, I encountered some practical scenarios that required mapping data to another distribution.

      I found that QuantileTransformer in sklearn was what I needed, so I fitted a model locally on a sampled dataset and broadcast it to transform the whole dataset in PySpark.

      After that, I implemented QuantileTransform as a new Estimator atop Spark. The implementation follows scikit-learn's, but there are still several differences:

      1. It uses QuantileSummaries for approximation, regardless of the size of the dataset.

      2. It uses linear interpolation, with logic similar to the existing IsotonicRegression, while scikit-learn uses bi-directional interpolation.

      3. When skipZero=true, it treats sparse vectors just like dense ones, while scikit-learn has two different code paths for sparse and dense datasets.
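
      To illustrate the linear interpolation step, here is a minimal pure-Python sketch, not the proposed Spark code. The names fit_quantiles and transform are hypothetical, and exact quantiles of a small sample stand in for the approximate QuantileSummaries. It maps a value onto [0, 1] (a uniform output distribution) by interpolating between neighbouring fitted quantiles.

```python
from bisect import bisect_right


def fit_quantiles(sample, n_quantiles=5):
    """Return n_quantiles evenly spaced quantile values of `sample`.

    Assumption: exact quantiles of an in-memory sample are used here in
    place of an approximate summary computed on the full dataset.
    """
    data = sorted(sample)
    last = len(data) - 1
    quantiles = []
    for i in range(n_quantiles):
        # Fractional position of the i-th quantile in the sorted sample.
        pos = i * last / (n_quantiles - 1)
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        quantiles.append(data[lo] + (data[hi] - data[lo]) * frac)
    return quantiles


def transform(x, quantiles):
    """Map x to [0, 1] by linear interpolation between fitted quantiles."""
    n = len(quantiles)
    # Values outside the fitted range are clipped to the boundaries.
    if x <= quantiles[0]:
        return 0.0
    if x >= quantiles[-1]:
        return 1.0
    # Index of the first quantile strictly greater than x.
    hi = bisect_right(quantiles, x)
    lo = hi - 1
    x0, x1 = quantiles[lo], quantiles[hi]
    # Output levels are evenly spaced in [0, 1], one per quantile.
    y0, y1 = lo / (n - 1), hi / (n - 1)
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


if __name__ == "__main__":
    q = fit_quantiles([1, 2, 3, 4, 5, 6, 7, 8, 9], n_quantiles=5)
    print(q)
    print(transform(4, q))
```

      When several fitted quantiles share the same value, this single-direction lookup resolves the tie toward the upper output level. As far as I understand, that is the case scikit-learn's bi-directional interpolation handles by averaging a forward and a backward pass.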

            People

              Assignee: Unassigned
              Reporter: Ruifeng Zheng (podongfeng)