  1. Spark
  2. SPARK-41008

Isotonic regression result differs from sklearn implementation

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.3.1
    • Fix Version/s: 3.4.0
    • Component/s: MLlib
    • Labels: None

    Description

       

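      import pandas as pd
      from pyspark.sql import SparkSession, functions as F
      from pyspark.sql.types import DoubleType
      from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
      from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark

      # Reuse the active session (e.g. in a notebook) or start one; the snippet below relies on `spark`.
      spark = SparkSession.builder.getOrCreate()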
      
      # The P(positives | model_score):
      # 0.6 -> 0.5 (1 out of the 2 labels is positive)
      # 0.333 -> 0.333 (1 out of the 3 labels is positive)
      # 0.20 -> 0.25 (1 out of the 4 labels is positive)
      tc_pd = pd.DataFrame({
          "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],
          "label": [1, 0, 0, 1, 0, 1, 0, 0, 0],
          "weight": 1,
      })
      
      # The fraction of positives for each of the distinct model_scores would be the best fit.
      # Resulting in the following expected calibrated model_scores:
      # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25]
      
      # The sklearn implementation of Isotonic Regression. 
      tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight'])
      print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score']))
      # >> sklearn: [0.5 0.5 0.33333333 0.33333333 0.33333333 0.25 0.25 0.25 0.25 ]
      
      # The pyspark implementation of Isotonic Regression. 
      tc_df = spark.createDataFrame(tc_pd)
      tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType()))
      
      isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight')
      tc_model = isotonic_regressor_pyspark.fit(tc_df)
      tc_pd = tc_model.transform(tc_df).toPandas()
      print("pyspark:", tc_pd['prediction'].values)
      # >> pyspark: [0.5 0.5 0.33333333 0.33333333 0.33333333 0. 0. 0. 0. ]
      
      # The pyspark result is unexpected: the lowest model_score (0.20) is calibrated to 0 instead of 0.25. Similar small toy examples lead to similarly unexpected results with the pyspark implementation.
      
      # Strangely enough, for 'large' datasets, the difference between the calibrated model_scores generated by the two implementations disappears.
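
For reference, the expected values above can be reproduced without either library. The block below is a minimal sketch of the pool-adjacent-violators approach to weighted isotonic regression; it is not the code used by sklearn or Spark. The function name `pool_adjacent_violators` is made up for this illustration, and the sketch assumes strictly positive weights and that tied scores are first collapsed into their weighted mean label.

```python
from collections import defaultdict


def pool_adjacent_violators(points):
    """Weighted non-decreasing fit over (x, y, weight) triples.

    Returns (x, fitted_value) pairs, one per distinct x, sorted by x.
    Hypothetical helper for illustration only.
    """
    # Step 1: collapse ties in x into a single point (weighted mean label).
    sums = defaultdict(lambda: [0.0, 0.0])  # x -> [sum of w*y, sum of w]
    for x, y, w in points:
        sums[x][0] += w * y
        sums[x][1] += w

    # Step 2: scan in increasing x and pool blocks that violate monotonicity.
    blocks = []  # each block: [sum of w*y, sum of w, distinct x values pooled]
    for x in sorted(sums):
        wy, w = sums[x]
        blocks.append([wy, w, 1])
        # mean(previous) > mean(last), compared by cross-multiplication.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            wy2, w2, n2 = blocks.pop()
            blocks[-1][0] += wy2
            blocks[-1][1] += w2
            blocks[-1][2] += n2

    # Step 3: expand each block back to one fitted value per distinct x.
    fitted = []
    for wy, w, n in blocks:
        fitted.extend([wy / w] * n)
    return list(zip(sorted(sums), fitted))


scores = [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20]
labels = [1, 0, 0, 1, 0, 1, 0, 0, 0]
fit = dict(pool_adjacent_violators([(s, l, 1.0) for s, l in zip(scores, labels)]))
print([round(fit[s], 3) for s in scores])
# [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25]
```

On this toy data the per-score means (0.25, 0.333, 0.5) are already non-decreasing, so no pooling takes place and the fit equals the fraction of positives per distinct model_score. This matches the sklearn output above and differs from the pyspark output only for model_score 0.20.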
      

       

          People

            Assignee: Ahmed Mahran (ahmed.mahran)
            Reporter: Arne Koopman (arne.koopman)