  1. Spark
  2. SPARK-28222

Feature importance outputs different values in GBT and Random Forest in 2.3.3 and 2.4 pyspark version

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 2.4.0, 2.4.1, 2.4.2, 2.4.3
    • Fix Version/s: None
    • Component/s: ML
    • Labels:
      None

      Description

      Feature importance values obtained in a binary classification project differ depending on whether Spark 2.3.3 or 2.4.0 is used. This happens with both Random Forest and GBT. The values that match the sklearn output are the ones from version 2.3.3.

      As an example:

      SPARK 2.4
      MODEL RandomForestClassifier_gini [0.0, 0.4117930839002269, 0.06894132653061226, 0.15857667209786705, 0.2974447311021076, 0.06324418636918638]
      MODEL RandomForestClassifier_entropy [0.0, 0.3864372497988694, 0.06578883597468652, 0.17433924485055197, 0.31754597164210124, 0.055888697733790925]
      MODEL GradientBoostingClassifier [0.0, 0.7555555555555556, 0.24444444444444438, 0.0, 1.4602196686471875e-17, 0.0]

      SPARK 2.3.3
      MODEL RandomForestClassifier_gini [0.0, 0.40957086167800455, 0.06894132653061226, 0.16413222765342259, 0.2974447311021076, 0.05991085303585305]
      MODEL RandomForestClassifier_entropy [0.0, 0.3864372497988694, 0.06578883597468652, 0.18789704501922055, 0.30398817147343266, 0.055888697733790925]
      MODEL GradientBoostingClassifier [0.0, 0.7555555555555555, 0.24444444444444438, 0.0, 2.4326753518951276e-17, 0.0]
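To make the discrepancy easier to see, the following is a minimal sketch that compares the two sets of vectors listed above and reports which feature indices differ. The helper name `differing_indices` and the absolute tolerance of `1e-9` are assumptions chosen for illustration; they are not part of Spark or of the original report.

```python
import math

# Feature importance vectors copied from the report above.
spark_24 = {
    "RandomForestClassifier_gini": [0.0, 0.4117930839002269, 0.06894132653061226,
                                    0.15857667209786705, 0.2974447311021076, 0.06324418636918638],
    "RandomForestClassifier_entropy": [0.0, 0.3864372497988694, 0.06578883597468652,
                                       0.17433924485055197, 0.31754597164210124, 0.055888697733790925],
    "GradientBoostingClassifier": [0.0, 0.7555555555555556, 0.24444444444444438,
                                   0.0, 1.4602196686471875e-17, 0.0],
}
spark_233 = {
    "RandomForestClassifier_gini": [0.0, 0.40957086167800455, 0.06894132653061226,
                                    0.16413222765342259, 0.2974447311021076, 0.05991085303585305],
    "RandomForestClassifier_entropy": [0.0, 0.3864372497988694, 0.06578883597468652,
                                       0.18789704501922055, 0.30398817147343266, 0.055888697733790925],
    "GradientBoostingClassifier": [0.0, 0.7555555555555555, 0.24444444444444438,
                                   0.0, 2.4326753518951276e-17, 0.0],
}


def differing_indices(a, b, abs_tol=1e-9):
    """Hypothetical helper: indices where two vectors differ by more than abs_tol."""
    return [
        i
        for i, (x, y) in enumerate(zip(a, b))
        if not math.isclose(x, y, rel_tol=0.0, abs_tol=abs_tol)
    ]


# Compare each model's 2.4 vector against its 2.3.3 counterpart.
diffs = {name: differing_indices(spark_24[name], spark_233[name]) for name in spark_24}

for name, indices in diffs.items():
    print(name, indices)
```

With this tolerance, the Random Forest vectors differ at indices 1, 3 and 5 (gini) and at indices 3 and 4 (entropy), while the GBT vectors differ only at the level of floating-point rounding and are reported as having no differing indices.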

              People

              • Assignee:
                Unassigned
              • Reporter:
                eneriwrt
              • Votes:
                0
              • Watchers:
                3

                Dates

                • Created:
                  Updated:
                  Resolved: