  Spark / SPARK-45154

PySpark DecisionTreeClassifier: results and tree structure in Spark 3 differ markedly from Spark 2 on the same data with the same hyperparameters


Details

    Description

      Hello,
      I have an engine running on Spark 2 that trains a DecisionTreeClassifier model with CrossValidator.

       

      from pyspark.ml.classification import DecisionTreeClassifier
      from pyspark.ml.evaluation import BinaryClassificationEvaluator
      from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

      dt = DecisionTreeClassifier(maxBins=10000, seed=0)
      cv_dt_evaluator = BinaryClassificationEvaluator(
          metricName="",
          rawPredictionCol="probability",
      )

      # Create param grid and cross validator for model selection
      dt_grid = (
          ParamGridBuilder()
          .addGrid(dt.minInstancesPerNode, [100])
          .addGrid(dt.maxDepth, [10])
          .build()
      )
      cv = CrossValidator(
          estimator=dt,
          estimatorParamMaps=dt_grid,
          evaluator=cv_dt_evaluator,
          parallelism=4,
          numFolds=4,
      )

       

      I want to migrate from Spark 2 to Spark 3. I ran DecisionTreeClassifier on the same data with the same parameter values on both versions, but the results are completely different, especially the tree structure: the trees built by Spark 3 are shallower and have fewer splits. I have read the documentation but have not found an explanation.
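To make the difference concrete, one option is to compare the text dump of the fitted tree from each Spark version (for example `toDebugString` on the best model of the fitted cross validator). The sketch below is a hypothetical helper, not part of PySpark. It assumes the dump puts each split on a line starting with `If (feature N ...`, each leaf on a line starting with `Predict:`, and adds one space of indentation per tree level; the sample text is made up for illustration.

```python
from collections import Counter


def summarize_tree(debug_string):
    """Summarize a decision-tree text dump (layout is an assumption)."""
    splits = 0
    leaves = 0
    features = Counter()
    base_indent = None  # indentation of the root node line
    depth = 0
    for line in debug_string.splitlines():
        stripped = line.lstrip()
        indent = len(line) - len(stripped)
        is_split = stripped.startswith("If (")
        is_leaf = stripped.startswith("Predict:")
        if not (is_split or is_leaf):
            continue  # header line or "Else (...)" line
        if base_indent is None:
            base_indent = indent
        if is_split:
            splits += 1
            # "If (feature 3 <= 0.5)" -> feature index 3
            parts = stripped[len("If ("):].split()
            if len(parts) >= 2 and parts[0] == "feature" and parts[1].isdigit():
                features[int(parts[1])] += 1
        else:
            leaves += 1
            depth = max(depth, indent - base_indent)
    return {
        "splits": splits,
        "leaves": leaves,
        "depth": depth,
        "features": dict(features),
    }


# Made-up dump used only to illustrate the assumed layout.
sample = "\n".join([
    "DecisionTreeClassificationModel: uid=dtc_example, depth=2, numNodes=5",
    "  If (feature 0 <= 0.5)",
    "   Predict: 0.0",
    "  Else (feature 0 > 0.5)",
    "   If (feature 2 <= 1.5)",
    "    Predict: 1.0",
    "   Else (feature 2 > 1.5)",
    "    Predict: 0.0",
])
summary = summarize_tree(sample)
```

Running the helper on the dump saved from each version and comparing the two dictionaries would show whether the trees differ in depth, in number of splits, or in which features are used.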

       

      Can you help me find a solution to this problem?

      Thanks in advance for your help 

              

       


          People

            Assignee: Unassigned
            Reporter: Oumar Nour (oumarnour)