  1. Spark
  2. SPARK-26738

Pyspark random forest classifier feature importance with column names


Details

    • Type: Question
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 2.3.2
    • Fix Version/s: None
    • Component/s: ML

    Description

      I am trying to plot the feature importances of a random forest classifier together with the column names. I am using Spark 2.3.2 and PySpark.

      The input X consists of sentences, and I am using TF-IDF (HashingTF + IDF) and StringIndexer to generate the feature vectors and the indexed labels.
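As a rough illustration of the hashing step mentioned above, the sketch below maps terms to bucket indices and builds a reverse lookup from index to terms. It is not Spark's implementation: `zlib.crc32` is a stand-in hash chosen only so the example runs with the standard library, and `term_index`, `index_to_terms`, and the vocabulary are hypothetical names and data. The point is that a hashed feature index has no natural column name, and several terms can share one index.

```python
import zlib
from collections import defaultdict


def term_index(term, num_features):
    """Map a term to a bucket index (CRC32 is a stand-in, not Spark's hash)."""
    return zlib.crc32(term.encode("utf-8")) % num_features


def index_to_terms(vocabulary, num_features):
    """Build a reverse map: bucket index -> set of terms that land in it."""
    buckets = defaultdict(set)
    for term in vocabulary:
        buckets[term_index(term, num_features)].add(term)
    return dict(buckets)


# Hypothetical vocabulary; five terms in four buckets forces a collision.
vocab = ["spark", "random", "forest", "feature", "importance"]
mapping = index_to_terms(vocab, num_features=4)
for idx in sorted(mapping):
    print(idx, sorted(mapping[idx]))
```

With more terms than buckets, at least one index necessarily corresponds to more than one term, so any index-to-name mapping for hashed features is at best one-to-many.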

      I have included all the stages in a Pipeline:

       

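      from pyspark.ml import Pipeline
      from pyspark.ml.classification import RandomForestClassifier
      from pyspark.ml.feature import HashingTF, IDF, IndexToString, RegexTokenizer, StringIndexer

      # Tokenize the text, hash tokens to term frequencies, then rescale with IDF
      regexTokenizer = RegexTokenizer(gaps=False, inputCol=raw_data_col, outputCol="words", pattern="[a-zA-Z_]+", toLowercase=True, minTokenLength=minimum_token_size)
      hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=number_of_feature)
      idf = IDF(inputCol="rawFeatures", outputCol=feature_vec_col)
      # Index the string labels; IndexToString maps predictions back to the original labels
      indexer = StringIndexer(inputCol=label_col_name, outputCol=label_vec_name)
      converter = IndexToString(inputCol="prediction", outputCol="original_label", labels=indexer.fit(df).labels)
      feature_pipeline = Pipeline(stages=[regexTokenizer, hashingTF, idf, indexer])
      # The classifier reads the label and feature columns produced by the stages above
      estimator = RandomForestClassifier(labelCol=label_vec_name, featuresCol=feature_vec_col, numTrees=100)
      pipeline = Pipeline(stages=[feature_pipeline, estimator, converter])
      model = pipeline.fit(df)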
      

      I obtain the feature importances as follows:

      # The random forest model is the second-to-last stage of the fitted pipeline
      rdc = model.stages[-2]
      print(rdc.featureImportances)
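Because the importances are indexed by position in the feature vector, it can help to rank them by index before attempting any name mapping. The following is a minimal sketch, assuming the importances have been converted to a plain sequence of floats (for example via `rdc.featureImportances.toArray()`); `top_features` and the sample values are hypothetical.

```python
def top_features(importances, k=10):
    """Return the k largest (index, importance) pairs, skipping zeros."""
    ranked = sorted(
        ((i, v) for i, v in enumerate(importances) if v > 0),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


# Hypothetical values standing in for rdc.featureImportances.toArray()
example = [0.0, 0.4, 0.1, 0.0, 0.5]
print(top_features(example, k=2))  # [(4, 0.5), (1, 0.4)]
```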
      

      So far so good, but when I try to map the feature importances to the feature column names as below,

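      from itertools import chain

      # Collect (index, name) pairs from the ML attribute metadata of the features column
      attrs = sorted((attr["idx"], attr["name"]) for attr in chain(*df_pred.schema["featurescol"].metadata["ml_attr"]["attrs"].values()))

      # Pair each named feature with its importance, skipping zero importances
      [(name, rdc.featureImportances[idx])
         for idx, name in attrs
         if rdc.featureImportances[idx]]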

       

      I get a KeyError on ml_attr:

      KeyError: 'ml_attr'

      I then printed the metadata dictionary,

      print(df_pred.schema["featurescol"].metadata)

      and it is empty: {}

      Any thoughts on what I am doing wrong? How can I map the feature importances to the column names?

      Thanks
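The lookup described above can be sketched in a defensive form that returns an empty list instead of raising KeyError when the attribute metadata is missing. This is only an illustration on plain dictionaries: `named_attrs` is a hypothetical helper, and the sample dictionary is made-up data in the shape the original lookup expects (an `ml_attr` entry holding `attrs` grouped by attribute type).

```python
from itertools import chain


def named_attrs(metadata):
    """Return sorted (idx, name) pairs from an ML attribute metadata dict.

    Returns an empty list when the 'ml_attr' or 'attrs' keys are absent,
    instead of raising KeyError.
    """
    groups = metadata.get("ml_attr", {}).get("attrs", {})
    return sorted(
        (attr["idx"], attr["name"])
        for attr in chain(*groups.values())
        if "name" in attr
    )


# Hypothetical metadata with named attributes
with_names = {
    "ml_attr": {
        "attrs": {"numeric": [{"idx": 1, "name": "b"}, {"idx": 0, "name": "a"}]},
        "num_attrs": 2,
    }
}
print(named_attrs(with_names))  # [(0, 'a'), (1, 'b')]
print(named_attrs({}))          # []
```

An empty result here signals that the features column carries no per-index names, in which case there is nothing to join the importances against.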


          People

            Assignee: Unassigned
            Reporter: Praveen (praveen049)
