[SPARK-26738] Pyspark random forest classifier feature importance with column names - ASF JIRA

Details

Type: Question
Status: Resolved
Priority: Major
Resolution: Invalid
Affects Version/s: 2.3.2
Fix Version/s: None
Component/s: ML
Labels:
- RandomForest
- pyspark

Description

I am trying to plot the feature importances of a random forest classifier together with the column names. I am using Spark 2.3.2 and PySpark.

The input X consists of sentences, and I am using TF-IDF (HashingTF + IDF) plus a StringIndexer to generate the feature vectors and labels.

I have included all the stages in a Pipeline:

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import HashingTF, IDF, IndexToString, RegexTokenizer, StringIndexer

# Tokenize, hash to term frequencies, then rescale with IDF
regexTokenizer = RegexTokenizer(gaps=False, inputCol=raw_data_col, outputCol="words", pattern="[a-zA-Z_]+", toLowercase=True, minTokenLength=minimum_token_size)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=number_of_feature)
idf = IDF(inputCol="rawFeatures", outputCol=feature_vec_col)
# Index the string labels, and map predictions back to the original labels
indexer = StringIndexer(inputCol=label_col_name, outputCol=label_vec_name)
converter = IndexToString(inputCol="prediction", outputCol="original_label", labels=indexer.fit(df).labels)
feature_pipeline = Pipeline(stages=[regexTokenizer, hashingTF, idf, indexer])
estimator = RandomForestClassifier(labelCol=label_col, featuresCol=features_col, numTrees=100)
pipeline = Pipeline(stages=[feature_pipeline, estimator, converter])
model = pipeline.fit(df)

I generate the feature importances as follows:

rdc = model.stages[-2]
print (rdc.featureImportances)
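
With a hashed feature space it can help to look at the largest importances by index before attempting any name lookup. The following is a minimal sketch, assuming the importances have already been converted to a plain Python list (for example via `rdc.featureImportances.toArray().tolist()`). `top_importances` is a hypothetical helper name and is not part of PySpark.

```python
def top_importances(importances, k=10):
    """Return up to k (index, importance) pairs with non-zero importance, largest first."""
    nonzero = [(idx, value) for idx, value in enumerate(importances) if value > 0]
    # Sort by importance descending; ties are broken by the lower index first.
    nonzero.sort(key=lambda pair: (-pair[1], pair[0]))
    return nonzero[:k]


# Toy values standing in for rdc.featureImportances.toArray().tolist()
example = [0.0, 0.4, 0.1, 0.0, 0.5]
print(top_importances(example, k=2))
```

The returned indices are positions in the feature vector, so for HashingTF output they identify hash buckets rather than individual words.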

So far so good, but when I try to map the feature importances to the feature columns as below,

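from itertools import chain

# Collect (index, name) pairs from the ML attribute metadata of the features column
attrs = sorted((attr["idx"], attr["name"]) for attr in (chain(*df_pred.schema["featurescol"].metadata["ml_attr"]["attrs"].values())))

# Keep only the features with non-zero importance
[(name, rdc.featureImportances[idx])
   for idx, name in attrs
   if rdc.featureImportances[idx]]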

I get a KeyError on ml_attr:

KeyError: 'ml_attr'

I then printed the metadata dictionary,

print (df_pred.schema["featurescol"].metadata)

and it is empty: {}
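
The lookup can be made tolerant of missing metadata. The following is a sketch only, not a fix from the Spark project. It assumes `metadata` is the plain dict printed above and that, when present, `ml_attr["attrs"]` maps attribute types to lists of `{"idx", "name"}` entries, which is the shape the comprehension in this report expects. `name_importances` and the `feature_<idx>` placeholder naming are hypothetical.

```python
from itertools import chain


def name_importances(metadata, importances):
    """Pair each non-zero importance with a column name, or a placeholder if none is recorded."""
    attrs = metadata.get("ml_attr", {}).get("attrs", {})
    # Flatten the per-type attribute lists into a {index: name} mapping.
    names = {attr["idx"]: attr["name"] for attr in chain(*attrs.values())}
    return [
        (names.get(idx, "feature_{}".format(idx)), value)
        for idx, value in enumerate(importances)
        if value
    ]


# Empty metadata, as in this report: falls back to index-based placeholders.
print(name_importances({}, [0.0, 0.7, 0.3]))

# Metadata with named attributes (toy data): the recorded names are used.
meta = {"ml_attr": {"attrs": {"numeric": [{"idx": 0, "name": "age"},
                                          {"idx": 1, "name": "height"}]}}}
print(name_importances(meta, [0.25, 0.75]))
```

Note that the placeholders do not recover words: HashingTF maps terms to vector positions by hashing, so an index alone does not identify the term that produced it.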

Any thoughts on what I am doing wrong? How can I map the feature importances to the column names?

Thanks

People

Assignee: Unassigned

Reporter: Praveen

Votes: 0

Watchers: 1

Dates

Created: 26/Jan/19 11:12

Updated: 12/Dec/22 18:10

Resolved: 28/Jan/19 10:20