Spark / SPARK-26579

SparkML DecisionTree, how does the algorithm identify categorical features?


Details

    • Type: Question
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None
    • Environment: OS: CentOS 7; software: pyspark

    Description

  I am confused about the decision tree and other tree-based models. My current project involves data with both nominal and continuous features. I converted the nominal features to numeric indices using the StringIndexer transformer from the ml.feature module, then assembled all the feature values into a single vector column named features with VectorAssembler. As far as I can tell, every element of the feature vector is of double datatype.

  I kept getting the error that maxBins should be larger than the largest number of categories in any categorical feature. After I increased maxBins accordingly, the pipeline ran, even though some features (continuous from the beginning) still take values bigger than maxBins. Since the pipeline works with a maxBins that is smaller than some of those continuous values, the algorithm must automatically pick which features are categorical and which are continuous. But how does it figure out which is which, when all of the features are doubles?
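To make the question concrete: my guess is that the tree does not look at the double values themselves, but at per-column metadata, since StringIndexer marks its output as a nominal attribute with a known number of categories and VectorAssembler carries that metadata into the features vector. Here is a rough pure-Python sketch of that idea (the function names and dict layout are invented for illustration; this is not Spark's actual code or API):

```python
# Illustrative sketch (not Spark source): deciding categorical vs. continuous
# from per-slot metadata rather than from the raw double values.
# A slot produced by an indexer carries a "nominal" mark with a category
# count; a plain numeric slot carries no such mark.

def categorical_features(metadata):
    """Map feature index -> number of categories, for nominal slots only."""
    return {i: m["numValues"]
            for i, m in enumerate(metadata)
            if m.get("type") == "nominal"}

def check_max_bins(metadata, max_bins):
    """maxBins must cover the largest categorical cardinality."""
    cats = categorical_features(metadata)
    largest = max(cats.values(), default=0)
    if max_bins < largest:
        raise ValueError(
            f"maxBins ({max_bins}) should be >= largest number of "
            f"categories in any categorical feature ({largest})")
    return cats

# Feature 0 came from an indexer (5 categories); feature 1 is raw numeric.
meta = [{"type": "nominal", "numValues": 5}, {"type": "numeric"}]
print(check_max_bins(meta, max_bins=32))   # {0: 5}
```

Under this reading, a continuous feature taking values larger than maxBins is fine, because maxBins only constrains categorical cardinalities and the number of candidate thresholds, not the feature values themselves.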

  Another question, if anyone can help: what kind of tree does the Spark decision tree build? Is it CART or something else?
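For context on what I mean by CART: the impurity measure I associate with CART-style classification trees is Gini impurity, where a split is chosen to minimize the weighted impurity of the children. A minimal sketch of the criterion itself (general CART technique, not Spark internals):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over class proportions p_k,
    the classification criterion used by CART-style trees."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini([0, 0, 1, 1]))  # 0.5 (maximally impure two-class node)
print(gini([1, 1, 1, 1]))  # 0.0 (pure node)
```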

  Last question: what is the procedure for treating categorical features in tree-based algorithms?
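To illustrate the kind of procedure I am asking about: in the general CART technique (I am not claiming this is exactly what Spark does), a categorical split sends a subset of categories to the left child, and for binary classification or regression, ordering the K categories by their mean label reduces the 2^(K-1) - 1 possible subsets to only K - 1 candidate prefix splits. A sketch of that reduction:

```python
# Sketch of the classic CART trick for categorical features:
# sort categories by mean label, then only consider splits that send
# a prefix of that ordering to the left child (K-1 candidates).

from collections import defaultdict

def ordered_category_splits(categories, labels):
    """Return the K-1 candidate left-child subsets after sorting
    categories by their mean label."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, labels):
        sums[c] += y
        counts[c] += 1
    order = sorted(counts, key=lambda c: sums[c] / counts[c])
    # Each candidate split sends a prefix of the ordering to the left.
    return [set(order[:i]) for i in range(1, len(order))]

cats = ["a", "b", "c", "a", "b", "c"]
ys   = [0, 1, 1, 0, 1, 0]
# Mean labels: a=0.0, c=0.5, b=1.0, so the candidates are the prefixes
# {a} and {a, c} rather than all subsets of {a, b, c}.
print(ordered_category_splits(cats, ys))
```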

      Thank you in advance.


People

    Assignee: Unassigned
    Reporter: Xufeng Wang (MayFunNow)
    Votes: 0
    Watchers: 0
