Spark / SPARK-26579

SparkML DecisionTree, how does the algorithm identify categorical features?


Details

    • Type: Question
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 2.4.0
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None
    • Environment: OS: CentOS 7; software: pyspark

    Description

  I am confused about the decision tree and other tree-based models. My current project involves data with both nominal and continuous features. I converted the nominal features to numeric indices using the StringIndexer transformer from the ml.feature module, then assembled all the feature values into a single vector column named features with VectorAssembler. As far as I can tell, every element of the feature vector is of double datatype.

  I kept getting the error that maxBins should be larger than the largest number of categories in any categorical feature. After I increased maxBins accordingly, the pipeline ran, even though some features (continuous from the beginning) still take values bigger than maxBins. Since the pipeline works with a maxBins that is smaller than some of those continuous values, the algorithm must automatically pick which features are categorical and which are continuous. But how does it figure out which is which, when all of the features are doubles?
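To make the question concrete: my guess is that the tree does not look at the double values themselves, but at per-column metadata, since StringIndexer marks its output as a nominal attribute with a known number of categories and VectorAssembler carries that metadata into the features vector. Here is a rough pure-Python sketch of that idea (the function names and dict layout are invented for illustration; this is not Spark's actual code or API):

```python
# Illustrative sketch (not Spark source): deciding categorical vs. continuous
# from per-slot metadata rather than from the raw double values.
# A slot produced by an indexer carries a "nominal" mark with a category
# count; a plain numeric slot carries no such mark.

def categorical_features(metadata):
    """Map feature index -> number of categories, for nominal slots only."""
    return {i: m["numValues"]
            for i, m in enumerate(metadata)
            if m.get("type") == "nominal"}

def check_max_bins(metadata, max_bins):
    """maxBins must cover the largest categorical cardinality."""
    cats = categorical_features(metadata)
    largest = max(cats.values(), default=0)
    if max_bins < largest:
        raise ValueError(
            f"maxBins ({max_bins}) should be >= largest number of "
            f"categories in any categorical feature ({largest})")
    return cats

# Feature 0 came from an indexer (5 categories); feature 1 is raw numeric.
meta = [{"type": "nominal", "numValues": 5}, {"type": "numeric"}]
print(check_max_bins(meta, max_bins=32))   # {0: 5}
```

Under this reading, a continuous feature taking values larger than maxBins is fine, because maxBins only constrains categorical cardinalities and the number of candidate thresholds, not the feature values themselves.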

  Another question, if anyone can help: what kind of tree does the Spark decision tree build? Is it CART or something else?
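For context on what I mean by CART: the impurity measure I associate with CART-style classification trees is Gini impurity, where a split is chosen to minimize the weighted impurity of the children. A minimal sketch of the criterion itself (general CART technique, not Spark internals):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over class proportions p_k,
    the classification criterion used by CART-style trees."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini([0, 0, 1, 1]))  # 0.5 (maximally impure two-class node)
print(gini([1, 1, 1, 1]))  # 0.0 (pure node)
```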

  Last question: what is the procedure for treating categorical features in tree-based algorithms?
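To illustrate the kind of procedure I am asking about: in the general CART technique (I am not claiming this is exactly what Spark does), a categorical split sends a subset of categories to the left child, and for binary classification or regression, ordering the K categories by their mean label reduces the 2^(K-1) - 1 possible subsets to only K - 1 candidate prefix splits. A sketch of that reduction:

```python
# Sketch of the classic CART trick for categorical features:
# sort categories by mean label, then only consider splits that send
# a prefix of that ordering to the left child (K-1 candidates).

from collections import defaultdict

def ordered_category_splits(categories, labels):
    """Return the K-1 candidate left-child subsets after sorting
    categories by their mean label."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, labels):
        sums[c] += y
        counts[c] += 1
    order = sorted(counts, key=lambda c: sums[c] / counts[c])
    # Each candidate split sends a prefix of the ordering to the left.
    return [set(order[:i]) for i in range(1, len(order))]

cats = ["a", "b", "c", "a", "b", "c"]
ys   = [0, 1, 1, 0, 1, 0]
# Mean labels: a=0.0, c=0.5, b=1.0, so the candidates are the prefixes
# {a} and {a, c} rather than all subsets of {a, b, c}.
print(ordered_category_splits(cats, ys))
```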

      Thank you in advance.


People

    Assignee: Unassigned
    Reporter: Xufeng Wang (MayFunNow)
    Votes: 0
    Watchers: 0
