Spark / SPARK-5688

Splits for Categorical Variables in DecisionTrees

Details

    • Improvement
    • Status: Resolved
    • Minor
    • Resolution: Not A Problem
    • 1.2.0
    • None
    • MLlib
    • Any

    Description

      The subsets of categories used to build candidate splits on a categorical variable are not chosen at random. Each subset is derived from the binary representation of an integer between 1 and (2^(number of categories)) - 2, a range that excludes the empty subset and the full subset. In the current implementation, the integers used for the subsets are 1..numSplits. They should be drawn at random instead, because always taking the lowest integers biases the subsets towards the categories with the lower indexes.
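
To make the bias concrete, here is a minimal Python sketch of the mapping described above. It is not MLlib's code: the function names, the bit order (bit c set means category c is in the subset), and the sampling approach are all assumptions made for illustration.

```python
import random


def subset_from_index(index, num_categories):
    """Return the categories whose bit is set in the binary form of index.

    Assumed convention for this sketch: bit c set means category c is included.
    """
    return [c for c in range(num_categories) if (index >> c) & 1]


def sequential_subsets(num_categories, num_splits):
    """Subsets for the integers 1..num_splits, the behaviour the report describes."""
    return [subset_from_index(i, num_categories) for i in range(1, num_splits + 1)]


def random_subsets(num_categories, num_splits, seed=None):
    """Subsets for distinct integers drawn uniformly from 1..2^k - 2.

    Hypothetical alternative; intended for small category counts only.
    """
    rng = random.Random(seed)
    max_index = 2 ** num_categories - 2  # excludes the empty and the full subset
    count = min(num_splits, max_index)
    indices = rng.sample(range(1, max_index + 1), count)
    return [subset_from_index(i, num_categories) for i in indices]
```

With 10 categories and 4 splits, `sequential_subsets(10, 4)` yields `[[0], [1], [0, 1], [2]]`, so categories 3 through 9 never appear in any candidate subset. `random_subsets(10, 4)` draws from all 1022 non-trivial subsets instead.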
      Another problem is that when numBins/2 is larger than the number of possible subsets for a given category set, numSplits is still taken to be numBins/2. It should be the minimum of numBins/2 and (2^(number of categories)) - 2; otherwise the same subsets might be considered more than once when choosing the splits.
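
The proposed cap can be sketched in a few lines of Python. The function name is hypothetical, and integer division is assumed for numBins/2; this illustrates the arithmetic in the report, not the MLlib implementation.

```python
def num_categorical_splits(num_bins, num_categories):
    """Cap the split count at the number of distinct non-trivial subsets."""
    # 2^k subsets in total, minus the empty subset and the full subset.
    max_subsets = 2 ** num_categories - 2
    return min(num_bins // 2, max_subsets)
```

For example, with numBins = 32 and 3 categories, numBins/2 is 16 but only 2^3 - 2 = 6 distinct subsets exist, so `num_categorical_splits(32, 3)` returns 6. With 10 categories there are 1022 subsets, and the result stays at 16.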

          People

            Assignee: Unassigned
            Reporter: Eric Denovitzer (edenovit)
