Description
In the Scala doc (https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.ml.classification.RandomForestClassifier), we have:
The number of features to consider for splits at each tree node. Supported options:
- "auto": Choose automatically for task: If numTrees == 1, set to "all." If numTrees > 1 (forest), set to "sqrt" for classification and to "onethird" for regression.
- "all": use all features
- "onethird": use 1/3 of the features
- "sqrt": use sqrt(number of features)
- "log2": use log2(number of features)
- "n": when n is in the range (0, 1.0], use n * number of features. When n is in the range (1, number of features), use n features. (default = "auto")
These various settings are based on the following references:
- log2: tested in Breiman (2001)
- sqrt: recommended by Breiman's manual for random forests
- The defaults of sqrt (classification) and onethird (regression) match the R randomForest package.
This entire paragraph is missing from the PySpark doc (https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#pyspark.ml.classification.RandomForestClassifier.featureSubsetStrategy). The same issue exists for SparkR (https://github.com/apache/spark/blob/master/R/pkg/R/mllib_tree.R#L365).
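For reference, the parameter itself is already exposed on the Python side; only the explanatory text above is absent from the docstring. A minimal PySpark sketch (assuming Spark 2.3+; the toy dataset and app name are purely illustrative):

```python
# Minimal sketch: featureSubsetStrategy is accepted by the PySpark API
# even though its docstring lacks the explanation quoted above.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("featureSubsetStrategyDemo").getOrCreate()

# Two-row toy training set using the default "label"/"features" column names.
train = spark.createDataFrame([
    (0.0, Vectors.dense(0.0, 1.0, 0.0)),
    (1.0, Vectors.dense(1.0, 0.0, 1.0)),
], ["label", "features"])

# "sqrt" considers sqrt(number of features) features per split; a string
# fraction such as "0.5" would consider 0.5 * number of features instead.
rf = RandomForestClassifier(numTrees=3, featureSubsetStrategy="sqrt")
model = rf.fit(train)
print(rf.getFeatureSubsetStrategy())  # -> "sqrt"
```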