[SPARK-10433] Gradient boosted trees: increasing input size in 1.4 - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.4.1
Fix Version/s: 1.5.0
Component/s: MLlib
Labels: None

Description

(Sorry to say I don't have any leads on a fix, but this was reported by three different people and I confirmed it at fairly close range, so I think it's legitimate.)

This is probably best explained in the words from the mailing list thread at http://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/%3C55E84380.2000408%40gmail.com%3E . Matt Forbes says:

I am training a boosted trees model on a couple million input samples (with around 300 features) and am noticing that the input size of each stage is increasing each iteration. For each new tree, the first step seems to be building the decision tree metadata, which does a .count() on the input data, so this is the step I've been using to track the input size changing. Here is what I'm seeing:

count at DecisionTreeMetadata.scala:111 
1. Input Size / Records: 726.1 MB / 1295620 
2. Input Size / Records: 106.9 GB / 64780816 
3. Input Size / Records: 160.3 GB / 97171224 
4. Input Size / Records: 214.8 GB / 129680959 
5. Input Size / Records: 268.5 GB / 162533424 
.... 
Input Size / Records: 1912.6 GB / 1382017686 
....

This step goes from taking less than 10s up to 5 minutes by the 15th or so iteration. I'm not quite sure what could be causing this. I am passing a memory-only cached RDD[LabeledPoint] to GradientBoostedTrees.train
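The record counts quoted above can be checked with a little arithmetic. The sketch below is plain Scala with no Spark dependency; the object and value names are hypothetical, and the numbers are copied from the first five stages listed in the report. It computes how many extra records each stage reads compared with the previous one, and each stage's count as a multiple of the first.

```scala
object InputGrowth {
  // Record counts reported for stages 1 to 5 of "count at DecisionTreeMetadata.scala:111".
  val records: List[Long] =
    List(1295620L, 64780816L, 97171224L, 129680959L, 162533424L)

  // Extra records read by each stage compared with the previous stage.
  val deltas: List[Long] =
    records.zip(records.tail).map { case (prev, next) => next - prev }

  // Each stage's record count as a multiple of the stage-1 count.
  val multiples: List[Double] = records.map(_.toDouble / records.head)

  def main(args: Array[String]): Unit = {
    records.indices.foreach { i =>
      val delta = if (i == 0) "-" else deltas(i - 1).toString
      println(f"stage ${i + 1}: records=${records(i)}%d delta=$delta%s multiple=${multiples(i)}%.1f")
    }
  }
}
```

After the initial jump, each stage reads roughly 32 to 33 million more records than the one before it, so in these figures the input grows about linearly with the iteration number rather than staying at the size of the cached RDD.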

Johannes Bauer showed me a very similar problem.

Peter Rudenko offers this sketch of a reproduction:

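import scala.collection.JavaConversions.mapAsJavaMap

import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

// Not shown in the original sketch and assumed to be defined elsewhere:
//   instances: RDD[LabeledPoint]         (the training data)
//   categoricalFeatures: Map[Int, Int]   (feature index -> number of categories)
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.setNumIterations(30)
boostingStrategy.setLearningRate(1.0)
boostingStrategy.treeStrategy.setMaxDepth(3)
boostingStrategy.treeStrategy.setMaxBins(128)
boostingStrategy.treeStrategy.setSubsamplingRate(1.0)
boostingStrategy.treeStrategy.setMinInstancesPerNode(1)
boostingStrategy.treeStrategy.setUseNodeIdCache(true)
boostingStrategy.treeStrategy.setCategoricalFeaturesInfo(
  mapAsJavaMap(categoricalFeatures).asInstanceOf[java.util.Map[java.lang.Integer, java.lang.Integer]])

val model = GradientBoostedTrees.train(instances, boostingStrategy)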
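This issue was resolved as a duplicate of SPARK-6684 (Add checkpointing to GradientBoostedTrees), fixed in 1.5.0. As a rough sketch only, a variant of the reproduction with checkpointing enabled might look like the following. It assumes that, on 1.5.0 or later, the checkpointing is driven by the tree strategy's checkpointInterval together with a checkpoint directory set on the SparkContext; the object name, helper names, and the interval value are hypothetical and not taken from the report.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.rdd.RDD

object GbtCheckpointSketch {
  // Hypothetical helper: a classification strategy with a checkpoint interval set.
  def checkpointedStrategy(interval: Int): BoostingStrategy = {
    val strategy = BoostingStrategy.defaultParams("Classification")
    strategy.setNumIterations(30)
    strategy.treeStrategy.setMaxDepth(3)
    // Assumption: this interval controls how often intermediate RDDs are checkpointed.
    strategy.treeStrategy.setCheckpointInterval(interval)
    strategy
  }

  // Hypothetical helper: set a checkpoint directory, then train as in the reproduction.
  def train(
      sc: SparkContext,
      instances: RDD[LabeledPoint],
      checkpointDir: String): GradientBoostedTreesModel = {
    // Without a checkpoint directory, Spark has nowhere to write checkpoints.
    sc.setCheckpointDir(checkpointDir)
    GradientBoostedTrees.train(instances, checkpointedStrategy(10))
  }
}
```

Whether this removes the growth in input size for a given job would need to be confirmed by watching the same "count at DecisionTreeMetadata.scala" stages in the Spark UI.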

Issue Links

duplicates: SPARK-6684 Add checkpointing to GradientBoostedTrees (Resolved)
is duplicated by: SPARK-10616 GradientBoostedTrees stuck with 2958359 features train data (Resolved)
is duplicated by: SPARK-10629 Gradient boosted trees: mapPartitions input size increasing (Resolved)

People

Assignee: Unassigned

Reporter: Sean R. Owen

Votes: 1

Watchers: 3

Dates

Created: 03/Sep/15 13:51

Updated: 24/Mar/16 11:52

Resolved: 24/Mar/16 11:52