Spark / SPARK-29818 Missing persist on RDD / SPARK-29810

Missing persist on retaggedInput in RandomForest.run()

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.4.3
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None

    Description

      The RDD retaggedInput should be persisted in ml.tree.impl.RandomForest.run(), because it is used by more than one action. Without persist(), each action recomputes the RDD from its lineage.

        def run(
            input: RDD[LabeledPoint],
            strategy: OldStrategy,
            numTrees: Int,
            featureSubsetStrategy: String,
            seed: Long,
            instr: Option[Instrumentation],
            prune: Boolean = true, // exposed for testing only, real trees are always pruned
            parentUID: Option[String] = None): Array[DecisionTreeModel] = {
      
          val timer = new TimeTracker()
          timer.start("total")
          timer.start("init")
          val retaggedInput = input.retag(classOf[LabeledPoint]) // it needs to be persisted
      

      This issue was reported by our tool CacheCheck, which dynamically detects misuses of the persist()/unpersist() API.
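
      To illustrate the general pattern behind this report, here is a minimal sketch, not the actual RandomForest code and not a proposed patch. It assumes a local Spark session; the object PersistSketch and the helper withPersisted are hypothetical names introduced only for this example. The helper persists an RDD that is consumed by several actions, skips persisting if the caller already chose a storage level, and unpersists afterwards. An accumulator counts how often the upstream map function runs.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistSketch {

  // Hypothetical helper (not part of Spark): persist `rdd` while `body` runs,
  // unless the caller has already assigned a storage level to it.
  def withPersisted[T, R](rdd: RDD[T])(body: RDD[T] => R): R = {
    val handlePersistence = rdd.getStorageLevel == StorageLevel.NONE
    if (handlePersistence) rdd.persist(StorageLevel.MEMORY_AND_DISK)
    try body(rdd)
    finally if (handlePersistence) rdd.unpersist()
  }

  // Runs two actions on a mapped RDD and returns how many times the map
  // function was evaluated, as counted by an accumulator.
  def evaluations(spark: SparkSession, usePersist: Boolean): Long = {
    val sc = spark.sparkContext
    val calls = sc.longAccumulator("map calls")
    val mapped = sc.parallelize(1 to 100, 4).map { x => calls.add(1L); x * 2 }
    // Two separate actions on the same RDD: count() and reduce().
    val twoActions: RDD[Int] => Unit = { r => r.count(); r.reduce(_ + _); () }
    if (usePersist) withPersisted(mapped)(twoActions) else twoActions(mapped)
    calls.value
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("persist-sketch")
      .getOrCreate()
    try {
      println(s"without persist: ${evaluations(spark, usePersist = false)} map calls")
      println(s"with persist:    ${evaluations(spark, usePersist = true)} map calls")
    } finally spark.stop()
  }
}
```

      In a local run one would expect the unpersisted variant to evaluate the map function roughly twice as often, once per action. Whether persisting pays off in a given case depends on how expensive the lineage is to recompute compared with the memory and disk cost of caching.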


            People

              Assignee: Unassigned
              Reporter: IcySanwitch (spark_cachecheck)
              Votes: 0
              Watchers: 2
