Spark / SPARK-29856

Conditional unnecessary persist on RDDs in ML algorithms

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: ML, MLlib
    • Labels: None

    Description

      When I run example.ml.GradientBoostedTreeRegressorExample, I find that the RDD baggedInput in ml.tree.impl.RandomForest.run() is persisted but used only once, so the persist operation is unnecessary in that run.

          val baggedInput = BaggedPoint
            .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, withReplacement,
              (tp: TreePoint) => tp.weight, seed = seed)
            .persist(StorageLevel.MEMORY_AND_DISK)
          ...
          while (nodeStack.nonEmpty) {
            ...
            timer.start("findBestSplits")
            RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, nodesForGroup,
              treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache)
            timer.stop("findBestSplits")
          }
          baggedInput.unpersist()
      

      However, the action on baggedInput sits inside a while loop.
      In GradientBoostedTreeRegressorExample, this loop executes only once, so only one action uses baggedInput.
      In most ML applications, the loop executes many times, so baggedInput is used by many actions and the persist is necessary.
      That is why the persist operation is only "conditionally" unnecessary.
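
      The "conditional" idea can be sketched as a small helper that persists an RDD only when more than one action is expected to use it. This is an illustrative sketch, not the change made in Spark: the names ConditionalPersist, withConditionalPersist, and expectedActions are hypothetical, and the sketch assumes the caller can estimate the number of actions up front. In RandomForest.run() that number depends on how nodeStack evolves, so such an estimate may not be available there.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper, not part of Spark.
object ConditionalPersist {

  /**
   * Runs `body` on `rdd`, persisting the RDD first only if it is expected
   * to be used by more than one action and is not already persisted.
   * `expectedActions` is an estimate supplied by the caller (an assumption
   * of this sketch).
   */
  def withConditionalPersist[T, R](
      rdd: RDD[T],
      expectedActions: Int,
      level: StorageLevel = StorageLevel.MEMORY_AND_DISK)(body: RDD[T] => R): R = {
    val shouldPersist =
      expectedActions > 1 && rdd.getStorageLevel == StorageLevel.NONE
    if (shouldPersist) rdd.persist(level)
    try {
      body(rdd)
    } finally {
      // Only release what this helper persisted itself.
      if (shouldPersist) rdd.unpersist()
    }
  }
}
```

      With expectedActions = 1 the RDD is left unpersisted and computed once; with a larger value it is persisted for the duration of body and unpersisted afterwards.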

      The same situation exists in many other ML algorithms, e.g., the RDD instances in ml.clustering.KMeans.fit() and the RDD indices in mllib.clustering.BisectingKMeans.run().

      This issue was reported by our tool CacheCheck, which dynamically detects misuses of the persist()/unpersist() API.

            People

              Assignee: Unassigned
              Reporter: IcySanwitch (spark_cachecheck)