Spark / SPARK-29827

Wrong persist strategy in mllib.clustering.BisectingKMeans.run


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.4.3
    • Fix Version/s: None
    • Component/s: MLlib
    • Labels: None

    Description

      There are three persist misuses in mllib.clustering.BisectingKMeans.run.

      • First, the RDD input should be persisted: it is used not only by the action first(), but also by other actions later in the method.
      • Second, the RDD assignments should be persisted. It is passed to summarize() more than once, and each call runs an action on assignments.
      • Third, once the RDD assignments is persisted, persisting the RDD norms becomes unnecessary. norms is an intermediate RDD, and since its child RDD assignments is persisted, there is no need to persist norms as well.
        private[spark] def run(
            input: RDD[Vector],
            instr: Option[Instrumentation]): BisectingKMeansModel = {
          if (input.getStorageLevel == StorageLevel.NONE) {
            logWarning(s"The input RDD ${input.id} is not directly cached, which may hurt performance if"
              + " its parent RDDs are also not cached.")
          }
          // Needs to persist input
          val d = input.map(_.size).first() 
          logInfo(s"Feature dimension: $d.")
          val dMeasure: DistanceMeasure = DistanceMeasure.decodeFromString(this.distanceMeasure)
          // Compute and cache vector norms for fast distance computation.
          val norms = input.map(v => Vectors.norm(v, 2.0)).persist(StorageLevel.MEMORY_AND_DISK)  // Unnecessary persist
          val vectors = input.zip(norms).map { case (x, norm) => new VectorWithNorm(x, norm) }
          var assignments = vectors.map(v => (ROOT_INDEX, v))  // Needs to persist
          var activeClusters = summarize(d, assignments, dMeasure)
      

      This issue is reported by our tool CacheCheck, which dynamically detects persist()/unpersist() API misuses.
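      The recomputation cost behind the three points above can be illustrated without Spark, using a toy model of lazy lineage. Node, collect, persist, and evals below are illustrative stand-ins (not Spark's API): each node recomputes its entire parent chain on every action unless some node in the chain is memoized.

```scala
// Minimal model of lazy RDD lineage (plain Scala, no Spark required).
class Node[T](compute: () => Seq[T]) {
  var evals = 0                                  // times this node's compute ran
  private var cache: Option[Seq[T]] = None
  def collect(): Seq[T] =                        // "action": triggers computation
    cache.getOrElse { evals += 1; compute() }
  def persist(): this.type = {                   // eager memoization for brevity
    cache = Some { evals += 1; compute() }
    this
  }
  def map[U](f: T => U): Node[U] = new Node(() => collect().map(f))
}

// Unpersisted chain, as in the current run(): two actions on
// `assignments` recompute `input` (and every intermediate) twice.
val input = new Node(() => Seq(1.0, 2.0, 3.0))
val assignments = input.map(v => ("root", v))
assignments.collect(); assignments.collect()
println(input.evals)                             // 2: input recomputed per action

// Persisting `assignments` alone: `input` and each intermediate node
// run exactly once, so separately persisting an intermediate (like
// `norms` in the real code) adds nothing.
val input2 = new Node(() => Seq(1.0, 2.0, 3.0))
val assignments2 = input2.map(v => ("root", v)).persist()
assignments2.collect(); assignments2.collect()
println(input2.evals)                            // 1: computed once, then cached
```

      This is why persisting the last RDD in a reused chain covers its parents: in the report's code, norms feeds only into assignments, so caching assignments makes the persist on norms redundant.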


            People

              Assignee: Unassigned
              Reporter: spark_cachecheck IcySanwitch
