Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 2.4.3
- Fix Version/s: None
- Component/s: None
Description
There are three persist() misuses in mllib.clustering.BisectingKMeans.run.
- First, the RDD input should be persisted: it is used not only by the action first(), but also by other actions later in the code.
- Second, the RDD assignments should be persisted: it is passed to the function summarize() more than once, and summarize() runs an action on assignments.
- Third, once assignments is persisted, persisting the RDD norms becomes unnecessary: norms is only an intermediate RDD, and since its child RDD assignments is persisted, caching norms no longer saves any recomputation.
  private[spark] def run(
      input: RDD[Vector],
      instr: Option[Instrumentation]): BisectingKMeansModel = {
    if (input.getStorageLevel == StorageLevel.NONE) {
      logWarning(s"The input RDD ${input.id} is not directly cached, which may hurt performance if" +
        " its parent RDDs are also not cached.")
    }
    // Needs to persist input
    val d = input.map(_.size).first()
    logInfo(s"Feature dimension: $d.")
    val dMeasure: DistanceMeasure = DistanceMeasure.decodeFromString(this.distanceMeasure)
    // Compute and cache vector norms for fast distance computation.
    val norms = input.map(v => Vectors.norm(v, 2.0)).persist(StorageLevel.MEMORY_AND_DISK) // Unnecessary persist
    val vectors = input.zip(norms).map { case (x, norm) => new VectorWithNorm(x, norm) }
    var assignments = vectors.map(v => (ROOT_INDEX, v)) // Needs to persist
    var activeClusters = summarize(d, assignments, dMeasure)
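The effect of a missing persist can be illustrated outside Spark. The sketch below uses a hypothetical LazySeq stand-in for an RDD (it is not part of Spark's API): each "action" re-runs the full computation unless persist() has cached the result, which is exactly why an RDD reused by several actions, like input or assignments above, should be persisted.

```scala
// Minimal, non-Spark sketch of why an RDD reused by several actions should
// be persisted. `LazySeq` is a hypothetical stand-in for an RDD: each
// "action" re-runs the whole computation unless the result is cached.
class LazySeq[A](compute: () => Seq[A]) {
  var evaluations = 0                          // how many times the "lineage" ran
  private var cached: Option[Seq[A]] = None
  private def materialize(): Seq[A] = cached.getOrElse {
    evaluations += 1
    compute()
  }
  // Analogue of RDD.persist(): compute once and keep the result.
  def persist(): this.type = { cached = Some(materialize()); this }
  def first(): A = materialize().head          // action 1
  def count(): Int = materialize().size        // action 2
}

val unpersisted = new LazySeq(() => Seq(1, 2, 3))
unpersisted.first()
unpersisted.count()
println(s"without persist: ${unpersisted.evaluations} evaluations") // 2

val persisted = new LazySeq(() => Seq(1, 2, 3)).persist()
persisted.first()
persisted.count()
println(s"with persist: ${persisted.evaluations} evaluation")       // 1
```

The same reasoning explains the third point: once the downstream RDD is cached, caching its intermediate parent buys nothing, because the parent's computation is only triggered while materializing the child.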
This issue was reported by our tool CacheCheck, which dynamically detects persist()/unpersist() API misuses.
Issue Links
- duplicates SPARK-29818: "Missing persist on RDD" (Resolved)