Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 2.4.3
- Fix Version/s: None
- Component/s: None
Description
There are three persist() misuses in mllib.clustering.BisectingKMeans.run.
- First, the RDD input should be persisted: it is used not only by the action first(), but also by other actions later in the code.
- Second, the RDD assignments should be persisted: it is passed to the function summarize() more than once, and summarize() runs an action on assignments.
- Third, once assignments is persisted, persisting the RDD norms becomes unnecessary: norms is only an intermediate RDD, and since its child RDD assignments is persisted, caching norms no longer saves any recomputation.
  private[spark] def run(
      input: RDD[Vector],
      instr: Option[Instrumentation]): BisectingKMeansModel = {
    if (input.getStorageLevel == StorageLevel.NONE) {
      logWarning(s"The input RDD ${input.id} is not directly cached, which may hurt performance if" +
        " its parent RDDs are also not cached.")
    }
    // Needs to persist input
    val d = input.map(_.size).first()
    logInfo(s"Feature dimension: $d.")
    val dMeasure: DistanceMeasure = DistanceMeasure.decodeFromString(this.distanceMeasure)
    // Compute and cache vector norms for fast distance computation.
    val norms = input.map(v => Vectors.norm(v, 2.0)).persist(StorageLevel.MEMORY_AND_DISK) // Unnecessary persist
    val vectors = input.zip(norms).map { case (x, norm) => new VectorWithNorm(x, norm) }
    var assignments = vectors.map(v => (ROOT_INDEX, v)) // Needs to persist
    var activeClusters = summarize(d, assignments, dMeasure)
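The effect of a missing persist can be illustrated outside Spark. The sketch below uses a hypothetical LazySeq stand-in for an RDD (it is not part of Spark's API): each "action" re-runs the full computation unless persist() has cached the result, which is exactly why an RDD reused by several actions, like input or assignments above, should be persisted.

```scala
// Minimal, non-Spark sketch of why an RDD reused by several actions should
// be persisted. `LazySeq` is a hypothetical stand-in for an RDD: each
// "action" re-runs the whole computation unless the result is cached.
class LazySeq[A](compute: () => Seq[A]) {
  var evaluations = 0                          // how many times the "lineage" ran
  private var cached: Option[Seq[A]] = None
  private def materialize(): Seq[A] = cached.getOrElse {
    evaluations += 1
    compute()
  }
  // Analogue of RDD.persist(): compute once and keep the result.
  def persist(): this.type = { cached = Some(materialize()); this }
  def first(): A = materialize().head          // action 1
  def count(): Int = materialize().size        // action 2
}

val unpersisted = new LazySeq(() => Seq(1, 2, 3))
unpersisted.first()
unpersisted.count()
println(s"without persist: ${unpersisted.evaluations} evaluations") // 2

val persisted = new LazySeq(() => Seq(1, 2, 3)).persist()
persisted.first()
persisted.count()
println(s"with persist: ${persisted.evaluations} evaluation")       // 1
```

The same reasoning explains the third point: once the downstream RDD is cached, caching its intermediate parent buys nothing, because the parent's computation is only triggered while materializing the child.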
This issue was reported by our tool CacheCheck, which dynamically detects persist()/unpersist() API misuses.
Issue Links
- duplicates SPARK-29818: "Missing persist on RDD" (Resolved)