[SPARK-29823] Improper persist strategy in mllib.clustering.KMeans.run() - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.4.3
Fix Version/s: 3.0.0
Component/s: MLlib
Labels:
None

Description

In mllib.clustering.KMeans.run(), the rdd norms is persisted. But norms only has a single child, i.e., the rdd zippedData which was not persisted. So all the actions that reply on norms also reply on zippedData. The rdd zippedData will be used by multiple times in runAlgorithm(). Therefore zippedData should be persisted, not norms.

  private[spark] def run(
      data: RDD[Vector],
      instr: Option[Instrumentation]): KMeansModel = {
    if (data.getStorageLevel == StorageLevel.NONE) {
      logWarning("The input data is not directly cached, which may hurt performance if its"
        + " parent RDDs are also uncached.")
    }
    // Compute squared norms and cache them.
    val norms = data.map(Vectors.norm(_, 2.0))
    norms.persist() // Unnecessary persist. Only used to generate zippedData.
    val zippedData = data.zip(norms).map { case (v, norm) =>
      new VectorWithNorm(v, norm)
    } // needs to persist
    val model = runAlgorithm(zippedData, instr)
    norms.unpersist() // Change to zippedData.unpersist()

This issue is reported by our tool CacheCheck, which is used to dynamically detecting persist()/unpersist() api misuses.

Attachments

Issue Links

links to

GitHub Pull Request #26483

Activity

People

Assignee:: Aman Omer

Reporter:: IcySanwitch

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 10/Nov/19 11:57

Updated:: 30/Dec/19 15:20

Resolved:: 13/Nov/19 14:16