SPARK-29813: Missing persist in mllib.PrefixSpan.findFrequentItems()

Parent: SPARK-29818 Missing persist on RDD


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.4.3
    • Fix Version/s: None
    • Component/s: MLlib
    • Labels: None

    Description

      This piece of code triggers more than one Spark job on the same lineage: sortBy internally runs a job to sample the data for its range partitioner, and collect runs the final job. Because the intermediate RDD built by flatMap/reduceByKey/filter is not persisted, part of that lineage is recomputed across these jobs.

        private[fpm] def findFrequentItems[Item: ClassTag](
            data: RDD[Array[Array[Item]]],
            minCount: Long): Array[Item] = {
      
          data.flatMap { itemsets =>
            val uniqItems = mutable.Set.empty[Item]
            itemsets.foreach(set => uniqItems ++= set)
            uniqItems.toIterator.map((_, 1L))
          }.reduceByKey(_ + _).filter { case (_, count) =>
            count >= minCount
          }.sortBy(-_._2).map(_._1).collect()
        }
      

      This issue is reported by our tool CacheCheck, which dynamically detects persist()/unpersist() API misuses.
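
      For illustration, here is a minimal sketch of the kind of fix this report points at: persisting the intermediate RDD before it is consumed by sortBy and collect, and unpersisting it afterwards. The body mirrors findFrequentItems above, but the enclosing object name FrequentItemsSketch is hypothetical, the package-private modifier is dropped so the sketch stands alone, and the choice of StorageLevel.MEMORY_AND_DISK and the unpersist placement are assumptions, not necessarily the change adopted in Spark.

        import scala.collection.mutable
        import scala.reflect.ClassTag

        import org.apache.spark.rdd.RDD
        import org.apache.spark.storage.StorageLevel

        object FrequentItemsSketch {
          def findFrequentItems[Item: ClassTag](
              data: RDD[Array[Array[Item]]],
              minCount: Long): Array[Item] = {
            // Same pipeline as above, but the reduced/filtered RDD is cached so
            // that the sampling job launched inside sortBy and the final
            // collect() do not both recompute the flatMap/reduceByKey stages.
            val freqItems = data.flatMap { itemsets =>
              val uniqItems = mutable.Set.empty[Item]
              itemsets.foreach(set => uniqItems ++= set)
              uniqItems.toIterator.map((_, 1L))
            }.reduceByKey(_ + _).filter { case (_, count) =>
              count >= minCount
            }.persist(StorageLevel.MEMORY_AND_DISK) // assumed storage level

            val result = freqItems.sortBy(-_._2).map(_._1).collect()
            freqItems.unpersist() // release cached partitions once results are collected
            result
          }
        }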

People

    • Assignee: Unassigned
    • Reporter: spark_cachecheck (IcySanwitch)
    • Votes: 0
    • Watchers: 1
