  1. Spark
  2. SPARK-29818 Missing persist on RDD
  3. SPARK-29812

Missing persist on predictionAndLabels in MulticlassClassificationEvaluator

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.4.3
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None

    Description

      The RDD predictionAndLabels in ml.evaluation.MulticlassClassificationEvaluator.evaluate() needs to be persisted. When MulticlassMetrics uses predictionAndLabels to initialize its fields, at least five actions are executed on predictionAndLabels, and each of them recomputes the RDD from the dataset unless it has been persisted.

        override def evaluate(dataset: Dataset[_]): Double = {
          val schema = dataset.schema
          SchemaUtils.checkColumnType(schema, $(predictionCol), DoubleType)
          SchemaUtils.checkNumericType(schema, $(labelCol))
          // Needs to be persisted
          val predictionAndLabels =
            dataset.select(col($(predictionCol)), col($(labelCol)).cast(DoubleType)).rdd.map {
              case Row(prediction: Double, label: Double) => (prediction, label)
            }
          // The initialization uses predictionAndLabels multiple times, in different actions.
          val metrics = new MulticlassMetrics(predictionAndLabels)
          val metric = $(metricName) match {
            case "f1" => metrics.weightedFMeasure
            case "weightedPrecision" => metrics.weightedPrecision
            case "weightedRecall" => metrics.weightedRecall
            case "accuracy" => metrics.accuracy
          }
          metric
        }
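
      A minimal sketch of how the persist could be added, assuming Spark is on the classpath and a local master is acceptable for the demonstration. The object PersistSketch and the helpers metricsWithPersist and demo are hypothetical names used only for illustration; they are not part of Spark and not necessarily how the issue was resolved upstream. The RDD is persisted only if the caller has not already done so, and it is released again once the metrics have been read.

```scala
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistSketch {

  // Hypothetical helper (not a Spark API): computes two metrics while
  // keeping predictionAndLabels cached across the underlying actions.
  def metricsWithPersist(predictionAndLabels: RDD[(Double, Double)]): (Double, Double) = {
    // Only manage persistence if the caller has not persisted the RDD already.
    val handlePersistence = predictionAndLabels.getStorageLevel == StorageLevel.NONE
    if (handlePersistence) predictionAndLabels.persist(StorageLevel.MEMORY_AND_DISK)
    try {
      val metrics = new MulticlassMetrics(predictionAndLabels)
      // Both values are read while the RDD is still persisted.
      (metrics.accuracy, metrics.weightedFMeasure)
    } finally {
      // Release the cached blocks only if this method created them.
      if (handlePersistence) predictionAndLabels.unpersist()
    }
  }

  // Hypothetical driver: runs the helper on a tiny in-memory data set.
  def demo(): (Double, Double) = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local mode, for illustration only
      .appName("persist-sketch")
      .getOrCreate()
    try {
      // (prediction, label) pairs; three of the four predictions are correct.
      val rdd = spark.sparkContext.parallelize(
        Seq((0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (2.0, 2.0)), numSlices = 2)
      metricsWithPersist(rdd)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

      Checking getStorageLevel first avoids overriding a storage level the caller chose, and the try/finally ensures the cached blocks are released even if metric computation fails.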
      

      This issue was reported by our tool CacheCheck, which dynamically detects misuses of the persist()/unpersist() API.

          People

            Assignee: Unassigned
            Reporter: IcySanwitch (spark_cachecheck)
            Votes: 0
            Watchers: 2
