  1. Spark
  2. SPARK-29818 Missing persist on RDD
  3. SPARK-29814

Missing persist on sources in mllib.feature.PCA

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.4.3
    • Fix Version/s: None
    • Component/s: MLlib
    • Labels: None

    Description

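      The RDD sources is used by more than one action: first() and the actions inside computePrincipalComponentsAndExplainedVariance(). Unless it is persisted, each action recomputes it from its lineage, so it should be persisted before the first use.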

        def fit(sources: RDD[Vector]): PCAModel = {
          // first use of the RDD sources: the action first()
          val numFeatures = sources.first().size
          require(k <= numFeatures,
            s"source vector size $numFeatures must be no less than k=$k")
          require(PCAUtil.memoryCost(k, numFeatures) < Int.MaxValue,
            "The param k and numFeatures is too large for SVD computation. " +
            "Try reducing the parameter k for PCA, or reduce the input feature " +
            "vector dimension to make this tractable.")
      
          val mat = new RowMatrix(sources)
          // second use of the RDD sources: the actions run by computePrincipalComponentsAndExplainedVariance()
          val (pc, explainedVariance) = mat.computePrincipalComponentsAndExplainedVariance(k)
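
      One way the missing persist could be added is sketched below. This is an illustration only, not the change made in Spark (this issue was resolved as a duplicate). The object PersistSourcesSketch and the helper principalComponents are hypothetical names, and the MEMORY_AND_DISK storage level is an assumption. The sketch persists sources only when the caller has not already chosen a storage level, and unpersists it afterwards.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Matrix, Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PersistSourcesSketch {

  // Hypothetical helper, not part of Spark: runs both uses of `sources`
  // against a persisted RDD and releases the cached blocks afterwards.
  def principalComponents(sources: RDD[Vector], k: Int): (Matrix, Vector) = {
    // Persist only if the caller has not already set a storage level.
    val persistedHere = sources.getStorageLevel == StorageLevel.NONE
    if (persistedHere) sources.persist(StorageLevel.MEMORY_AND_DISK)
    try {
      // First use: the action first().
      val numFeatures = sources.first().size
      require(k <= numFeatures,
        s"source vector size $numFeatures must be no less than k=$k")
      // Second use: the actions run inside RowMatrix.
      new RowMatrix(sources).computePrincipalComponentsAndExplainedVariance(k)
    } finally {
      // Undo only the persist that this helper applied.
      if (persistedHere) sources.unpersist()
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("persist-sketch")
    val sc = new SparkContext(conf)
    try {
      // Small made-up input: four 3-dimensional vectors.
      val sources = sc.parallelize(Seq(
        Vectors.dense(1.0, 2.0, 3.0),
        Vectors.dense(2.0, 4.0, 1.0),
        Vectors.dense(3.0, 1.0, 5.0),
        Vectors.dense(4.0, 3.0, 2.0)))
      val (pc, explainedVariance) = principalComponents(sources, k = 2)
      println(s"pc: ${pc.numRows} x ${pc.numCols}, explained variance: $explainedVariance")
    } finally {
      sc.stop()
    }
  }
}
```

      Checking getStorageLevel first avoids changing a storage level the caller has already set, and the try/finally releases the cached blocks even when one of the require checks fails.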
      

            People

              Assignee: Unassigned
              Reporter: IcySanwitch (spark_cachecheck)
              Votes: 0
              Watchers: 1
