  Spark / SPARK-31140

Support Quick sample in RDD

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

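      RDD.sample picks the sampled elements by filtering the parent iterator, so every partition of the raw data is still read in full. If the raw data is very large, most of the time is spent reading it. We could filter the raw partitions first, so that only a subset of them is read, to speed up sampling. The current implementation in PartitionwiseSampledRDD: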

        override def compute(splitIn: Partition, context: TaskContext): Iterator[U] = {
          val split = splitIn.asInstanceOf[PartitionwiseSampledRDDPartition]
          val thisSampler = sampler.clone
          thisSampler.setSeed(split.seed)
          thisSampler.sample(firstParent[T].iterator(split.prev, context))
        }
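
The idea in the description could be sketched as follows, assuming Spark is on the classpath. This is only an illustration, not the change proposed in the issue and not part of Spark: the object `QuickSample`, the helpers `choosePartitions` and `quickSample`, and the parameter `partitionFraction` are hypothetical names. The sketch first keeps a random subset of partitions with `PartitionPruningRDD.create` (a developer API), then calls the existing `RDD.sample` on what remains, with the per-element fraction scaled up so the expected sample size stays roughly the same.

```scala
import org.apache.spark.rdd.{PartitionPruningRDD, RDD}
import scala.util.Random

// Hypothetical helper object; none of these names exist in Spark itself.
object QuickSample {

  /** Keeps each partition index independently with probability
    * `partitionFraction`. Deterministic for a given seed. */
  def choosePartitions(numPartitions: Int,
                       partitionFraction: Double,
                       seed: Long): Set[Int] = {
    require(partitionFraction > 0.0 && partitionFraction <= 1.0,
      "partitionFraction must be in (0, 1]")
    val rng = new Random(seed)
    (0 until numPartitions).filter(_ => rng.nextDouble() < partitionFraction).toSet
  }

  /** Samples roughly `fraction` of the elements while reading only the
    * partitions selected by `choosePartitions`. */
  def quickSample[T](rdd: RDD[T],
                     fraction: Double,
                     partitionFraction: Double,
                     seed: Long): RDD[T] = {
    require(fraction >= 0.0 && fraction <= partitionFraction,
      "fraction must be in [0, partitionFraction]")
    val keep = choosePartitions(rdd.getNumPartitions, partitionFraction, seed)
    // Partitions that are pruned here are never computed, hence never read.
    val pruned = PartitionPruningRDD.create(rdd, keep.contains)
    // Scale the per-element fraction to compensate for the dropped partitions.
    pruned.sample(withReplacement = false, fraction / partitionFraction, seed)
  }
}
```

A call such as `QuickSample.quickSample(rdd, 0.01, 0.1, 42L)` would read about a tenth of the partitions and keep about a tenth of their elements. Note the trade-off: elements from the same partition are now included or excluded together, so the result is a cluster sample rather than the independent per-element sample that `RDD.sample` produces, and it can be badly skewed when data is not randomly distributed across partitions.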
      

          People

            Assignee: Unassigned
            Reporter: Deshan Xiao (deshanxiao)
            Votes: 0
            Watchers: 3