Details
- Improvement
- Status: Resolved
- Minor
- Resolution: Won't Fix
- 3.1.0
- None
- None
Description
RDD.sample uses a filter to pick the data we need. This means that if the raw data is very large, we spend a great deal of time reading all of it, even though only a fraction is kept. We could filter the raw partitions first, so that only some of them are read, to speed up sampling. The current implementation of compute is:
override def compute(splitIn: Partition, context: TaskContext): Iterator[U] = {
  val split = splitIn.asInstanceOf[PartitionwiseSampledRDDPartition]
  val thisSampler = sampler.clone
  thisSampler.setSeed(split.seed)
  thisSampler.sample(firstParent[T].iterator(split.prev, context))
}
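To make the proposal concrete, here is a minimal sketch of the idea in plain Scala, without Spark. It is not Spark's API: the object `PartitionSampleSketch`, both methods, and the parameters `partitionFraction` and `elementFraction` are hypothetical names introduced only for illustration, and a `Seq[Seq[T]]` stands in for an RDD's partitions. The first method mirrors the filter-style sampling described above, which reads every element; the second keeps a random subset of partitions first and samples only inside those.

```scala
import scala.util.Random

// Hypothetical stand-in for an RDD: each inner Seq plays the role of one partition.
object PartitionSampleSketch {

  // Filter-style sampling: every element of every partition is read,
  // and each one is kept with probability `fraction`.
  def sampleByFilter[T](partitions: Seq[Seq[T]], fraction: Double, seed: Long): Seq[T] =
    partitions.zipWithIndex.flatMap { case (part, idx) =>
      val rng = new Random(seed + idx) // a distinct seed per partition
      part.filter(_ => rng.nextDouble() < fraction)
    }

  // Proposed idea (assumption): keep each partition with probability
  // `partitionFraction`, then sample elements only inside the kept partitions.
  // Returns the sample and the number of partitions that were actually read.
  def sampleByPartition[T](
      partitions: Seq[Seq[T]],
      partitionFraction: Double,
      elementFraction: Double,
      seed: Long): (Seq[T], Int) = {
    val pick = new Random(seed)
    val kept = partitions.zipWithIndex.filter(_ => pick.nextDouble() < partitionFraction)
    val sampled = kept.flatMap { case (part, idx) =>
      val rng = new Random(seed + idx + 1)
      part.filter(_ => rng.nextDouble() < elementFraction)
    }
    (sampled, kept.size)
  }

  def main(args: Array[String]): Unit = {
    // 100 elements split into 10 partitions of 10.
    val data: Seq[Seq[Int]] = (0 until 100).grouped(10).map(_.toSeq).toSeq
    val (sample, partitionsRead) = sampleByPartition(data, 0.3, 0.5, seed = 42L)
    println(s"read $partitionsRead of ${data.size} partitions, kept ${sample.size} elements")
  }
}
```

Skipping whole partitions saves read time, but the result is no longer equivalent to sampling each element independently: elements in skipped partitions can never appear, so the sample is only representative if the data is distributed evenly across partitions.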