[SPARK-31140] Support Quick sample in RDD - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Won't Fix
Affects Version/s: 3.1.0
Fix Version/s: None
Component/s: Spark Core
Labels: None

Description

RDD.sample uses a filter to pick out the data we need. This means that if the raw data is very large, we spend a long time reading all of it, because every element of every partition is scanned. We could filter the raw partitions first, so that only a subset of them is read, to speed up sampling.

  override def compute(splitIn: Partition, context: TaskContext): Iterator[U] = {
    val split = splitIn.asInstanceOf[PartitionwiseSampledRDDPartition]
    val thisSampler = sampler.clone
    thisSampler.setSeed(split.seed)
    thisSampler.sample(firstParent[T].iterator(split.prev, context))
  }
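
The proposal can be illustrated with a minimal sketch in plain Scala, without Spark. It is not the reporter's patch and not Spark code: `QuickSampleSketch`, `filterSample`, `quickSample`, `partitionFraction` and `elementFraction` are hypothetical names chosen for this example, and a nested `Seq` stands in for an RDD's partitions. It only contrasts how many elements each approach has to read.

```scala
import scala.util.Random

object QuickSampleSketch {
  // Hypothetical stand-in for an RDD: each inner Seq is one partition.
  type Partitions[T] = Seq[Seq[T]]

  // Filter-style sampling: every element of every partition is read and
  // kept independently with probability `fraction`.
  // Returns the sample together with the number of elements read.
  def filterSample[T](data: Partitions[T], fraction: Double, seed: Long): (Seq[T], Long) = {
    var read = 0L
    val sample = data.zipWithIndex.flatMap { case (part, idx) =>
      val rng = new Random(seed + idx) // one seed per partition
      part.filter { _ => read += 1; rng.nextDouble() < fraction }
    }
    (sample, read)
  }

  // Partition-first sampling (assumed reading of the proposal): keep each
  // partition with probability `partitionFraction`, then filter-sample only
  // the elements of the partitions that were kept.
  def quickSample[T](
      data: Partitions[T],
      partitionFraction: Double,
      elementFraction: Double,
      seed: Long): (Seq[T], Long) = {
    val rng = new Random(seed)
    val chosen = data.filter(_ => rng.nextDouble() < partitionFraction)
    filterSample(chosen, elementFraction, seed + 1)
  }

  def main(args: Array[String]): Unit = {
    // 100 partitions of 1000 elements each.
    val data = Seq.tabulate(100)(p => Seq.tabulate(1000)(i => p * 1000 + i))
    val (_, readAll) = filterSample(data, 0.01, 42L)
    val (_, readQuick) = quickSample(data, 0.1, 0.1, 42L)
    println(s"filter-style read $readAll elements, partition-first read $readQuick")
  }
}
```

The trade-off is that the partition-first variant is a form of cluster sampling: elements in the skipped partitions can never appear in the result, so the sample is not statistically equivalent to the one RDD.sample produces unless the data is distributed randomly across partitions.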

People

Assignee: Unassigned
Reporter: Deshan Xiao
Votes: 0
Watchers: 3

Dates

Created: 13/Mar/20 03:03
Updated: 12/Dec/22 18:10
Resolved: 23/Mar/20 06:08