[SPARK-32735] RDD actions in DStream.transform don't show at batch page - ASF JIRA

Details

Type: Bug
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.0.0
Fix Version/s: None
Component/s: DStreams, Web UI
Labels: None

Docs Text:
Fix RDD actions in DStream.transform not showing on the batch page

Description

Issue

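// Assumes an existing StreamingContext named ssc.
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val mappedStream = words.transform { rdd =>
  val c = rdd.count() // RDD action inside transform: starts the first job of each batch
  rdd.map(x => s"$c x")
}
// Output operation: starts the second job of each batch
mappedStream.foreachRDD(rdd => rdd.foreach(x => println(x)))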

Each batch creates two Spark jobs, but only the second is associated with the streaming output operation and appears on the batch page. The first job, started by rdd.count() inside transform, is missing from that page.

Investigation

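The first action, rdd.count(), is invoked by JobGenerator.generateJobs while the jobs for the batch are still being generated. At that point the batch time and output op id are not yet available as local properties in the Spark context, because JobScheduler sets them only later, when it runs the generated job. The job started by rdd.count() therefore carries neither value and cannot be matched to a batch.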
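The ordering problem described above can be illustrated with a small toy model. This is a sketch in plain Scala, not Spark code: LocalPropsModel, generateJob, schedule, runAction and the property names "batchTime" and "outputOpId" are all hypothetical stand-ins chosen for illustration, and a ThreadLocal map stands in for the Spark context's local properties.

```scala
import scala.collection.mutable

// Toy model only: none of these names are Spark APIs.
object LocalPropsModel {
  // Stand-in for the Spark context's local properties.
  private val localProps = new ThreadLocal[Map[String, String]] {
    override def initialValue(): Map[String, String] = Map.empty
  }

  // Each "job" records the properties that were visible when its action ran.
  val submitted = mutable.ArrayBuffer.empty[(String, Map[String, String])]

  def runAction(name: String): Unit =
    submitted.append((name, localProps.get()))

  // Stand-in for job generation: the transform function runs eagerly here,
  // while the output operation is only returned for later execution.
  def generateJob(transformFn: () => Unit, outputFn: () => Unit): () => Unit = {
    transformFn() // the count fires now, before any property has been set
    outputFn
  }

  // Stand-in for the scheduler: set the properties, then run the deferred job.
  def schedule(job: () => Unit, batchTime: Long, outputOpId: Int): Unit = {
    localProps.set(Map(
      "batchTime" -> batchTime.toString,
      "outputOpId" -> outputOpId.toString))
    try job() finally localProps.remove()
  }

  def main(args: Array[String]): Unit = {
    val job = generateJob(() => runAction("count"), () => runAction("foreach"))
    schedule(job, batchTime = 1000L, outputOpId = 0)
    submitted.foreach { case (name, props) => println(s"$name -> $props") }
  }
}
```

In this model the "count" entry is recorded with an empty property map, while the "foreach" entry sees both properties, which mirrors why only the second job can be linked to a batch.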

Proposal

Delegate dstream.getOrCompute to JobScheduler so that all RDD actions run with the correct local properties set in the Spark context.
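The idea behind the proposal can be sketched with a toy model in plain Scala. This is not the implementation in the linked pull request and not Spark code: DeferredGenerationModel, schedule, runAction and the property names are hypothetical, and a thunk stands in for the call to dstream.getOrCompute. The point is only that evaluating the generation step after the properties are set lets every action see them.

```scala
import scala.collection.mutable

// Toy model of the proposal only: none of these names are Spark APIs.
object DeferredGenerationModel {
  // Stand-in for the Spark context's local properties.
  private val localProps = new ThreadLocal[Map[String, String]] {
    override def initialValue(): Map[String, String] = Map.empty
  }

  // Each "job" records the properties that were visible when its action ran.
  val submitted = mutable.ArrayBuffer.empty[(String, Map[String, String])]

  def runAction(name: String): Unit =
    submitted.append((name, localProps.get()))

  // The scheduler receives the unevaluated generation step and runs it only
  // after the properties have been set.
  def schedule(generate: () => (() => Unit), batchTime: Long, outputOpId: Int): Unit = {
    localProps.set(Map(
      "batchTime" -> batchTime.toString,
      "outputOpId" -> outputOpId.toString))
    try {
      val outputOp = generate() // the transform function and its count run here
      outputOp()                // then the output operation runs
    } finally localProps.remove()
  }

  def main(args: Array[String]): Unit = {
    val generate: () => (() => Unit) = () => {
      runAction("count")
      () => runAction("foreach")
    }
    schedule(generate, batchTime = 1000L, outputOpId = 0)
    submitted.foreach { case (name, props) => println(s"$name -> $props") }
  }
}
```

Here both the "count" and "foreach" entries are recorded with the batch time and output op id present.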

Issue Links

links to

[Github] Pull Request #29578 (Olwn)

People

Assignee: Unassigned

Reporter: Liechuan Ou

Votes: 0

Watchers: 2

Dates

Created: 29/Aug/20 14:43

Updated: 03/Sep/20 00:03