Details
- Improvement
- Status: Resolved
- Major
- Resolution: Fixed
- 2.4.0
- None
Description
I just took a look at a flaky failure in SchedulerIntegrationSuite (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95887), and it seems the timeout really is too short:
18/09/10 11:14:07.821 mock backend thread INFO TaskSetManager: Starting task 5.0 in stage 1.0 (TID 8, localhost, executor driver, partition 5, PROCESS_LOCAL, 7677 bytes)
18/09/10 11:14:07.821 task-result-getter-2 INFO TaskSetManager: Finished task 3.0 in stage 1.0 (TID 6) in 1 ms on localhost (executor driver) (4/10)
18/09/10 11:14:07.821 task-result-getter-0 INFO TaskSetManager: Finished task 4.0 in stage 1.0 (TID 7) in 1 ms on localhost (executor driver) (5/10)
18/09/10 11:14:07.821 mock backend thread INFO TaskSetManager: Starting task 6.0 in stage 1.0 (TID 9, localhost, executor driver, partition 6, PROCESS_LOCAL, 7677 bytes)
18/09/10 11:14:07.821 task-result-getter-1 INFO TaskSetManager: Finished task 5.0 in stage 1.0 (TID 8) in 0 ms on localhost (executor driver) (6/10)
18/09/10 11:14:09.481 mock backend thread INFO TaskSetManager: Starting task 7.0 in stage 1.0 (TID 10, localhost, executor driver, partition 7, PROCESS_LOCAL, 7677 bytes)
18/09/10 11:14:09.482 dispatcher-event-loop-14 INFO BlockManagerInfo: Removed broadcast_0_piece0 on amp-jenkins-worker-05.amp:36913 in memory (size: 1260.0 B, free: 1638.6 MB)
You'll see that the "mock backend thread" does keep making progress, but for whatever reason there is a delay of more than one second in the middle (between the entries at 11:14:07.821 and 11:14:09.481). That alone already exceeds the existing timeouts.
It's possible that something else is going on here, but for now, just increasing the timeouts seems like the best next step.
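To put a number on the stall, here is a small sketch that computes the gap between the two consecutive "mock backend thread" entries in the log above. It only does arithmetic on the two timestamps copied from the log; the helper name `gap_seconds` is made up for this illustration and is not part of Spark or the test suite.

```python
from datetime import datetime

# Timestamp layout used in the log excerpt: yy/MM/dd HH:mm:ss.SSS
LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S.%f"


def gap_seconds(earlier: str, later: str) -> float:
    """Return the elapsed seconds between two log timestamps (hypothetical helper)."""
    start = datetime.strptime(earlier, LOG_TIME_FORMAT)
    end = datetime.strptime(later, LOG_TIME_FORMAT)
    return (end - start).total_seconds()


# "Starting task 6.0" and "Starting task 7.0" on the mock backend thread
delay = gap_seconds("18/09/10 11:14:07.821", "18/09/10 11:14:09.481")
print(f"{delay:.3f} s")  # prints "1.660 s"
```

The gap comes out at about 1.66 seconds, which matches the "over one second" delay described above.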
Issue Links
- is related to: SPARK-43587 Run HealthTrackerIntegrationSuite in a dedicate JVM (Resolved)