Spark / SPARK-42923

Delayed scheduling does not work in some situations in local mode when loaded files have different localities, leaving tasks stuck

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.3.2
    • Fix Version/s: None
    • Component/s: Scheduler
    • Labels: None

    Description

      I ran into the following issue when running Spark in local mode, where some of the loaded files were located on the same host as Spark and others were not.

      Symptom: Some tasks in a larger job would consistently get stuck, with no immediately clear errors in the logs. My hope/expectation was that even if some tasks failed to complete within some expected time, the job would either retry the task or fail outright with an exception, rather than hang forever.

      Workaround:
      Setting spark.locality.wait.node to 0s seemed to stop tasks from getting stuck in my environment.
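
      For reference, one way to apply this workaround when building a local session is sketched below. The object name and application name are made up for illustration, and whether 0s is an acceptable value depends on the workload, since it disables the node-level locality wait entirely.

      ```scala
      import org.apache.spark.SparkConf
      import org.apache.spark.sql.SparkSession

      // Hypothetical example object; only the config key and value come from the report.
      object LocalityWaitWorkaround {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf()
            .setMaster("local[*]")
            .setAppName("locality-wait-workaround") // illustrative name
            // Do not wait for a node-local slot before falling back to a less local one.
            .set("spark.locality.wait.node", "0s")

          val spark = SparkSession.builder().config(conf).getOrCreate()
          try {
            // Run the job that previously got stuck here.
            println(spark.conf.get("spark.locality.wait.node"))
          } finally {
            spark.stop()
          }
        }
      }
      ```

      The same setting can also be passed on the command line with --conf spark.locality.wait.node=0s.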

      Potential root cause:
      I managed to reproduce the issue in the Spark codebase by adding a test case to FileSourceStrategySuite. The test tries to read two files into a table, where one file is located on the same host as the local Spark executor and the other on a different host. https://github.com/apache/spark/commit/c23db78863c7342ae7b7bc3922a200a523e45538

      While digging into the issue with the debugger, I finally noticed that LocalSchedulerBackend is missing the reviveThread present in CoarseGrainedSchedulerBackend. That thread forces resourceOffers in TaskSchedulerImpl to be called periodically, and not just on task updates.

      Potential fix:
      Add the revive thread to LocalSchedulerBackend as well.
      I don’t understand the codebase well enough to know whether simply adding the revive thread to LocalSchedulerBackend could have unwanted side effects.
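
      To make the idea concrete, here is a rough, standalone sketch of what a periodic revive thread does. It is not Spark's actual code: PeriodicReviver, its reviveOffers callback, the thread name, and the interval are all hypothetical stand-ins, and the real LocalSchedulerBackend would need to route the call through its own endpoint.

      ```scala
      import java.util.concurrent.{Executors, ScheduledExecutorService, ThreadFactory, TimeUnit}
      import java.util.concurrent.atomic.AtomicInteger
      import scala.util.control.NonFatal

      // Hypothetical helper, not a Spark class: calls `reviveOffers` at a fixed interval
      // so that scheduling is re-attempted even when no task update arrives.
      final class PeriodicReviver(reviveOffers: () => Unit, intervalMs: Long) {

        // Daemon thread so the reviver never keeps the JVM alive on its own.
        private val daemonFactory: ThreadFactory = new ThreadFactory {
          override def newThread(r: Runnable): Thread = {
            val t = new Thread(r, "local-revive-thread") // illustrative name
            t.setDaemon(true)
            t
          }
        }

        private val reviveThread: ScheduledExecutorService =
          Executors.newSingleThreadScheduledExecutor(daemonFactory)

        def start(): Unit = {
          val task = new Runnable {
            override def run(): Unit = {
              // An uncaught exception would cancel all later runs, so swallow non-fatal ones.
              try reviveOffers()
              catch { case NonFatal(_) => () }
            }
          }
          reviveThread.scheduleAtFixedRate(task, 0L, intervalMs, TimeUnit.MILLISECONDS)
          ()
        }

        def stop(): Unit = {
          reviveThread.shutdownNow()
          ()
        }
      }

      object PeriodicReviverDemo {
        def main(args: Array[String]): Unit = {
          val calls = new AtomicInteger(0)
          // In a real backend the callback would trigger a new round of resource offers.
          val reviver = new PeriodicReviver(() => { calls.incrementAndGet(); () }, intervalMs = 50L)
          reviver.start()
          Thread.sleep(300L)
          reviver.stop()
          println(s"reviveOffers was invoked ${calls.get()} times")
        }
      }
      ```

      The point of the sketch is only that a timer-driven call gives the scheduler a chance to notice that a locality wait has expired even when nothing else triggers a new offer. It says nothing about the side effects asked about above.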

      Questions/Observations:
      Should delayed scheduling work at all in local mode?
      This issue probably also affects the case where, instead of a local file, one of the loaded files is rack-local to the executor and another is not.

          People

            Assignee: Unassigned
            Reporter: Juho Salmio (dolmio)