SPARK-24617

Spark driver not requesting another executor once original executor exits due to 'lost worker'


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.1.1
    • Fix Version/s: None
    • Component/s: Scheduler, Spark Core

    Description


      I am running Spark v2.1.1 in standalone mode (no YARN/Mesos) across EC2 instances. One master EC2 instance acts as the driver (spark-submit is called on this host), spark.master is configured, and the deploy mode is client, so spark-submit only returns an exit code to the PuTTY window once it finishes processing. One worker EC2 instance is registered with the Spark master. When I run spark-submit on the master, I can see in the web UI that executors start on the worker, and I can verify successful completion.

      However, if the worker EC2 instance is terminated while spark-submit is running, and a new worker EC2 instance comes up about 3 minutes later and registers with the master, the web UI shows 'cannot find address' in the executor status, and the driver either waits forever (I killed it after 2 days) or, in some cases, only allocates tasks to the new worker about 5 hours later and then completes. Is there some setting I am missing that would explain this behavior?
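
      For reference, below is a minimal sketch of a driver program that matches the setup above (standalone master, client deploy mode, a long-running job on a single worker). The object name, master URL, and workload are placeholders assumed for illustration, not the actual application:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of a long-running job matching the setup described above.
// The master URL and workload are placeholders; the app is launched via
// spark-submit with --deploy-mode client against the standalone master.
object LostWorkerRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lost-worker-repro")
      .master("spark://master-ec2:7077") // standalone master, no YARN/Mesos
      .getOrCreate()

    // Keep tasks running long enough that the worker EC2 can be
    // terminated (and replaced) while the job is still in progress.
    val sum = spark.sparkContext
      .parallelize(1L to 1000000L, numSlices = 200)
      .map { x => Thread.sleep(10); x }
      .reduce(_ + _)

    println(s"sum = $sum")
    spark.stop()
  }
}
```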


People

    Assignee: Unassigned
    Reporter: t oo (toopt4)
    Votes: 0
    Watchers: 1
