Spark / SPARK-32197

'Spark driver' stays running even though 'spark application' has FAILED

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.6
    • Fix Version/s: None
    • Component/s: Scheduler, Spark Core
    • Labels: None

    Description

      The app failed after 6 minutes, but the driver has been stuck for more than 8 hours. I would expect the driver to fail when the app fails.

       

      A thread dump taken with jstack against the driver PID is attached (j1.out).

      The last part of the driver's stdout log is attached (the full log is 23 MB; the stderr log contains only the launch command).

      The last part of the app log is attached.
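      A JVM stays alive as long as at least one non-daemon thread is running, so one way to read a thread dump like j1.out is to list the threads whose header line lacks the "daemon" flag. The following is a minimal sketch in Python, assuming HotSpot-style jstack headers of the form "name" #N [daemon] prio=...; the function name is hypothetical and the contents of j1.out are not reproduced here.

```python
import re

# Header line of a Java thread in HotSpot jstack output, for example:
#   "main" #1 prio=5 os_prio=0 tid=0x... nid=0x... waiting on condition [...]
#   "some-pool-1" #23 daemon prio=5 os_prio=0 tid=0x... nid=0x... runnable
# JVM-internal threads (e.g. "VM Thread") have no "#N" and are skipped.
THREAD_HEADER = re.compile(r'^"(?P<name>.+)" #\d+ (?P<daemon>daemon )?')


def non_daemon_threads(dump_text):
    """Return the names of Java threads whose header lacks the daemon flag."""
    names = []
    for line in dump_text.splitlines():
        match = THREAD_HEADER.match(line)
        if match and not match.group("daemon"):
            names.append(match.group("name"))
    return names
```

      Calling non_daemon_threads(open("j1.out").read()) would give the candidate threads that may be keeping the driver process alive. A "DestroyJavaVM" entry in the result is expected once the main thread has returned; any other non-daemon thread is worth inspecting.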

       

      The driver log shows that the line "org.apache.spark.util.ShutdownHookManager - Shutdown hook called" never appears after "org.apache.spark.SparkContext - Successfully stopped SparkContext".
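      The same check can be scripted for other driver logs. This is a rough sketch in Python, assuming the log contains the two messages exactly as quoted above; the function name is hypothetical.

```python
STOPPED = "org.apache.spark.SparkContext - Successfully stopped SparkContext"
HOOK = "org.apache.spark.util.ShutdownHookManager - Shutdown hook called"


def shutdown_hook_missing(log_lines):
    """Return True if SparkContext was stopped but no shutdown-hook line follows."""
    stopped_seen = False
    for line in log_lines:
        if STOPPED in line:
            stopped_seen = True
        elif stopped_seen and HOOK in line:
            # The JVM reached its shutdown hooks after the context stopped.
            return False
    # True only when the stop message was seen and the hook message never followed.
    return stopped_seen
```

      For example, shutdown_hook_missing(open("driverlog.txt")) would return True for a log that matches the symptom described here.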

       

      Using Spark 2.4.6 in standalone mode. The app was submitted with spark-submit to the REST API (port 6066) in cluster mode. Other drivers and apps have worked fine with this setup; only this one got stuck. My cluster has one EC2 instance dedicated as the Spark master and one Spot EC2 instance dedicated as the Spark worker. Either can be auto-healed or spot-terminated at any time. According to the AWS logs, the worker was terminated at 01:53:38.
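      Since the driver was submitted through the REST server on port 6066, its state can also be queried there, and a stuck driver can be killed by hand. The sketch below is in Python and assumes the standalone REST server exposes /v1/submissions/status/<id> and /v1/submissions/kill/<id>, and that the status response is JSON with a "driverState" field; this is my understanding of the endpoint layout and should be checked against the Spark version in use. The function names, the host name and the submission ID shown in the usage note are hypothetical.

```python
import json
from urllib.request import Request, urlopen


def status_url(master_host, submission_id, port=6066):
    """Build the status URL for a driver submitted via the REST server."""
    return f"http://{master_host}:{port}/v1/submissions/status/{submission_id}"


def kill_url(master_host, submission_id, port=6066):
    """Build the kill URL for a driver submitted via the REST server."""
    return f"http://{master_host}:{port}/v1/submissions/kill/{submission_id}"


def driver_state(response_body):
    """Extract 'driverState' from a status response body (assumed JSON)."""
    return json.loads(response_body).get("driverState")


def fetch_driver_state(master_host, submission_id, timeout=10):
    """GET the status endpoint and return the reported driver state."""
    url = status_url(master_host, submission_id)
    with urlopen(url, timeout=timeout) as resp:
        return driver_state(resp.read().decode("utf-8"))


def kill_driver(master_host, submission_id, timeout=10):
    """POST to the kill endpoint and return the parsed JSON response."""
    request = Request(kill_url(master_host, submission_id), data=b"", method="POST")
    with urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

      With a hypothetical master host "spark-master" and submission ID "driver-20200101000000-0000", fetch_driver_state("spark-master", "driver-20200101000000-0000") would report whether the master still considers the driver to be running after the app has failed.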

       

      I think this can be reproduced by tearing down the worker machine while the app is running. It may take several attempts.

       

      This is similar to https://issues.apache.org/jira/browse/SPARK-24617, which I raised before.

       

      Attachments

        1. stuckdriver.png
          52 kB
          t oo
        2. j1.out
          62 kB
          t oo
        3. failedapp.png
          72 kB
          t oo
        4. failed1.png
          78 kB
          t oo
        5. failed_stages.png
          103 kB
          t oo
        6. driverlog.txt
          292 kB
          t oo
        7. applog.txt
          2 kB
          t oo
        8. app_executors.png
          46 kB
          t oo

          People

            Assignee: Unassigned
            Reporter: t oo (toopt4)
