Spark / SPARK-32197

'Spark driver' stays running even though 'spark application' has FAILED

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.6
    • Fix Version/s: None
    • Component/s: Scheduler, Spark Core
    • Labels: None

    Description

      The application failed after 6 minutes, but the driver has been stuck for more than 8 hours. I would expect the driver to fail when the application fails.

       

      A thread dump from jstack (taken on the driver PID) is attached (j1.out).

      The last part of the driver stdout log is attached (driverlog.txt). The full log is 23 MB, and the stderr log contains only the launch command.

      The last part of the application log is attached (applog.txt).
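
For anyone triaging the thread dump, a minimal Python sketch for listing the non-daemon Java threads in a jstack output such as j1.out. It assumes the usual HotSpot (Java 8) header format, where a daemon thread carries the word "daemon" right after its "#<id>" field. A JVM only exits on its own once every non-daemon thread has finished, so these are the candidates for what keeps the driver process alive. The function name is hypothetical and this is not part of Spark.

```python
import re

# Header line of a Java thread in HotSpot jstack output, for example:
#   "main" #1 prio=5 os_prio=0 tid=0x... nid=0x... waiting on condition [...]
#   "Signal Dispatcher" #4 daemon prio=9 os_prio=0 tid=0x... nid=0x... runnable
# JVM-internal threads such as "VM Thread" have no "#<id>" field and are skipped.
THREAD_HEADER = re.compile(r'^"(?P<name>.+?)"\s+#\d+\s+(?P<daemon>daemon\s+)?')


def non_daemon_threads(dump_text):
    """Return the names of Java threads whose header lacks the 'daemon' flag."""
    names = []
    for line in dump_text.splitlines():
        match = THREAD_HEADER.match(line)
        if match and match.group("daemon") is None:
            names.append(match.group("name"))
    return names
```

Calling non_daemon_threads(open("j1.out").read()) would print the thread names worth inspecting first. Whether any of them is the actual cause here is not established by this report.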

       

      The line "org.apache.spark.util.ShutdownHookManager - Shutdown hook called" never appears in the driver log after "org.apache.spark.SparkContext - Successfully stopped SparkContext".
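
The symptom above can be checked mechanically. Below is a minimal Python sketch, assuming the driver log is plain text with one entry per line; the two marker strings are copied from this report, while the function name is hypothetical.

```python
# Marker strings as quoted in this report.
STOPPED = "org.apache.spark.SparkContext - Successfully stopped SparkContext"
HOOK = "org.apache.spark.util.ShutdownHookManager - Shutdown hook called"


def shutdown_hook_missing(log_lines):
    """Return True if SparkContext was stopped but no shutdown-hook line follows."""
    stopped_seen = False
    for line in log_lines:
        if STOPPED in line:
            stopped_seen = True
        elif stopped_seen and HOOK in line:
            # The JVM began shutting down after the context stopped.
            return False
    return stopped_seen
```

Passing an open file handle for driverlog.txt as log_lines should return True for the stuck driver described here, and False for a driver that shut down normally.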

       

      I am using Spark 2.4.6 in standalone mode. The job was submitted with spark-submit to the REST API (port 6066) in cluster mode. Other drivers and apps have worked fine with this setup; only this one got stuck. My cluster has one EC2 instance dedicated as the Spark master and one Spot EC2 instance dedicated as the Spark worker. Either can be auto-healed or spot-terminated at any time. According to the AWS logs, the worker was terminated at 01:53:38.
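
As a possible way to spot a stuck driver from outside, here is a minimal Python sketch that asks the standalone master's REST submission server for a driver's state. It assumes the status endpoint path /v1/submissions/status/<submission id> on port 6066 and a "driverState" field in the JSON response; both should be verified against the Spark version in use. The host name and submission id are placeholders supplied by the caller, and the function names are hypothetical.

```python
import json
import urllib.request


def status_url(master_host, submission_id, port=6066):
    """Build the status URL (path is an assumption, check your Spark version)."""
    return f"http://{master_host}:{port}/v1/submissions/status/{submission_id}"


def fetch_status(master_host, submission_id, port=6066, timeout=10):
    """Fetch and decode the JSON status document for one driver submission."""
    url = status_url(master_host, submission_id, port)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def driver_still_running(status):
    """Return True if the master reports the driver as RUNNING."""
    return status.get("driverState") == "RUNNING"
```

An external watchdog could combine driver_still_running(fetch_status(...)) with its own knowledge that the application has failed, and flag the driver for manual cleanup. That is a workaround, not a fix for the behaviour reported here.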

       

      I think you can reproduce this by tearing down the worker machine while an app is running. You may have to try several times.

       

      Similar to https://issues.apache.org/jira/browse/SPARK-24617, which I raised before.

       

      Attachments

        1. applog.txt, 2 kB, t oo
        2. j1.out, 62 kB, t oo
        3. driverlog.txt, 292 kB, t oo
        4. failedapp.png, 72 kB, t oo
        5. stuckdriver.png, 52 kB, t oo
        6. failed1.png, 78 kB, t oo
        7. app_executors.png, 46 kB, t oo
        8. failed_stages.png, 103 kB, t oo

            People

              Assignee: Unassigned
              Reporter: t oo (toopt4)
              Votes: 0
              Watchers: 4