Spark / SPARK-47279

spark driver process hangs due to "unable to create new native thread"


Details

    Description

      We observed the Spark driver hang for about 11 hours until it was finally killed by the user. The driver log contains the following error:

      16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error happened while processing message in the inbox for CoarseGrainedScheduler
      java.lang.OutOfMemoryError: unable to create new native thread
              at java.lang.Thread.start0(Native Method)
              at java.lang.Thread.start(Thread.java:719)
              at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
              at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
              at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
              at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769)
              at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745)
              at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144)
              at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
              at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
              at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
              at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
              at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:750)

       

      After detailed analysis, we found that the driver submitted task 0.0 to executor 4 at 16:40:50. Executor 4 finished task 0.0 at 16:42:39 and sent the result to the driver. At that moment, however, the server running the driver did not have sufficient memory, so the driver was "unable to create new native thread" to handle the successful result of task 0.0 (the error above was logged at 16:42:40). The driver therefore treats task 0.0 as unfinished and waits forever for the result it missed.
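The failure mode described above can be reproduced in isolation with a plain `ThreadPoolExecutor`. The following is a minimal sketch, not Spark code: `LostResultSketch`, `FAILING_FACTORY`, and `resultProcessed` are hypothetical names, the `OutOfMemoryError` is simulated by a thread factory rather than by real memory pressure, and the `catch` block that only logs is a simplification, not a claim about how Spark handles the error. It shows that when `execute` fails while creating a worker thread, the submitted task never runs, so anything waiting on its side effect waits until its own timeout (or forever, if it has none).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class LostResultSketch {
    // Hypothetical stand-in for a host that cannot create more native threads.
    static final ThreadFactory FAILING_FACTORY = runnable -> {
        throw new OutOfMemoryError("unable to create new native thread");
    };

    // Submits one "handle task result" job and reports whether it ran
    // within timeoutMillis.
    static boolean resultProcessed(ThreadFactory factory, long timeoutMillis)
            throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(), factory);
        CountDownLatch processed = new CountDownLatch(1);
        try {
            // The pool must create its first worker thread to run the job.
            pool.execute(processed::countDown);
        } catch (Throwable t) {
            // Simplification: the error is only logged. The job was never
            // queued or run, and nothing resubmits it.
            System.err.println("error while submitting result handler: " + t);
        }
        try {
            // A waiter without a timeout would block here indefinitely.
            return processed.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("healthy host:        "
                + resultProcessed(Executors.defaultThreadFactory(), 1000));
        System.out.println("thread-starved host: "
                + resultProcessed(FAILING_FACTORY, 1000));
    }
}
```

In this sketch the first call reports that the result was processed and the second reports that it was not. The second case is the analogue of the driver never recording task 0.0 as finished.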

       

      driver submit task 0.0: see driver_submit_task.png

      executor 4 task 0.0: see executor_4.png

      oom-killer: see oom-killer.png

      Attachments

        1. driver_submit_task.png
          304 kB
          TianyiMa
        2. executor_4.png
          593 kB
          TianyiMa
        3. oom-killer.png
          307 kB
          TianyiMa

            People

              Assignee: Unassigned
              Reporter: TianyiMa