Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.1.1, 3.5.0
Fix Version/s: None
Description
We observed a Spark driver hang for about 11 hours until it was finally killed by the user. The driver log contains the following error:
16:42:40 151 ERROR (org.apache.spark.rpc.netty.Inbox:94) - An error happened while processing message in the inbox for CoarseGrainedScheduler
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:719)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:61)
at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:769)
at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:745)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:144)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
After detailed analysis, we found that the driver submitted task 0.0 to executor 4 at "16:40:50". Executor 4 finished task 0.0 at "16:42:39" and sent the result to the driver. At that moment, however, the server running the driver did not have sufficient memory, so the driver was "unable to create new native thread" to handle the successful result of task 0.0. The driver therefore considered task 0.0 unfinished and waited forever for the "missed result".
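The failure mode described above can be reproduced in isolation with a small Java sketch. This is not Spark code and not a proposed patch: `LostResultSketch`, the `tasks` map, the `State` enum, and both `*StatusUpdate` methods are hypothetical names, and the `OutOfMemoryError` is simulated by a `ThreadFactory` that throws instead of by real memory pressure. It only illustrates that when `ThreadPoolExecutor.execute` throws while starting a worker, the submitted handler never runs, so whatever state it was supposed to update stays unchanged unless the caller reacts to the error.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class LostResultSketch {
    public enum State { RUNNING, FINISHED, NEEDS_RETRY }

    // Hypothetical stand-in for the scheduler's view of each task.
    public static final Map<String, State> tasks = new ConcurrentHashMap<>();

    // Assumption: simulate the host refusing to create a native thread.
    static final ThreadFactory FAILING_FACTORY = r -> {
        throw new OutOfMemoryError("unable to create new native thread");
    };

    public static ThreadPoolExecutor newPool() {
        return new ThreadPoolExecutor(4, 4, 60L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>(), FAILING_FACTORY);
    }

    // Mirrors the reported behavior: the error escapes execute() and
    // nothing ever updates the task, so it stays RUNNING.
    public static void unguardedStatusUpdate(ThreadPoolExecutor pool, String taskId) {
        pool.execute(() -> tasks.put(taskId, State.FINISHED));
    }

    // One possible defensive variant: if the handler cannot be scheduled,
    // record that the result was lost so the task can be retried.
    public static void guardedStatusUpdate(ThreadPoolExecutor pool, String taskId) {
        try {
            pool.execute(() -> tasks.put(taskId, State.FINISHED));
        } catch (Throwable t) {
            tasks.put(taskId, State.NEEDS_RETRY);
        }
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = newPool();

        tasks.put("unguarded", State.RUNNING);
        try {
            unguardedStatusUpdate(pool, "unguarded");
        } catch (OutOfMemoryError e) {
            // Plays the role of the message loop that only logs the error.
            System.out.println("logged: " + e.getMessage());
        }
        System.out.println("unguarded task state: " + tasks.get("unguarded"));

        tasks.put("guarded", State.RUNNING);
        guardedStatusUpdate(pool, "guarded");
        System.out.println("guarded task state: " + tasks.get("guarded"));

        pool.shutdown();
    }
}
```

In the unguarded path the task remains `RUNNING` indefinitely, which matches the hang reported here. Catching `Throwable` (and therefore `OutOfMemoryError`) in the guarded path is a debatable design choice; it is shown only to make the contrast visible.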
driver submit task 0.0
executor 4 task 0.0
oom-killer: