Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Won't Fix
- Affects Version/s: 2.3.1
- None
- None
Description
Spark's UnsafeShuffleWriter may occasionally enter an infinite loop when calling transferTo. While merging shuffle temp files, the task hung for several hours until it was killed manually. In the log, nothing is written after the shuffle data is spilled to disk, yet the executor is still alive.
The thread dump shows that the thread is always inside the native method size0.
Tracing the system calls with strace shows that this thread calls fstat continuously, and system usage is quite high (see the screenshot).
We did not find the root cause; I suspect it is related to a file system or disk issue. In any case, we should find a way to fail fast in such a scenario.
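To make the fail-fast idea concrete, here is a minimal sketch of a transferTo copy loop that gives up after a bounded number of calls that transfer nothing. It is not Spark's code: the class TransferGuard, the method copyWithProgressGuard and the maxStalledCalls parameter are hypothetical names invented for illustration, and it assumes the hang comes from transferTo repeatedly returning 0, which is not confirmed in this report.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

// Hypothetical helper, not part of Spark.
final class TransferGuard {
    private TransferGuard() {}

    /**
     * Copies bytesToCopy bytes from input, starting at startPosition, to output.
     * Throws an IOException once maxStalledCalls consecutive transferTo calls
     * have transferred zero bytes, instead of looping forever.
     */
    static long copyWithProgressGuard(
            FileChannel input,
            WritableByteChannel output,
            long startPosition,
            long bytesToCopy,
            int maxStalledCalls) throws IOException {
        long copied = 0L;
        int stalledCalls = 0;
        while (copied < bytesToCopy) {
            long n = input.transferTo(startPosition + copied, bytesToCopy - copied, output);
            if (n > 0) {
                copied += n;
                stalledCalls = 0; // progress was made, reset the stall counter
            } else if (++stalledCalls >= maxStalledCalls) {
                throw new IOException("transferTo made no progress after " + stalledCalls
                        + " consecutive calls; copied " + copied + " of " + bytesToCopy + " bytes");
            }
        }
        return copied;
    }
}
```

A guard like this only covers the case where transferTo returns without transferring anything. It would not help if a single call blocked inside the kernel, and the right threshold (or a time-based limit instead of a call count) is a judgment call.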
Attachments