SPARK-28849

Spark's UnsafeShuffleWriter may occasionally run into an infinite loop in transferTo

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.3.1
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels:
      None

      Description

      Spark's UnsafeShuffleWriter may occasionally run into an infinite loop when calling transferTo. What we saw is that, while merging the shuffle temp files, the task hung for several hours until it was killed manually. As the log shows, nothing is logged after the shuffle data is spilled to disk, yet the executor is still alive.

      Here is the thread dump; it shows that the thread is always in the native method size0.

      We also used strace to trace the system calls and found that this thread is constantly calling fstat, and the system usage is pretty high, as the screenshot shows.

      We have not found the root cause; my guess is that it might be related to a file system or disk issue. Either way, we should figure out a way to fail fast in such a scenario.
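
One possible shape for such a fail-fast guard is sketched below. It is not Spark's code: the class `BoundedTransfer`, the method `transferFully`, and the `maxStalls` parameter are hypothetical names chosen for illustration. The sketch assumes the hang comes from a copy loop in which `FileChannel.transferTo` keeps returning 0 (which it does, for example, when the requested position is at or beyond the file's current size), and it gives up after a fixed number of consecutive zero-byte transfers instead of retrying forever.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

// Hypothetical helper, not part of Spark: a transferTo copy loop that
// fails fast when no progress is being made.
public final class BoundedTransfer {
    private BoundedTransfer() {}

    /**
     * Copies `count` bytes from `in`, starting at `position`, into `out`.
     * Throws an IOException once transferTo has returned 0 bytes
     * `maxStalls` times in a row, rather than looping indefinitely.
     */
    public static long transferFully(FileChannel in, WritableByteChannel out,
                                     long position, long count, int maxStalls)
            throws IOException {
        long copied = 0L;
        int stalls = 0;
        while (copied < count) {
            long n = in.transferTo(position + copied, count - copied, out);
            if (n > 0) {
                copied += n;
                stalls = 0; // progress was made, so reset the stall counter
            } else if (++stalls >= maxStalls) {
                throw new IOException("transferTo made no progress after " + stalls
                    + " attempts: copied " + copied + " of " + count
                    + " bytes, source size " + in.size());
            }
        }
        return copied;
    }
}
```

With a guard like this, a task that asks for more bytes than the spill file actually holds would fail with an exception that records the expected and actual sizes, and could then be retried, instead of spinning in fstat. The right value for `maxStalls` is an open question, since a non-blocking target channel can legitimately return 0 for a while.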

        Attachments

        1. 91ADA.png (53 kB, Saisai Shao)
        2. 95330.png (168 kB, Saisai Shao)
        3. D18F4.png (109 kB, Saisai Shao)

              People

              • Assignee: Unassigned
              • Reporter: Saisai Shao (jerryshao)