  1. Spark
  2. SPARK-28849

Spark's UnsafeShuffleWriter may occasionally run into an infinite loop in transferTo

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.3.1
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      Spark's UnsafeShuffleWriter may occasionally run into an infinite loop when calling transferTo. We observed that, while merging the shuffle temp files, the task hung for several hours until it was killed manually. As the attached log shows, nothing is logged after the shuffle data is spilled to disk, yet the executor is still alive.

      The attached thread dump shows that the thread is always inside the native method size0.

      We also used strace to trace the system calls and found that this thread calls fstat continuously, and system usage is quite high, as the attached screenshot shows.

      We did not find the root cause; my guess is that it is related to a file system or disk issue. In any case, we should find a way to fail fast in such a scenario.
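
      One possible shape for such a fail-fast guard is sketched below. This is only an illustration, not a patch from this issue or Spark's actual code. It assumes the hang comes from a copy loop that keeps calling FileChannel.transferTo while it returns 0 bytes, which the Javadoc says happens when the requested position is past the end of the file. The root cause was not confirmed, so this may not match the real failure. The names BoundedTransfer, transferOrFail and maxStalls are hypothetical.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of Spark.
final class BoundedTransfer {

    /**
     * Copies count bytes from in (starting at position) to out.
     * Throws instead of spinning when transferTo reports zero progress
     * maxStalls times in a row.
     */
    public static long transferOrFail(FileChannel in, WritableByteChannel out,
                                      long position, long count, int maxStalls)
            throws IOException {
        long transferred = 0L;
        int stalls = 0;
        while (transferred < count) {
            long n = in.transferTo(position + transferred, count - transferred, out);
            if (n > 0) {
                transferred += n;
                stalls = 0; // progress was made, reset the stall counter
            } else if (++stalls >= maxStalls) {
                throw new IOException("transferTo made no progress after " + stalls
                        + " attempts (" + transferred + " of " + count + " bytes copied)");
            }
        }
        return transferred;
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("spill", ".bin");
        Path dst = Files.createTempFile("merged", ".bin");
        try {
            Files.write(src, new byte[4096]);
            try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
                 FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE)) {
                long copied = transferOrFail(in, out, 0L, in.size(), 3);
                System.out.println("copied " + copied + " bytes");
            }
        } finally {
            Files.deleteIfExists(src);
            Files.deleteIfExists(dst);
        }
    }
}
```

      With a guard like this, a transfer that stops making progress surfaces as an IOException, so the task can fail and be retried instead of hanging until it is killed manually. A time-based deadline would be an alternative to the attempt counter.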

      Attachments

        1. D18F4.png
          109 kB
          Saisai Shao
        2. 95330.png
          168 kB
          Saisai Shao
        3. 91ADA.png
          53 kB
          Saisai Shao

            People

              Assignee: Unassigned
              Reporter: Saisai Shao (jerryshao)
              Votes: 0
              Watchers: 1
