Details
- Improvement
- Status: In Progress
- Major
- Resolution: Unresolved
- 3.0.0
- None
- None
Description
java.io.FileNotFoundException: /mnt/dfs/4/yarn/local/usercache/da_haitao/appcache/application_1568691584183_1953115/blockmgr-cc4689f5-eddd-4b99-8af4-4166a86ec30b/10/temp_shuffle_79be5049-d1d5-4a81-8e67-4ef236d3834f (No such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
    at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
    at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:249)
    at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:209)
    at org.apache.spark.shuffle.sort.ShuffleExternalSorter.closeAndGetSpills(ShuffleExternalSorter.java:416)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:230)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:190)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Local and temp shuffle files are initialized without any accessibility check, because the getFile method in DiskBlockManager may return a file under an already existing subdirectory. When a disk failure occurs, those files can become inaccessible and a FileNotFoundException is thrown later, when the file is opened for writing, which may fail the entire task. Re-running the task is a heavy response to this kind of error; we could at least try one or more other disks first.
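The idea of trying another disk can be sketched as follows. This is a minimal illustration, not Spark's code: the class TempShuffleFileSketch, the method createTempShuffleFile, and the maxAttempts parameter are hypothetical names, and the sketch assumes the candidate local directories are passed in as a plain list. It creates the file eagerly to detect a bad disk, which has a cost of its own (the linked SPARK-31314 concerns a regression from creating temporary files eagerly).

```java
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.UUID;

// Hypothetical helper, not part of Spark: probes local dirs until one accepts a new file.
final class TempShuffleFileSketch {

    /**
     * Tries up to maxAttempts directories in order and returns the first
     * temp shuffle file that could actually be created on disk.
     */
    static File createTempShuffleFile(List<File> localDirs, int maxAttempts) throws IOException {
        IOException last = null;
        int attempts = Math.min(maxAttempts, localDirs.size());
        for (int i = 0; i < attempts; i++) {
            File candidate = new File(localDirs.get(i), "temp_shuffle_" + UUID.randomUUID());
            try {
                // createNewFile throws IOException when the parent directory
                // is missing or the disk cannot be written to.
                if (candidate.createNewFile()) {
                    return candidate;
                }
            } catch (IOException e) {
                last = e; // remember the failure and move on to the next directory
            }
        }
        throw new IOException("No usable local directory after " + attempts + " attempt(s)", last);
    }
}
```

With two candidate directories where the first is missing, the call falls through to the second and returns a file there; only when every attempt fails does the caller see an exception, at which point failing the task is still the fallback.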
Issue Links
- breaks: SPARK-31314 Revert SPARK-29285 to fix shuffle regression caused by creating temporary file eagerly (Resolved)