Details
Bug
Status: Open
Major
Resolution: Unresolved
3.2.2
None
None
Description
Related cluster configuration:
- MAX_FETCH_FAILURES_NOTIFICATIONS is 3
- NodeManager recovery is disabled
Bug scenario:
1. Submit a wordcount job that contains 2 simple map tasks (map_0 and map_1) and 1 simple reduce task (reduce_0).
2. All map tasks finish successfully and the AppMaster is notified.
3. The NodeManager that ran the map task map_1 crashes.
4. The AppMaster schedules a reduce attempt.
5. The reduce attempt sends a statusUpdate message to the AppMaster to report a fetch failure.
6. The reduce attempt fails with Shuffle$ShuffleError, caused by java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
7. The reduce attempt sends a fatalError message to the AppMaster.
8. The AppMaster successively reschedules another three reduce attempts, and all of them fail with Shuffle$ShuffleError.
9. The AppMaster fails the wordcount job because the reduce task failed.
10. The AppMaster then receives three statusUpdate messages that report a fetch failure, like the message in step 5. Because it has already failed the job, it does not rerun the task map_1.
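
The ordering problem in the scenario above can be sketched with a toy model. This is not Hadoop code: the class FetchFailureRace, its Event and Outcome types, and the helper methods are hypothetical names invented for illustration. The model assumes that the AppMaster reruns a map only once it has counted MAX_FETCH_FAILURES_NOTIFICATIONS fetch-failure reports for it, and that the job fails after four failed reduce attempts (the one initial attempt plus the three rescheduled ones described above). The real AppMaster logic may consider further conditions.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of the reported race; all names here are hypothetical.
final class FetchFailureRace {
    enum Event { FETCH_FAILURE, REDUCE_ATTEMPT_FAILED }

    static final class Outcome {
        final boolean mapRerun;
        final boolean jobFailed;
        final int ignoredNotifications;

        Outcome(boolean mapRerun, boolean jobFailed, int ignoredNotifications) {
            this.mapRerun = mapRerun;
            this.jobFailed = jobFailed;
            this.ignoredNotifications = ignoredNotifications;
        }
    }

    // Processes events in arrival order, as a simplified AppMaster would.
    static Outcome simulate(List<Event> events, int notificationThreshold, int maxReduceAttempts) {
        int fetchFailures = 0;
        int failedAttempts = 0;
        int ignored = 0;
        boolean jobFailed = false;
        for (Event e : events) {
            if (jobFailed) {
                // The job is already terminal: late reports change nothing.
                if (e == Event.FETCH_FAILURE) {
                    ignored++;
                }
                continue;
            }
            switch (e) {
                case FETCH_FAILURE:
                    fetchFailures++;
                    if (fetchFailures >= notificationThreshold) {
                        // Enough reports: the map output is declared lost and the map reruns.
                        return new Outcome(true, false, ignored);
                    }
                    break;
                case REDUCE_ATTEMPT_FAILED:
                    failedAttempts++;
                    if (failedAttempts >= maxReduceAttempts) {
                        jobFailed = true;
                    }
                    break;
            }
        }
        return new Outcome(false, jobFailed, ignored);
    }

    // Arrival order from the report: one report, four failed attempts, then three late reports.
    static List<Event> reportedOrder() {
        return Arrays.asList(
            Event.FETCH_FAILURE,
            Event.REDUCE_ATTEMPT_FAILED, Event.REDUCE_ATTEMPT_FAILED,
            Event.REDUCE_ATTEMPT_FAILED, Event.REDUCE_ATTEMPT_FAILED,
            Event.FETCH_FAILURE, Event.FETCH_FAILURE, Event.FETCH_FAILURE);
    }

    // Hypothetical order in which each report arrives before the next attempt fails.
    static List<Event> promptOrder() {
        return Arrays.asList(
            Event.FETCH_FAILURE, Event.REDUCE_ATTEMPT_FAILED,
            Event.FETCH_FAILURE, Event.REDUCE_ATTEMPT_FAILED,
            Event.FETCH_FAILURE);
    }

    public static void main(String[] args) {
        Outcome late = simulate(reportedOrder(), 3, 4);
        Outcome prompt = simulate(promptOrder(), 3, 4);
        System.out.println("reported order: mapRerun=" + late.mapRerun
            + " jobFailed=" + late.jobFailed + " ignored=" + late.ignoredNotifications);
        System.out.println("prompt order:   mapRerun=" + prompt.mapRerun
            + " jobFailed=" + prompt.jobFailed);
    }
}
```

In this model the same four fetch-failure reports lead to opposite outcomes depending only on when they arrive relative to the reduce attempt failures, which is the behavior the scenario describes.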