[MAPREDUCE-7349] An unexpected node crash and delayed messages would fail the job - ASF JIRA

Related cluster configuration:

Bug scenario:

1. Submit a wordcount job that contains 2 simple map tasks (map_0 and map_1) and 1 simple reduce task (reduce_0).
2. All map tasks finish successfully, and the AppMaster is notified.
3. The NodeManager that ran map task map_1 crashes.
4. The AppMaster schedules a reduce attempt.
5. The reduce attempt sends a statusUpdate message to the AppMaster to report a fetch failure.
6. The reduce attempt fails with Shuffle$ShuffleError, caused by java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
7. The reduce attempt sends a fatalError message to the AppMaster.
8. The AppMaster successively reschedules another three reduce attempts, but all of them fail with Shuffle$ShuffleError.
9. The AppMaster fails the wordcount job because the reduce task failed.
10. The AppMaster receives three statusUpdate messages reporting fetch failures, like the message in step 5, but it has already failed the job and does not rerun map task map_1.
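
The ordering problem in the scenario above can be illustrated with a small toy model. This is only a sketch and not Hadoop code: the class FetchFailureRaceSketch, its enums, and both thresholds are hypothetical names and values chosen to mirror the steps in this report (four failed reduce attempts fail the job; three fetch-failure reports are assumed to be enough to trigger a rerun of the map task). It compares the delivery order described in the scenario with an order in which each fetch-failure report reaches the AppMaster before the next reduce attempt fails.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of the message ordering in the bug scenario. Not Hadoop code:
// every name and threshold here is a hypothetical stand-in.
public class FetchFailureRaceSketch {

    enum Event { FETCH_FAILURE_REPORT, REDUCE_ATTEMPT_FAILED }

    enum Outcome { JOB_FAILED, MAP_RERUN, UNDECIDED }

    // Assumed: the job fails once this many reduce attempts have failed
    // (one initial attempt plus three rescheduled ones in the scenario).
    static final int MAX_REDUCE_ATTEMPTS = 4;

    // Assumed: this many fetch-failure reports make the AppMaster rerun the map.
    static final int FETCH_FAILURES_BEFORE_MAP_RERUN = 3;

    // Processes events in delivery order and stops at the first terminal decision.
    static Outcome simulate(List<Event> deliveryOrder) {
        int failedReduceAttempts = 0;
        int fetchFailureReports = 0;
        for (Event e : deliveryOrder) {
            switch (e) {
                case FETCH_FAILURE_REPORT:
                    fetchFailureReports++;
                    if (fetchFailureReports >= FETCH_FAILURES_BEFORE_MAP_RERUN) {
                        return Outcome.MAP_RERUN;
                    }
                    break;
                case REDUCE_ATTEMPT_FAILED:
                    failedReduceAttempts++;
                    if (failedReduceAttempts >= MAX_REDUCE_ATTEMPTS) {
                        return Outcome.JOB_FAILED;
                    }
                    break;
            }
        }
        return Outcome.UNDECIDED;
    }

    // Order from the scenario: one report arrives, four attempts fail,
    // and the remaining three reports arrive only after the job has failed.
    static List<Event> delayedOrder() {
        return Arrays.asList(
            Event.FETCH_FAILURE_REPORT,
            Event.REDUCE_ATTEMPT_FAILED,
            Event.REDUCE_ATTEMPT_FAILED,
            Event.REDUCE_ATTEMPT_FAILED,
            Event.REDUCE_ATTEMPT_FAILED,
            Event.FETCH_FAILURE_REPORT,
            Event.FETCH_FAILURE_REPORT,
            Event.FETCH_FAILURE_REPORT);
    }

    // Hypothetical timely order: each report arrives before its attempt fails.
    static List<Event> timelyOrder() {
        return Arrays.asList(
            Event.FETCH_FAILURE_REPORT,
            Event.REDUCE_ATTEMPT_FAILED,
            Event.FETCH_FAILURE_REPORT,
            Event.REDUCE_ATTEMPT_FAILED,
            Event.FETCH_FAILURE_REPORT,
            Event.REDUCE_ATTEMPT_FAILED,
            Event.FETCH_FAILURE_REPORT,
            Event.REDUCE_ATTEMPT_FAILED);
    }

    public static void main(String[] args) {
        System.out.println("delayed reports: " + simulate(delayedOrder()));
        System.out.println("timely reports:  " + simulate(timelyOrder()));
    }
}
```

Under these assumed thresholds, the delayed order ends in JOB_FAILED and the timely order ends in MAP_RERUN. The same eight messages lead to different outcomes depending only on when the fetch-failure reports are delivered.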