  1. Hadoop Map/Reduce
  2. MAPREDUCE-7349

An unexpected node crash and delayed messages would fail the job


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.2
    • Fix Version/s: None
    • Component/s: applicationmaster
    • Labels: None

    Description

      Related cluster configuration:

      • MAX_FETCH_FAILURES_NOTIFICATIONS is 3
      • NodeManager recovery is disabled

      Bug scenario:

      1. Submit a wordcount job that contains 2 simple map tasks (map_0 and map_1) and 1 simple reduce task (reduce_0).
      2. All map tasks finish successfully and the AppMaster is notified.
      3. The NodeManager that ran map task map_1 crashes.
      4. The AppMaster schedules a reduce attempt.
      5. The reduce attempt sends a statusUpdate message to the AppMaster to report a fetch failure.
      6. The reduce attempt fails with Shuffle$ShuffleError, caused by java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
      7. The reduce attempt sends a fatalError message to the AppMaster.
      8. The AppMaster successively reschedules three more reduce attempts, all of which fail with Shuffle$ShuffleError.
      9. The AppMaster fails the wordcount job because of the failed reduce task.
      10. The AppMaster then receives three statusUpdate messages reporting fetch failures, like the message in step 5, but it has already failed the job and does not rerun map_1.
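
      The ordering problem in the steps above can be illustrated with a small toy model. This is only a sketch and not Hadoop's actual AppMaster code: the class ToyAppMaster, the Event enum, and the replay helpers are hypothetical names, the limit of four reduce attempts is an assumption read off steps 4 to 8, and the real decision logic in Hadoop may involve further conditions that are ignored here. Only the value MAX_FETCH_FAILURES_NOTIFICATIONS = 3 comes from the report.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the message ordering in the scenario; not Hadoop's real code. */
final class FetchFailureRaceSketch {
    // From the report: three fetch-failure notifications are required.
    static final int MAX_FETCH_FAILURES_NOTIFICATIONS = 3;
    // Assumption for illustration: 1 original + 3 rescheduled attempts (steps 4-8).
    static final int MAX_REDUCE_ATTEMPTS = 4;

    enum Event { FETCH_FAILURE, REDUCE_ATTEMPT_FAILED }

    /** Hypothetical stand-in for the AppMaster's job state. */
    static final class ToyAppMaster {
        int fetchFailureReports = 0;
        int failedReduceAttempts = 0;
        boolean jobFailed = false;
        boolean mapRerun = false;

        void handle(Event e) {
            if (jobFailed) {
                return; // messages that arrive after the job failed have no effect
            }
            switch (e) {
                case FETCH_FAILURE:
                    fetchFailureReports++;
                    if (fetchFailureReports >= MAX_FETCH_FAILURES_NOTIFICATIONS) {
                        mapRerun = true; // enough reports: rerun the lost map task
                    }
                    break;
                case REDUCE_ATTEMPT_FAILED:
                    failedReduceAttempts++;
                    if (failedReduceAttempts >= MAX_REDUCE_ATTEMPTS) {
                        jobFailed = true; // reduce task exhausted its attempts
                    }
                    break;
            }
        }
    }

    static ToyAppMaster replay(List<Event> events) {
        ToyAppMaster am = new ToyAppMaster();
        for (Event e : events) {
            am.handle(e);
        }
        return am;
    }

    /** Order from the report: only the first fetch-failure report arrives in time. */
    static List<Event> delayedOrder() {
        List<Event> events = new ArrayList<>();
        events.add(Event.FETCH_FAILURE);                 // step 5
        for (int i = 0; i < MAX_REDUCE_ATTEMPTS; i++) {
            events.add(Event.REDUCE_ATTEMPT_FAILED);     // steps 6-9
        }
        for (int i = 0; i < 3; i++) {
            events.add(Event.FETCH_FAILURE);             // step 10, too late
        }
        return events;
    }

    /** Hypothetical order in which each report arrives before its attempt fails. */
    static List<Event> timelyOrder() {
        List<Event> events = new ArrayList<>();
        events.add(Event.FETCH_FAILURE);
        events.add(Event.REDUCE_ATTEMPT_FAILED);
        events.add(Event.FETCH_FAILURE);
        events.add(Event.REDUCE_ATTEMPT_FAILED);
        events.add(Event.FETCH_FAILURE);                 // third report triggers the rerun
        return events;
    }

    public static void main(String[] args) {
        ToyAppMaster late = replay(delayedOrder());
        ToyAppMaster onTime = replay(timelyOrder());
        System.out.println("delayed: jobFailed=" + late.jobFailed + " mapRerun=" + late.mapRerun);
        System.out.println("timely:  jobFailed=" + onTime.jobFailed + " mapRerun=" + onTime.mapRerun);
    }
}
```

      In this model the delayed ordering fails the job after counting a single fetch-failure report, while the timely ordering reaches the threshold of three and reruns the map task before the reduce attempts run out.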
         
         

          People

            Assignee: Unassigned
            Reporter: gy_way May