  1. Apache Tez
  2. TEZ-4400

Tez takes a long time to recover from shuffle data not found errors

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate

    Description

      Recently, many nodes had their shuffle data wiped during a NodeManager (NM) upgrade, and many Tez jobs took far too long to recover. Recovery from this condition should be fast. The NM returns an error code indicating that the shuffle data was not found, and that alone is sufficient evidence that no amount of retrying is likely to fix the problem. As soon as the NM reports shuffle data as not found, the task should report the not-found error to the ApplicationMaster (AM), and the AM should treat even a single not-found error as sufficient cause to re-run the upstream task.
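
      The proposed behavior could be sketched roughly as follows. This is only an illustration of the decision rule described above, not Tez code: the class `ShuffleFailurePolicy`, the enum `FetchFailure`, and the threshold parameter are all hypothetical names invented for this example, and the real fetcher and AM logic in Tez is structured differently.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of the AM-side decision rule proposed in this issue.
 * None of these names are real Tez classes or configuration keys.
 */
class ShuffleFailurePolicy {

    /** Illustrative failure kinds a downstream task might report. */
    enum FetchFailure { CONNECTION_ERROR, READ_TIMEOUT, DATA_NOT_FOUND }

    // Assumed threshold for failures that might succeed on retry.
    private final int retryableThreshold;

    // Count of possibly transient failures seen per upstream task.
    private final Map<String, Integer> retryableFailures = new HashMap<>();

    ShuffleFailurePolicy(int retryableThreshold) {
        this.retryableThreshold = retryableThreshold;
    }

    /**
     * Returns true if the upstream task should be re-run.
     * A single DATA_NOT_FOUND report is treated as conclusive; other
     * failures must accumulate up to the threshold first.
     */
    boolean shouldRerunUpstream(String upstreamTaskId, FetchFailure failure) {
        if (failure == FetchFailure.DATA_NOT_FOUND) {
            // The data is gone, so retrying the fetch cannot help.
            return true;
        }
        int count = retryableFailures.merge(upstreamTaskId, 1, Integer::sum);
        return count >= retryableThreshold;
    }
}
```

      With a threshold of 3, a connection error would trigger a re-run only on its third report for a given upstream task, while a not-found error would trigger it on the first.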

            People

              Assignee: Unassigned
              Reporter: Eric Payne (epayne)
              Votes: 0
              Watchers: 3
