Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Duplicate
Description
Recently, many nodes had their shuffle data wiped during an NM upgrade, and many of the Tez jobs took far too long to recover. Recovery from this situation should be quick. The NM returns an error code indicating that the shuffle data was not found, and that alone is sufficient evidence that no number of retries is likely to fix the issue. As soon as the NM reports shuffle data as not found, the task should report the not-found error to the AM, and the AM should treat even a single not-found error as sufficient cause to re-run the upstream task.
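To make the requested behavior concrete, here is a minimal sketch of the AM-side decision in Java. It is not Tez code: the class `FetchFailurePolicy`, the enum `FailureKind`, and the threshold parameter are all hypothetical names chosen for illustration, and the assumption that other failure kinds are counted against a per-attempt threshold is only a stand-in for whatever retry accounting the AM already does.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical AM-side policy for fetch failure reports.
 * None of these names are real Tez APIs; this only illustrates the proposal.
 */
class FetchFailurePolicy {

    /** Illustrative failure kinds a downstream task might report to the AM. */
    enum FailureKind { NOT_FOUND, CONNECTION_ERROR, TIMEOUT }

    // Assumed threshold for failures that a retry might still fix.
    private final int transientFailureThreshold;

    // Count of possibly-transient failures per upstream task attempt.
    private final Map<String, Integer> transientFailures = new HashMap<>();

    FetchFailurePolicy(int transientFailureThreshold) {
        this.transientFailureThreshold = transientFailureThreshold;
    }

    /**
     * Records one reported fetch failure against an upstream task attempt.
     *
     * @return true if the upstream task should be re-run
     */
    boolean recordFailure(String upstreamAttemptId, FailureKind kind) {
        if (kind == FailureKind.NOT_FOUND) {
            // The NM says the shuffle data is gone, so retrying cannot help:
            // a single report is enough to re-run the upstream task.
            return true;
        }
        // Other failures may be transient, so count them against the threshold.
        int count = transientFailures.merge(upstreamAttemptId, 1, Integer::sum);
        return count >= transientFailureThreshold;
    }
}
```

With a threshold of 3, for example, `recordFailure(id, FailureKind.NOT_FOUND)` returns true on the first report, while timeouts or connection errors against the same attempt only return true on the third report.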
Issue Links
- duplicates TEZ-4233: Map task should be blamed earlier for local fetch failures (Resolved)