[TEZ-4400] Tez takes a long time to recover from shuffle data not found errors - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

Recently, many nodes had their shuffle data wiped during an NM upgrade, and many Tez jobs took far too long to recover. Recovery from this condition should be quick. The NM returns an error code indicating that the shuffle data was not found, and that alone is sufficient evidence that no amount of retries is likely to fix the issue. As soon as the NM reports shuffle data as not found, the task should report the not-found error to the AM, and the AM should treat even a single not-found error as sufficient cause to re-run the upstream task.
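
The proposed behavior could be sketched roughly as follows. This is only an illustration of the policy described above, not Tez code: the class `FetchFailurePolicy`, the enum `FetchError`, the method `shouldRerunUpstream`, and the retry threshold are all hypothetical names introduced for this example, and the sketch assumes that other fetch errors remain retryable up to some limit.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the AM-side decision; none of these names exist in Tez.
public class FetchFailurePolicy {

    // Assumed classification of the error reported by the fetching task.
    public enum FetchError { NOT_FOUND, CONNECTION_FAILED }

    // Assumed limit for errors that may still succeed on retry.
    private final int retryableThreshold;

    // Count of retryable failures seen per upstream task.
    private final Map<String, Integer> retryableFailures = new HashMap<>();

    public FetchFailurePolicy(int retryableThreshold) {
        this.retryableThreshold = retryableThreshold;
    }

    // Returns true when the upstream task should be re-run.
    public boolean shouldRerunUpstream(String upstreamTaskId, FetchError error) {
        if (error == FetchError.NOT_FOUND) {
            // The shuffle data is gone, so retrying cannot help:
            // a single report is treated as sufficient cause.
            return true;
        }
        // Other errors may be transient: count them and re-run
        // only once the assumed threshold is reached.
        int count = retryableFailures.merge(upstreamTaskId, 1, Integer::sum);
        return count >= retryableThreshold;
    }
}
```

With a threshold of 3, a single `NOT_FOUND` report for an upstream task triggers a re-run immediately, while `CONNECTION_FAILED` reports trigger one only on the third occurrence.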

Issue Links

duplicates: TEZ-4233 Map task should be blamed earlier for local fetch failures (Resolved)

People

Assignee: Unassigned

Reporter: Eric Payne

Votes: 0

Watchers: 3

Dates

Created: 30/Mar/22 21:12

Updated: 29/Oct/23 18:19

Resolved: 29/Oct/23 18:19