Hadoop Map/Reduce
MAPREDUCE-6135

Job staging directory remains if MRAppMaster is OOM


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate

    Description

      If the MRAppMaster attempts run out of memory, the job won't go through the normal cleanup process that moves history files to the history server location. When customers try to find out why the job failed, the data won't be available on the history server web UI.

      The workaround is to extract the container id and NM id from the jhist file in the job staging directory, then use the "yarn logs" command to get the AM logs.
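A minimal sketch of this workaround, assuming the jhist file is in Avro-JSON form. The sample event line, field names ("containerId", "nodeManagerHost", "nodeManagerPort"), and ids below are illustrative assumptions, not the exact jhist schema; the script prints the "yarn logs" invocation rather than running it, since there is no cluster to talk to:

```shell
# Stand-in for a jhist file copied out of the job staging directory.
JHIST=$(mktemp)
cat > "$JHIST" <<'EOF'
{"type":"AM_STARTED","event":{"containerId":"container_1400000000000_0001_01_000001","nodeManagerHost":"nm-host.example.com","nodeManagerPort":8041}}
EOF

# Pull the AM container id and NM address out of the jhist JSON.
CONTAINER_ID=$(grep -o '"containerId":"[^"]*"' "$JHIST" | head -1 | cut -d'"' -f4)
NM_HOST=$(grep -o '"nodeManagerHost":"[^"]*"' "$JHIST" | head -1 | cut -d'"' -f4)
NM_PORT=$(grep -o '"nodeManagerPort":[0-9]*' "$JHIST" | head -1 | grep -o '[0-9][0-9]*')

# Derive the application id from the container id: drop the "container_"
# prefix and the attempt/container suffix, keep cluster timestamp + app number.
APP_ID="application_$(echo "$CONTAINER_ID" | cut -d_ -f2-3)"

# On a real cluster this command fetches the AM logs from the NM.
echo "yarn logs -applicationId $APP_ID -containerId $CONTAINER_ID -nodeAddress $NM_HOST:$NM_PORT"
```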

      It would be great if the platform could take care of this by automatically moving these history files to the history server when AM attempts don't exit properly.

      We have discussed ideas on how to address this and would like to get suggestions from others. We are not sure whether the timeline server design covers this scenario.

      1. Define some protocol for YARN to tell the AppMaster "you have exceeded the AM max attempt count, please clean up". For example, YARN could launch the AppMaster one more time after the max attempt count is reached, and MRAppMaster would use that as the indication that this is a cleanup-only attempt.

      2. Have some program periodically check job statuses and, for finished jobs, move files from the job staging directory to the history server.
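Idea 2 could be sketched as a periodic sweeper. In this sketch, local temp directories stand in for the HDFS staging and history "done" paths, and the `job_is_finished` stub stands in for a real liveness check against the ResourceManager (e.g. via "mapred job -status"); all ids and paths are illustrative assumptions:

```shell
STAGING=$(mktemp -d)   # stands in for /user/<user>/.staging on HDFS
DONE_DIR=$(mktemp -d)  # stands in for the history server's done dir

# Simulated leftover staging dirs from two jobs whose AMs died without cleanup.
mkdir -p "$STAGING/job_1400000000000_0001" "$STAGING/job_1400000000000_0002"
touch "$STAGING/job_1400000000000_0001/job_1400000000000_0001-1.jhist"
touch "$STAGING/job_1400000000000_0002/job_1400000000000_0002-1.jhist"

# Stub: a real sweeper would ask the RM whether the job is still running.
# Here, job ..._0002 is treated as still running.
job_is_finished() {
  [ "$1" != "job_1400000000000_0002" ]
}

# Sweep: move jhist files of finished jobs to the done dir (a real version
# would use "hdfs dfs -mv"), then drop the staging dir.
for dir in "$STAGING"/job_*; do
  job=$(basename "$dir")
  if job_is_finished "$job"; then
    mv "$dir"/*.jhist "$DONE_DIR"/
    rm -r "$dir"
  fi
done
```

Running this sweep leaves the still-running job's staging directory untouched while the finished job's history file lands in the done dir, where the history server could index it. The open design question is avoiding races with a live AM doing its own cleanup.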

            People

              Assignee: Unassigned
              Reporter: Ming Ma
              Votes: 0
              Watchers: 3
