[OOZIE-2326] oozie/yarn/spark: active container remains after failed job - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 4.1.0
Fix Version/s: None
Component/s: workflow
Labels: None
Environment: pseudo-distributed (single VM), CentOS 6.6, CDH 5.4.3

Description

The issue occurs when I launch a Spark job (local mode) that fails. (My example failed because I tried to read a non-existent file.) When this occurs, the job fails and YARN ends up in an inconsistent state: the ResourceManager (RM) shows that the launcher job has completed, but a container for the job is still live on the slave node. Because I'm running in pseudo-distributed mode, this completely hangs my cluster: no other jobs can run because there are only enough resources for a single container, and that container is still held by the dead Oozie launcher.

If I wait long enough, YARN will eventually time out and release the container and start accepting new jobs. But until then I'm dead in the water.

I'm attaching screenshots that show the state right after running the failed job:
- the RM shows no jobs running
- the node shows one container running

I'm also attaching log files for the Oozie job and the container.
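
The mismatch described above (the RM reporting no running applications while the node still holds a live container) can also be checked without screenshots. The following is a minimal Python sketch, not part of the original report, that compares the two views using the YARN REST endpoints `/ws/v1/cluster/apps` on the ResourceManager and `/ws/v1/node/containers` on the NodeManager. The hosts and ports (`localhost:8088`, `localhost:8042`) are assumed defaults for a single-VM setup, and all helper names are hypothetical.

```python
import json
from urllib.request import Request, urlopen

# Assumed defaults for a single-VM, pseudo-distributed setup; adjust as needed.
RM_APPS_URL = "http://localhost:8088/ws/v1/cluster/apps?states=RUNNING"
NM_CONTAINERS_URL = "http://localhost:8042/ws/v1/node/containers"


def fetch_json(url):
    """GET a YARN REST endpoint and decode the JSON body."""
    req = Request(url, headers={"Accept": "application/json"})
    with urlopen(req, timeout=10) as resp:
        return json.load(resp)


def _as_list(value):
    """Normalize None, a single object, or a list into a list (defensive)."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def running_apps(rm_payload):
    """App entries from an RM apps response; an empty list if there are none."""
    apps = (rm_payload or {}).get("apps") or {}
    return _as_list(apps.get("app"))


def running_containers(nm_payload):
    """Container entries in state RUNNING from an NM containers response."""
    containers = (nm_payload or {}).get("containers") or {}
    return [c for c in _as_list(containers.get("container"))
            if c.get("state") == "RUNNING"]


def app_id_of(container_id):
    """Derive the application ID from a container ID.

    Handles both container_<clusterTs>_<appSeq>_... and the newer
    container_e<epoch>_<clusterTs>_<appSeq>_... layout.
    """
    parts = container_id.split("_")[1:]
    if parts and parts[0].startswith("e"):
        parts = parts[1:]  # drop the epoch field
    return "application_{}_{}".format(parts[0], parts[1])


def orphaned_containers(rm_payload, nm_payload):
    """Running containers whose application the RM no longer lists as running."""
    live_app_ids = {app["id"] for app in running_apps(rm_payload)}
    return [c for c in running_containers(nm_payload)
            if app_id_of(c["id"]) not in live_app_ids]
```

Calling `orphaned_containers(fetch_json(RM_APPS_URL), fetch_json(NM_CONTAINERS_URL))` right after the failed job would, if the state is as described here, be expected to return the leftover launcher container; an empty list would suggest the container has already been released.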

Attachments

- container-logs.txt, 07/Aug/15 13:46, 45 kB, Diana Carroll
- ooziejob-logs.txt, 07/Aug/15 13:46, 10 kB, Diana Carroll
- yarnbug1.png, 07/Aug/15 13:46, 150 kB, Diana Carroll
- yarnbug2.png, 07/Aug/15 13:46, 58 kB, Diana Carroll

People

Assignee: Satish Saley

Reporter: Diana Carroll

Votes: 0

Watchers: 2

Dates

Created: 07/Aug/15 13:37

Updated: 13/Apr/16 23:28