Details
Bug
Status: Open
Major
Resolution: Unresolved
None
None
YARN 2.7.3+
Description
A REEF-on-REEF application runs on YARN, and the inner application completes successfully; however, the host application's driver closes prematurely and is reported with FAILED/FAILED status in the ResourceManager (RM):
$ yarn application -list -appStates ALL
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1492554568254_0013 REEF-on-REEF:host YARN hadoop root.hadoop FAILED FAILED 100% http://cisl-linux-070:8088/cluster/app/application_1492554568254_0013
application_1492554568254_0014 REEF-on-REEF:hello YARN hadoop root.hadoop FINISHED SUCCEEDED 100% N/A
Most likely, this happens because, on completion, the inner application closes some resources that either belong to the host app or are shared with it.
Here's a fragment of the driver log:
2017-04-18 19:15:52,332 INFO reef.examples.reefonreef.ReefOnReefDriver.onNext main | REEF-on-REEF inner job application_1492554568254_0014 completed: state DONE
2017-04-18 19:15:52,332 FINER reef.runtime.common.REEFEnvironment.close main | ENTRY
2017-04-18 19:15:52,332 FINER reef.wake.time.runtime.RuntimeClock.close main | ENTRY
2017-04-18 19:15:52,332 FINER reef.wake.time.runtime.RuntimeClock.close main | RETURN Clock has already been closed
2017-04-18 19:15:52,332 FINER reef.runtime.common.launch.REEFErrorHandler.close main | ENTRY
2017-04-18 19:15:52,332 FINER reef.runtime.common.utils.RemoteManager.close main | ENTRY
2017-04-18 19:15:52,332 FINE reef.wake.remote.impl.DefaultRemoteManagerImplementation.close main | RemoteManager: REEF_UNMANAGED_DRIVER Closing remote manager id: socket://10.200.91.65:16952
2017-04-18 19:15:52,332 FINE reef.wake.remote.impl.DefaultRemoteManagerImplementation.close main | RemoteManager: REEF_UNMANAGED_DRIVER already closed
2017-04-18 19:15:52,332 FINER reef.runtime.common.utils.RemoteManager.close main | RETURN
2017-04-18 19:15:52,332 FINER reef.runtime.common.launch.REEFErrorHandler.close main | RETURN
2017-04-18 19:15:52,332 FINER reef.runtime.common.REEFEnvironment.close main | RETURN
2017-04-18 19:15:52,332 INFO reef.examples.reefonreef.ReefOnReefDriver.onNext main | REEF-on-REEF host job REEF-on-REEF:host completed: inner app application_1492554568254_0014 status SUBMITTED
That is, some driver resources have already been closed by the time the inner app finishes.
Another good test for that behavior would be running two inner applications in Unmanaged AM mode sequentially from the same host driver.
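To make the suspected failure mode and the proposed sequential test concrete, here is a minimal, self-contained Java sketch. It does not use the REEF API: SharedResource, InnerEnvironment and runSequentially are hypothetical stand-ins for a host-owned resource (such as a clock or remote manager), an inner application that closes whatever it was handed, and a host that launches inner runs one after another. It only illustrates the assumption that the inner run closes a resource it shares with the host.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SharedCloseSketch {

  /** Hypothetical stand-in for a host-owned resource (not a REEF class). */
  static final class SharedResource implements AutoCloseable {
    private final AtomicBoolean closed = new AtomicBoolean(false);

    boolean isClosed() {
      return closed.get();
    }

    @Override
    public void close() {
      // compareAndSet makes close() idempotent: only the first call flips the flag.
      if (!closed.compareAndSet(false, true)) {
        System.out.println("already closed");
      }
    }
  }

  /** Hypothetical inner environment that closes the resource it was given. */
  static final class InnerEnvironment implements AutoCloseable {
    private final SharedResource resource;

    InnerEnvironment(final SharedResource resource) {
      this.resource = resource;
    }

    void run() {
      if (resource.isClosed()) {
        throw new IllegalStateException("shared resource closed before inner run");
      }
    }

    @Override
    public void close() {
      // Assumed bug: the inner environment closes a resource it does not own.
      resource.close();
    }
  }

  /** Runs n inner environments one after another; returns how many ran. */
  static int runSequentially(final SharedResource shared, final int n) {
    int completed = 0;
    for (int i = 0; i < n; i++) {
      try (InnerEnvironment inner = new InnerEnvironment(shared)) {
        inner.run();
        completed++;
      } catch (IllegalStateException e) {
        break; // a later run found the shared resource already closed
      }
    }
    return completed;
  }

  public static void main(final String[] args) {
    final SharedResource shared = new SharedResource();
    final int completed = runSequentially(shared, 2);
    System.out.println("completed inner runs: " + completed);
    System.out.println("host resource closed: " + shared.isClosed());
  }
}
```

Under these assumptions only the first of the two inner runs completes, and the host's resource is left closed afterwards. A real test against REEF would check the same two things: that the second inner application still launches, and that the host driver's resources remain open after the first one finishes.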