Details
Description
The IIMRUResultHandler Dispose() method isn't allowed to complete before the job finishes when run on YARN. When running locally, this isn't an issue. I have not tested a Mesos runtime.
My hunch is that YARN is killing the container before Dispose() finishes. If I had to guess, I think that the driver calls Dispose() is on the UpdateTaskHost (which calls it on the IIMRUResultHandler), and then the driver tells YARN that the entire job is done, and then YARN kills all the containers before they have a chance to dispose.
Whatever the case is, this looks like a race condition. I occasionally do see my Dispose() actions successfully complete.
-----------------------------------------
[1] Log showing that IIMRUResultHandler.Dispose() did not finish. Here Dispose() is a for loop that writes a number and waits a second. It is supposed to count to 100, but only counts till 9 before the logs end.
... INFO: receive a ReefMessage from Org.Apache.REEF.Common.Protobuf.ReefProtocol.REEFMessage Driver at socket://100.119.200.14:9570. Org.Apache.REEF.Common.Runtime.Evaluator.ReefMessageProtoObserver Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020 INFO: Control protobuf to remove context MasterRootContext Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020 INFO: Handle Evaluator control message: done_evaluator = null kill_evaluator = null stop_evaluator = null context_control = not null Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020 INFO: Send task control message to ContextManager Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextManager Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020 INFO: RemoveContext with id MasterRootContext Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextRuntime Warning: 0 : 2017-04-14T15:48:09.5616766+00:00 0020 WARNING: Shutting down an task because the underlying context is being closed. Org.Apache.REEF.Common.Runtime.Evaluator.Task.TaskRuntime Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020 INFO: Trying to close Task IMRUMaster-0 Org.Apache.REEF.Common.Runtime.Evaluator.Task.TaskRuntime Warning: 0 : 2017-04-14T15:48:09.5616766+00:00 0020 WARNING: Trying to close an task that is in Done state. Ignored. Org.Apache.REEF.Common.Context.Defaults.DefaultContextStopHandler Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020 INFO: DefaultContextStopHandler received for context: MasterRootContext Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020 INFO: Context stack is empty, done Org.Apache.REEF.Common.Runtime.Evaluator.HeartBeatManager Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020 INFO: Triggered a heartbeat: EvaluatorHeartbeatProto: task_id=[], task_status=[], task_message=[], evaluator_status=[DONE], context_status=[], time stamp=[1492184889561], recoveryFlag =[False]. Org.Apache.REEF.Common.Runtime.Evaluator.ReefMessageProtoObserver Information: 0 : 2017-04-14T15:48:09.5929225+00:00 0005 INFO: receive a ReefMessage from Org.Apache.REEF.Common.Protobuf.ReefProtocol.REEFMessage Driver at socket://100.119.200.14:9570. Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5929225+00:00 0005 INFO: Handle Evaluator control message: done_evaluator = not null kill_evaluator = null stop_evaluator = null context_control = null Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5929225+00:00 0005 INFO: Received ACK from Driver, shutting down Evaluator. Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorExitLogger Information: 0 : 2017-04-14T15:48:09.5929225+00:00 0001 INFO: The Evaluator has successfully exited. Org.Apache.REEF.Evaluator.Evaluator Information: 0 : 2017-04-14T15:48:09.5929225+00:00 0001 INFO: Evaluator is returned from Run() [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:10.0460051+00:00 0014 INFO: IIMRUResultHandler Dispose Countdown: 1 [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:11.0615866+00:00 0014 INFO: IIMRUResultHandler Dispose Countdown: 2 [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:12.0771779+00:00 0014 INFO: IIMRUResultHandler Dispose Countdown: 3 [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:13.0927401+00:00 0014 INFO: IIMRUResultHandler Dispose Countdown: 4 [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:14.1083243+00:00 0014 INFO: IIMRUResultHandler Dispose Countdown: 5 [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:15.1239021+00:00 0014 INFO: IIMRUResultHandler Dispose Countdown: 6 [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:16.1394801+00:00 0014 INFO: IIMRUResultHandler Dispose Countdown: 7 [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:17.1550601+00:00 0014 INFO: IIMRUResultHandler Dispose Countdown: 8 [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:18.1706423+00:00 0014 INFO: IIMRUResultHandler Dispose Countdown: 9 End of LogType:evaluator.stdout
Attachments
Issue Links
- links to