Uploaded image for project: 'REEF (Retired)'
  1. REEF (Retired)
  2. REEF-1778

The IIMRUResultHandler Dispose() method isn't allowed to complete before the job finishes.

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 0.16
    • IMRU, REEF.NET
    • None
    • YARN. Does not affect the local runtime.

    Description

      The IIMRUResultHandler Dispose() method isn't allowed to complete before the job finishes when run on YARN. When running locally, this isn't an issue. I have not tested a Mesos runtime.

      My hunch is that YARN is killing the container before Dispose() finishes. If I had to guess, I think that the driver calls Dispose() is on the UpdateTaskHost (which calls it on the IIMRUResultHandler), and then the driver tells YARN that the entire job is done, and then YARN kills all the containers before they have a chance to dispose.

      Whatever the case is, this looks like a race condition. I occasionally do see my Dispose() actions successfully complete.

      -----------------------------------------
      [1] Log showing that IIMRUResultHandler.Dispose() did not finish. Here Dispose() is a for loop that writes a number and waits a second. It is supposed to count to 100, but only counts till 9 before the logs end.

      ...
      INFO: receive a ReefMessage from Org.Apache.REEF.Common.Protobuf.ReefProtocol.REEFMessage Driver at socket://100.119.200.14:9570.
      Org.Apache.REEF.Common.Runtime.Evaluator.ReefMessageProtoObserver Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
      INFO: Control protobuf to remove context MasterRootContext
      Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
      INFO: Handle Evaluator control message:  done_evaluator = null kill_evaluator = null stop_evaluator = null context_control = not null
      Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
      INFO: Send task control message to ContextManager
      Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextManager Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
      INFO: RemoveContext with id MasterRootContext
      Org.Apache.REEF.Common.Runtime.Evaluator.Context.ContextRuntime Warning: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
      WARNING: Shutting down an task because the underlying context is being closed.
      Org.Apache.REEF.Common.Runtime.Evaluator.Task.TaskRuntime Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
      INFO: Trying to close Task IMRUMaster-0
      Org.Apache.REEF.Common.Runtime.Evaluator.Task.TaskRuntime Warning: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
      WARNING: Trying to close an task that is in Done state. Ignored.
      Org.Apache.REEF.Common.Context.Defaults.DefaultContextStopHandler Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
      INFO: DefaultContextStopHandler received for context: MasterRootContext
      Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
      INFO: Context stack is empty, done
      Org.Apache.REEF.Common.Runtime.Evaluator.HeartBeatManager Information: 0 : 2017-04-14T15:48:09.5616766+00:00 0020
      INFO: Triggered a heartbeat: EvaluatorHeartbeatProto: task_id=[], task_status=[], task_message=[], evaluator_status=[DONE], context_status=[], time
      stamp=[1492184889561], recoveryFlag =[False].
      Org.Apache.REEF.Common.Runtime.Evaluator.ReefMessageProtoObserver Information: 0 : 2017-04-14T15:48:09.5929225+00:00 0005
      INFO: receive a ReefMessage from Org.Apache.REEF.Common.Protobuf.ReefProtocol.REEFMessage Driver at socket://100.119.200.14:9570.
      Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5929225+00:00 0005
      INFO: Handle Evaluator control message:  done_evaluator = not null kill_evaluator = null stop_evaluator = null context_control = null
      Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Information: 0 : 2017-04-14T15:48:09.5929225+00:00 0005
      INFO: Received ACK from Driver, shutting down Evaluator.
      Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorExitLogger Information: 0 : 2017-04-14T15:48:09.5929225+00:00 0001
      INFO: The Evaluator has successfully exited.
      Org.Apache.REEF.Evaluator.Evaluator Information: 0 : 2017-04-14T15:48:09.5929225+00:00 0001
      INFO: Evaluator is returned from Run()
      [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:10.0460051+00:00 0014
      INFO: IIMRUResultHandler Dispose Countdown: 1
      [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:11.0615866+00:00 0014
      INFO: IIMRUResultHandler Dispose Countdown: 2
      [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:12.0771779+00:00 0014
      INFO: IIMRUResultHandler Dispose Countdown: 3
      [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:13.0927401+00:00 0014
      INFO: IIMRUResultHandler Dispose Countdown: 4
      [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:14.1083243+00:00 0014
      INFO: IIMRUResultHandler Dispose Countdown: 5
      [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:15.1239021+00:00 0014
      INFO: IIMRUResultHandler Dispose Countdown: 6
      [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:16.1394801+00:00 0014
      INFO: IIMRUResultHandler Dispose Countdown: 7
      [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:17.1550601+00:00 0014
      INFO: IIMRUResultHandler Dispose Countdown: 8
      [name-of-class-deleted]`1 Information: 0 : 2017-04-14T15:48:18.1706423+00:00 0014
      INFO: IIMRUResultHandler Dispose Countdown: 9
      End of LogType:evaluator.stdout
      

      Attachments

        Issue Links

          Activity

            People

              markus.weimer Markus Weimer
              rogan Rogan Carr
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: